rpc: connections can be leaked leaving clients blocked on Connect() #41521

Closed
ajwerner opened this issue Oct 10, 2019 · 7 comments · Fixed by #41533

Labels: S-1 High impact: many users impacted, serious risk of high unavailability or data loss

@ajwerner (Contributor)

Describe the problem

Calls to rpc.Connection.Connect() block until the first heartbeat has completed (either succeeded or failed). The rpc.Context.runHeartbeat loop exits when gRPC tries to redial the connection (which is signaled via redialChan). The hazard is that gRPC can signal redialChan before runHeartbeat ever sends its first heartbeat; in that case, clients can get stuck in Connect() forever.
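
To make the ordering concrete, here is a condensed, runnable sketch of the race (initialHeartbeatDone and redialChan are the real channel names; everything else is simplified and hypothetical):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    type Connection struct {
        initialHeartbeatDone chan struct{} // closed once the first heartbeat succeeds or fails
    }

    // Connect blocks until the first heartbeat has been attempted.
    func (c *Connection) Connect(ctx context.Context) error {
        select {
        case <-c.initialHeartbeatDone:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    // runHeartbeat mimics the buggy shape: if a redial is already pending,
    // it returns before the first heartbeat, and initialHeartbeatDone is
    // never closed.
    func runHeartbeat(c *Connection, redialChan <-chan struct{}) {
        select {
        case <-redialChan:
            return // BUG: leaves every Connect caller blocked
        default:
        }
        close(c.initialHeartbeatDone) // first heartbeat "completed"
    }

    func main() {
        c := &Connection{initialHeartbeatDone: make(chan struct{})}
        redialChan := make(chan struct{})
        close(redialChan) // gRPC redials before the first heartbeat fires

        go runHeartbeat(c, redialChan)

        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        fmt.Println("Connect:", c.Connect(ctx)) // prints context deadline exceeded
    }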

@ajwerner ajwerner self-assigned this Oct 10, 2019
@andreimatei (Contributor)

Some more words: the Connection.Connect() call blocks forever here:

    select {

because nobody closes initialHeartbeatDone. ctx.runHeartbeat() is supposed to close that channel, but it might never do so, because it races with gRPC calling into the onlyOnceDialer a second time, which closes redialChan, and so the following early return is taken:

    return grpcutil.ErrCannotReuseClientConn
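
For context, the dialer side of the race looks roughly like this (a paraphrase of the mechanism described above, not the exact source; grpcutil.ErrCannotReuseClientConn is the real error):

    // Paraphrase of the onlyOnceDialer mechanism; not the exact source.
    type onlyOnceDialer struct {
        mu         sync.Mutex
        dialed     bool          // set on the first dial
        closed     bool          // guards against closing redialChan twice
        redialChan chan struct{} // closed to tell runHeartbeat to shut down
    }

    func (d *onlyOnceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
        d.mu.Lock()
        defer d.mu.Unlock()
        if d.dialed {
            // Second dial attempt: signal the heartbeat loop and refuse
            // to reuse this ClientConn.
            if !d.closed {
                d.closed = true
                close(d.redialChan)
            }
            return nil, grpcutil.ErrCannotReuseClientConn
        }
        d.dialed = true
        var dialer net.Dialer
        return dialer.DialContext(ctx, "tcp", addr)
    }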

@bdarnell can you please verify that we're not missing something.

This can result in a node being completely stalled. In a customer cluster, we've seen a leaseholder's attempts to dial two other replicas leak this way when invoked by the Raft transport, which made any further attempt by the node to propose any commands to either of those two replicas futile (for all ranges shared with those replicas).

cc @nvanbenschoten

@andreimatei andreimatei added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Oct 10, 2019
@bdarnell (Contributor)

I don't think you're missing anything. If the heartbeat loop doesn't start, we still need to do something to signal the waiting Connect thread.
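
Concretely, the fix needs something like the following shape, continuing the hypothetical sketch above (this is the idea only, not the actual patch in #41533):

    // Whichever way runHeartbeat exits, the first-heartbeat channel must
    // be closed so that Connect callers are unblocked.
    func runHeartbeat(c *Connection, redialChan <-chan struct{}, heartbeat func() error) {
        first := true
        finishFirst := func(err error) {
            if first {
                first = false
                c.heartbeatErr = err // hypothetical field that Connect surfaces
                close(c.initialHeartbeatDone)
            }
        }
        // Runs even on the early-return path below; a no-op if the first
        // heartbeat already completed.
        defer finishFirst(grpcutil.ErrCannotReuseClientConn)
        for {
            select {
            case <-redialChan:
                return // deferred finishFirst unblocks waiters with an error
            case <-time.After(time.Second): // hypothetical heartbeat interval
                finishFirst(heartbeat())
            }
        }
    }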

@ajwerner (Contributor, Author)

The patch in #41533 should fix it.

I'd like to write a unit test which reproduces the rapid redial, at least under stress/stressrace. I've tried creating a client and a server context and then, either immediately or shortly after the connection is established, closing the underlying conn on the server side while a bunch of concurrent goroutines call Connect(). In brief testing so far, it seems that gRPC does not simply redial in this case. I'll keep working on a test.
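
For the record, the repro attempts have the following shape (every helper here is a hypothetical stand-in for the rpc test harness; as noted above, this does not currently trigger the bug):

    // Hypothetical stress-test shape: race many Connect calls against a
    // server-side connection close; blocking past the deadline is the bug.
    func TestConnectUnblocksOnEarlyRedial(t *testing.T) {
        c, closeServerConn := dialTestConnection(t) // hypothetical helper
        var wg sync.WaitGroup
        for i := 0; i < 16; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
                defer cancel()
                // Success or a prompt error are both acceptable; only a
                // hang past the deadline indicates the leak.
                if err := c.Connect(ctx); errors.Is(err, context.DeadlineExceeded) {
                    t.Error("Connect leaked: still blocked at deadline")
                }
            }()
        }
        closeServerConn() // force gRPC to redial as early as possible
        wg.Wait()
    }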

I'm curious whether gRPC has changed its behavior with regard to this backoff in a way that makes this bug much less likely and hard to reproduce without injecting some behavior. This patch seems relevant:
grpc/grpc-go#2740

@sidds commented Oct 25, 2019

@ajwerner just checking if you are still planning to write any tests before closing this issue.

@ajwerner (Contributor, Author)

@ajwerner just checking if you are still planning to write any tests before closing this issue.

I really want to. I spent a few hours trying to write a test that would hit this on current master and didn't succeed, then moved on. I'll look at revisiting this soon. Perhaps I should just merge the patch and backport it, leaving a test as a TODO.

It seems that gRPC has a backoff now (well, they had it before, but a bug prevented it from being honored), so it's hard to get a redial in the time it takes to go from launching a goroutine to hitting the select in question.
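
For reference, the redial backoff is configurable in grpc-go; a minimal sketch using today's API (grpc.WithConnectParams postdates the version discussed here, and the values shown are just the library defaults):

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/backoff"
    )

    // With backoff honored, a second dial attempt arrives no sooner than
    // BaseDelay after the first fails, which makes the pre-first-heartbeat
    // redial window very hard to hit in a test.
    func dialWithBackoff(addr string) (*grpc.ClientConn, error) {
        return grpc.Dial(addr,
            grpc.WithInsecure(), // plaintext, for illustration only
            grpc.WithConnectParams(grpc.ConnectParams{
                Backoff: backoff.Config{
                    BaseDelay:  1 * time.Second,
                    Multiplier: 1.6,
                    Jitter:     0.2,
                    MaxDelay:   2 * time.Minute,
                },
            }),
        )
    }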

@garvitjuniwal (Contributor)

@ajwerner Would it be possible to make a call on whether this patch is safe to merge and backport without a unit test? Perhaps a manual test by inserting strategic sleeps in some places? Do you think this backport is safe and will fix the problem for us if we merge it into our internal branch?

@ajwerner (Contributor, Author) commented Nov 1, 2019

I feel pretty confident that this is safe. Sorry I haven’t pushed it over the finish line.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 4, 2019
Before this patch callers would be left blocked on connect forever.

Fixes cockroachdb#41521.

Release note (bug fix): Fix bug whereby rapid network disconnections can
lead to cluster unavailability because goroutines remain blocked waiting
for a connection which will never be initialized to send its first
heartbeat.
craig bot pushed a commit that referenced this issue Nov 4, 2019
41533: rpc: notify callers to Connect when redial occurs before first heartbeat r=andreimatei a=ajwerner

Before this patch callers would be left blocked on connect forever.

Fixes #41521.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in 1d3c9de Nov 4, 2019
ajwerner added commits to ajwerner/cockroach that referenced this issue on Nov 5, 2019 (twice) and on Jan 8, 2020, each with the same commit message as the Nov 4 commit above.