rpc: connections can be leaked leaving clients blocked on Connect() #41521

Closed
ajwerner opened this issue Oct 10, 2019 · 7 comments · Fixed by #41533

Labels: S-1 High impact: many users impacted, serious risk of high unavailability or data loss

@ajwerner (Contributor)

Describe the problem

Calls to rpc.Connection.Connect() block until the first heartbeat has completed (either succeeded or failed). The rpc.Context.runHeartbeat loop exits when gRPC tries to redial the connection (which is signaled via redialChan). The hazard is that gRPC can signal redialChan before runHeartbeat ever sends its first heartbeat; in that case, clients can get stuck in Connect() forever.
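
To make the ordering concrete, here is a condensed, runnable sketch of the race (initialHeartbeatDone and redialChan are the real channel names; everything else is simplified and hypothetical):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    type Connection struct {
        initialHeartbeatDone chan struct{} // closed once the first heartbeat succeeds or fails
    }

    // Connect blocks until the first heartbeat has been attempted.
    func (c *Connection) Connect(ctx context.Context) error {
        select {
        case <-c.initialHeartbeatDone:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    // runHeartbeat mimics the buggy shape: if a redial is already pending,
    // it returns before the first heartbeat, and initialHeartbeatDone is
    // never closed.
    func runHeartbeat(c *Connection, redialChan <-chan struct{}) {
        select {
        case <-redialChan:
            return // BUG: leaves every Connect caller blocked
        default:
        }
        close(c.initialHeartbeatDone) // first heartbeat "completed"
    }

    func main() {
        c := &Connection{initialHeartbeatDone: make(chan struct{})}
        redialChan := make(chan struct{})
        close(redialChan) // gRPC redials before the first heartbeat fires

        go runHeartbeat(c, redialChan)

        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        fmt.Println("Connect:", c.Connect(ctx)) // prints context deadline exceeded
    }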

@ajwerner ajwerner self-assigned this Oct 10, 2019
@andreimatei (Contributor)

Some more words: the Connection.Connect() call blocks forever here:

    select {

because nobody closes initialHeartbeatDone. ctx.runHeartbeat() is supposed to close that channel, but it might never do so, because it races with gRPC calling into the onlyOnceDialer a second time, which closes redialChan, and so the following early return is taken:

    return grpcutil.ErrCannotReuseClientConn
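
For context, the dialer side of the race looks roughly like this (a paraphrase of the mechanism described above, not the exact source; grpcutil.ErrCannotReuseClientConn is the real error):

    // Paraphrase of the onlyOnceDialer mechanism; not the exact source.
    type onlyOnceDialer struct {
        mu         sync.Mutex
        dialed     bool          // set on the first dial
        closed     bool          // guards against closing redialChan twice
        redialChan chan struct{} // closed to tell runHeartbeat to shut down
    }

    func (d *onlyOnceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
        d.mu.Lock()
        defer d.mu.Unlock()
        if d.dialed {
            // Second dial attempt: signal the heartbeat loop and refuse
            // to reuse this ClientConn.
            if !d.closed {
                d.closed = true
                close(d.redialChan)
            }
            return nil, grpcutil.ErrCannotReuseClientConn
        }
        d.dialed = true
        var dialer net.Dialer
        return dialer.DialContext(ctx, "tcp", addr)
    }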

@bdarnell can you please verify that we're not missing something.

This can result in a node being completely stalled. In a customer cluster, we've seen a leaseholder's attempts to dial two other replicas leak this way when invoked by the Raft transport, which made any further attempt by the node to propose any commands to either of those two replicas futile (for all ranges shared with those replicas).

cc @nvanbenschoten

@andreimatei andreimatei added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Oct 10, 2019
@bdarnell (Contributor)

I don't think you're missing anything. If the heartbeat loop doesn't start, we still need to do something to signal the waiting Connect thread.
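
Concretely, the fix needs something like the following shape, continuing the hypothetical sketch above (this is the idea only, not the actual patch in #41533):

    // Whichever way runHeartbeat exits, the first-heartbeat channel must
    // be closed so that Connect callers are unblocked.
    func runHeartbeat(c *Connection, redialChan <-chan struct{}, heartbeat func() error) {
        first := true
        finishFirst := func(err error) {
            if first {
                first = false
                c.heartbeatErr = err // hypothetical field that Connect surfaces
                close(c.initialHeartbeatDone)
            }
        }
        // Runs even on the early-return path below; a no-op if the first
        // heartbeat already completed.
        defer finishFirst(grpcutil.ErrCannotReuseClientConn)
        for {
            select {
            case <-redialChan:
                return // deferred finishFirst unblocks waiters with an error
            case <-time.After(time.Second): // hypothetical heartbeat interval
                finishFirst(heartbeat())
            }
        }
    }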

@ajwerner (Contributor, Author)

The patch in #41533 should fix it.

I'd like to write a unit test which reproduces the rapid redial, at least under stress/stressrace. I've tried creating a client and a server context and then, either immediately or shortly after the connection is established, closing the underlying conn on the server side while a bunch of concurrent goroutines call Connect(). In brief testing so far, it seems that gRPC does not simply redial in this case. I'll keep working on a test.
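
For the record, the repro attempts have the following shape (every helper here is a hypothetical stand-in for the rpc test harness; as noted above, this does not currently trigger the bug):

    // Hypothetical stress-test shape: race many Connect calls against a
    // server-side connection close; blocking past the deadline is the bug.
    func TestConnectUnblocksOnEarlyRedial(t *testing.T) {
        c, closeServerConn := dialTestConnection(t) // hypothetical helper
        var wg sync.WaitGroup
        for i := 0; i < 16; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
                defer cancel()
                // Success or a prompt error are both acceptable; only a
                // hang past the deadline indicates the leak.
                if err := c.Connect(ctx); errors.Is(err, context.DeadlineExceeded) {
                    t.Error("Connect leaked: still blocked at deadline")
                }
            }()
        }
        closeServerConn() // force gRPC to redial as early as possible
        wg.Wait()
    }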

I'm curious whether gRPC has changed its behavior with regard to this backoff in a way that makes this bug much less likely and hard to reproduce without injecting some behavior. This patch seems relevant:
grpc/grpc-go#2740

@sidds commented Oct 25, 2019

@ajwerner just checking if you are still planning to write any tests before closing this issue.

@ajwerner (Contributor, Author)

@ajwerner just checking if you are still planning to write any tests before closing this issue.

I really want to. I spent a few hours trying to write a test that would hit this on current master and didn't succeed, then moved on. I'll look at revisiting this soon. Perhaps I should just merge the patch and backport it, leaving a test as a TODO.

It seems that gRPC has a backoff now (well, they had it before, but a bug prevented it from being honored), so it's hard to get a redial in the time it takes to go from launching a goroutine to hitting the select in question.
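
For reference, the redial backoff is configurable in grpc-go; a minimal sketch using today's API (grpc.WithConnectParams postdates the version discussed here, and the values shown are just the library defaults):

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/backoff"
    )

    // With backoff honored, a second dial attempt arrives no sooner than
    // BaseDelay after the first fails, which makes the pre-first-heartbeat
    // redial window very hard to hit in a test.
    func dialWithBackoff(addr string) (*grpc.ClientConn, error) {
        return grpc.Dial(addr,
            grpc.WithInsecure(), // plaintext, for illustration only
            grpc.WithConnectParams(grpc.ConnectParams{
                Backoff: backoff.Config{
                    BaseDelay:  1 * time.Second,
                    Multiplier: 1.6,
                    Jitter:     0.2,
                    MaxDelay:   2 * time.Minute,
                },
            }),
        )
    }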

@garvitjuniwal (Contributor)

@ajwerner Would it be possible to make a call on whether this patch is safe to merge and backport without a unit test? Perhaps a manual test by inserting strategic sleeps in some places? Do you think this backport is safe and will fix the problem for us if we merge it into our internal branch?

@ajwerner (Contributor, Author) commented Nov 1, 2019

I feel pretty confident that this is safe. Sorry I haven’t pushed it over the finish line.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 4, 2019
Before this patch callers would be left blocked on connect forever.

Fixes cockroachdb#41521.

Release note (bug fix): Fix bug whereby rapid network disconnections can
lead to cluster unavailability because goroutines remain blocked waiting
for a connection which will never be initialized to send its first
heartbeat.
craig bot pushed a commit that referenced this issue Nov 4, 2019
41533: rpc: notify callers to Connect when redial occurs before first heartbeat r=andreimatei a=ajwerner

Before this patch callers would be left blocked on connect forever.

Fixes #41521.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in 1d3c9de Nov 4, 2019
ajwerner added commits to ajwerner/cockroach that referenced this issue on Nov 5, 2019 (twice) and on Jan 8, 2020, each with the same commit message as the Nov 4 commit above.