-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: connections can be leaked leaving clients blocked on Connect() #41521
Comments
Some more words: the Line 346 in 44b39ee
because nobody pings initialHeartbeatDone . ctx.runHeartbeat() is supposed to close that channel, but it might never do so because it races with gRPC calling into the onlyOnceDialer a 2nd time - which closes redialChan and so the following early return is taken:Line 1051 in 44b39ee
@bdarnell can you please verify that we're not missing something. This can result in a node being completely stalled. In a customer cluster, we've seen a leaseholder's attempts to dial two other replicas leak this way when invoked by the Raft transport, which made any further attempt by the node to propose any commands to either of those two replicas futile (for all ranges shared with those replicas). |
I don't think you're missing anything. If the heartbeat loop doesn't start we still need to do something to signal the waiting Connect thread. |
The patch in #41533 should fix it. I'd like to write a unit test which reproduces the rapid redial, at least under stress/stressrace. I've tried creating a client and server context and then either immediately or rapidly after connection on the server side close the underlying conn with a bunch of concurrent goroutines calling connect. In brief testing so far it seems like gRPC is not just redialing in this case. I'll keep working on a test. I'm curious if gRPC has changed its behavior with regards to this backoff in a way that makes this bug much less likely/hard to repro without injecting some behavior. This patch seems relevant: |
@ajwerner just checking if you are still planning to write any tests before closing this issue. |
I really want to. I spent a few hours trying to write a test that would hit this on current master and didn't succeed and then moved on. I'll look at revisiting this soon. Perhaps I should try to just merge the patch and backport it leaving a test as a TODO. It seems that gRPC has a backoff now (well they had it before but a bug prevented it from being honored) so it's hard to get a redial in the time it takes from launching a goroutine to hitting the select in question. |
@ajwerner Would it be possible to make a call on whether this patch is safe to merge and backport without a unittest? Perhaps a manual test by inserting strategic sleeps in some places? Do you think this backport is safe and will fix the problem for us if we merge it in our internal branch? |
I feel pretty confident that this is safe. Sorry I haven’t pushed it over the finish line. |
Before this patch callers would be left blocked on connect forever. Fixes cockroachdb#41521. Release note (bug fix): Fix bug whereby rapid network disconnections can lead to cluster unavailability because goroutines remain blocked waiting for a connection which will never be initialized to send its first heartbeat.
41533: rpc: notify callers to Connect when redial occurs before first heartbeat r=andreimatei a=ajwerner Before this patch callers would be left blocked on connect forever. Fixes #41521. Release note: None Co-authored-by: Andrew Werner <[email protected]>
Before this patch callers would be left blocked on connect forever. Fixes cockroachdb#41521. Release note (bug fix): Fix bug whereby rapid network disconnections can lead to cluster unavailability because goroutines remain blocked waiting for a connection which will never be initialized to send its first heartbeat.
Before this patch callers would be left blocked on connect forever. Fixes cockroachdb#41521. Release note (bug fix): Fix bug whereby rapid network disconnections can lead to cluster unavailability because goroutines remain blocked waiting for a connection which will never be initialized to send its first heartbeat.
Before this patch callers would be left blocked on connect forever. Fixes cockroachdb#41521. Release note (bug fix): Fix bug whereby rapid network disconnections can lead to cluster unavailability because goroutines remain blocked waiting for a connection which will never be initialized to send its first heartbeat.
Describe the problem
Calls to
rpc.Connection.Connect()
block until the first heartbeat has completed (either succeeded or failed). The rpc.Context.runHeartbeat loop when gRPC tries to redial the connection (leading to a send onredialChan
). The hazard is if gRPC sends on thatredialChan
beforerunHeartbeat
ever sends its first heartbeat. In this case, clients can get stuck inConnect()
forever.The text was updated successfully, but these errors were encountered: