server: draining hangs when quorum is lost #14620
Comments
This is due to the last node attempting to update its node liveness and not being able to, thus retrying indefinitely. To work around this for now, sending a
… lost added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated
Fixes cockroachdb#14620 added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated
fix indefinite retrying for `cockroach quit` when quorum is lost #14620
Reopening because although #14708 fixed this particular issue, the server should still time out potentially infinitely retryable writes while draining.
`cockroach quit` once quorum is lost
@asubiotto is this fixed? It appears fixed to me. After approximately 30 seconds, the final nodes give up and quit when issued the quit command. I've been using this pattern reliably for months now. Closing this issue, please reopen if I'm missing something.
The remaining work was to time out the liveness update (the quit command forces a shutdown after a minute) to proceed with draining leases. However, this is not completely necessary. It's a small change that I'll probably get to for 2.0 so I'll reopen.
Actually, thinking about this more I'm not sure that going forward with canceling a node liveness update is the way to go. Timeouts are implemented by users (as you pointed out) from a higher level and we never want to sacrifice correctness for a quicker drain. Closing this issue.
This isn't an issue for production clusters, which will be upgraded in a rolling fashion, but it is a usability issue for quick test clusters.
Once you lose quorum, the remaining nodes can't be shut down with `cockroach quit`. Instead, you need to do a force kill. Note that the third node never quits. Here's what you see toward the end of the logs: