server: KV writes in startup path sensitive to circuit breaker errors #74714
Comments
Randomly saw one instance of this on #82109
Easily reproducible with the following use case:
With a bit of luck, the node will fail to start. The failure can happen in different places. The reason is that the cluster has only 3 out of 5 replicas on system ranges, so when the third node is killed, it loses quorum. Within 60 seconds, this can trip the circuit breaker on the remaining leaseholder because some range needed for startup is unavailable. When the node is restarted, it tries to write to a range for which it is itself part of the quorum, but the circuit breaker rejects the operation immediately without giving the node a chance to participate in consensus.
I think we'll have to add retries with exponential backoff for all KV reads/writes on the node startup path that would otherwise cause the node to error out. We should get this fixed asap, since this issue exists in 22.1 onwards, and will be hit by any cluster that temporarily loses quorum on a system range.
For posterity, here is a demonstration of how it fails with a simple test:
Would it be more straightforward to add an "ignore circuit breakers" flag to the KV write API, instead of introducing retry loops at each caller?
A lot of these accesses are SQL queries.
I saw a roachtest [1] fail with the following error:
I think this is the same issue, but let me know and I can create a separate one. Incidentally, should we make this a GA blocker?
That does seem similar; it's just a different error bubbling up. @aliher1911 I think your PR currently retries only circuit breaker errors, but we also always need to handle ambiguous results, right? They are a bit harder because the operation may or may not have completed, so there are questions about idempotency.
I added handling of ambiguous errors there. I'll double check the side effects of retries there. |
Is your feature request related to a problem? Please describe.
There are (potentially many) synchronous KV writes in the server start-up path.
They technically always needed to be prepared to handle errors such as AmbiguousResultError; however, in practice these rarely happen.
With the introduction of per-replica circuit breakers, and when restarting nodes while the cluster is under stress or at least partially unavailable, situations where these errors bubble up (which is what they are supposed to do) could be more frequent and could lead to failures to start a node. In a hypothetical extreme case, a failure to start a sufficient number of nodes could be the very reason for the outage, resulting in a situation that could only be resolved by disabling the circuit breakers altogether (which is possible via an env var; there's a cluster setting too, but it isn't going to work if the cluster is unavailable).
Describe the solution you'd like
Audit the start path and make sure that all KV uses can react appropriately to circuit breaker and ambiguous result errors.
Jira issue: CRDB-12227
gz#16003