Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: KV writes in startup path sensitive to circuit breaker errors #74714

Closed
tbg opened this issue Jan 11, 2022 · 9 comments · Fixed by #97710
Closed

server: KV writes in startup path sensitive to circuit breaker errors #74714

tbg opened this issue Jan 11, 2022 · 9 comments · Fixed by #97710
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Jan 11, 2022

Is your feature request related to a problem? Please describe.

There are (potentially many) synchronous KV writes in the server start-up path.
They technically always needed to be prepared to handle errors such as AmbiguousResultError, however in practice these rarely happen.

With the introduction of per-replica circuit breakers, and when restarting nodes while the cluster is under stress or at least partially unavailable, situations where these errors bubble up (which is what they are supposed to do) could be more frequent and could lead to failures to start node. In a hypothetical extreme case, a failure to start a sufficient number of nodes could be the very reason for the outage, thus resulting in a situation that could only be resolved by disabling the circuit breakers altogether (which is possible via an env var; there's a cluster setting too but this isn't going to work if the cluster is unavailable).

Describe the solution you'd like

Audit the start path and make sure that all KV uses can react appropriately to circuit breaker and ambiguous result errors.

Jira issue: CRDB-12227

gz#16003

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 11, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Jan 11, 2022
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Jan 11, 2022
@tbg tbg self-assigned this Jan 14, 2022
@tbg tbg removed their assignment May 31, 2022
@tbg
Copy link
Member Author

tbg commented Jun 8, 2022

Randomly saw one instance of this on #82109

* ERROR: ERROR: cockroach server exited with error: error recording initial status summaries: replica unavailable: (n3,s3):3 unable to serve request to r3:/System/{NodeLivenessMax-tsd} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1654716012.232856725,0 (2022-06-08 19:20:12); raft status: {"id":"3","term":44,"vote":"3","commit":52822,"lead":"3","raftState":"StateLeader","applied":52822,"progress":{"1":{"match":0,"next":47625,"state":"StateSnapshot"},"2":{"match":0,"next":52823,"state":"StateProbe"},"3":{"match":52963,"next":52964,"state":"StateReplicate"}},"leadtransferee":"0"}: have been waiting 62.20s for slow proposal RequestLease [/System/NodeLivenessMax,/Min)
* ```

@aliher1911
Copy link
Contributor

aliher1911 commented Feb 14, 2023

Easily reproducible with the following use case:

  1. Start 5 node cluster
  2. Kill 2 nodes
  3. Kill another node
  4. Wait for 60+ seconds
  5. Restart last killed node

With a bit of luck node will fail to start. Failure could happen in different places. The reason being cluster only has 3 out of 5 replicas on system ranges. When 3rd node is killed it loses quorum. Within 60 seconds this could trigger circuit breaker on unavailability of some range needed for startup on remaining leaseholder. When node is restarted, it tries to write to a range where it is a part of quorum itself, but circuit breaker immediately rejects operation without giving it a chance to participate in consensus.

@erikgrinaker
Copy link
Contributor

I think we'll have to add retries with exponential backoff for all KV reads/writes on the node startup path that would otherwise cause the node to error out. We should get this fixed asap, since this issue exists in 22.1 onwards, and will be hit by any cluster that temporarily loses quorum on a system range.

@erikgrinaker erikgrinaker removed T-kv KV Team T-server-and-security DB Server & Security labels Feb 14, 2023
@aliher1911 aliher1911 self-assigned this Feb 14, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Feb 14, 2023
@aliher1911
Copy link
Contributor

aliher1911 commented Feb 17, 2023

For posterity, demonstration on how it fails with a simple test:
https://gist.github.com/aliher1911/da4b20b18eb23e6e651008c4e9c82a00
You would expect it to fail (and eventually succeed if retry is added) when doing liveness query after node restart, but what you get is failure to restart server because it writes its liveness on startup.

@nvanbenschoten
Copy link
Member

I think we'll have to add retries with exponential backoff for all KV reads/writes on the node startup path that would otherwise cause the node to error out.

Would it be more straightforward to add an "ignore circuit breakers" flag to the KV write API, instead of introducing retry loops at each caller?

@erikgrinaker
Copy link
Contributor

A lot of these accesses are SQL queries.

@renatolabs
Copy link
Contributor

renatolabs commented Mar 24, 2023

I saw a roachtest [1] fail with the following error:

E230324 01:38:54.202628 1 1@cli/clierror/check.go:35 ⋮ [-] 167 server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/1/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/1/0], [txn: f3417342], [can-forward-ts]› RPC error: grpc: ‹node waiting for init; /cockroach.roachpb.Internal/Batch not available› [code 14/Unavailable] [exhausted]

I think this is the same issue, but let me know and I can create a separate one.

Incidentally, should we make this a GA blocker?

[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=9223255&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=artifacts#%2Ftpccbench%2Fnodes%3D9%2Fcpu%3D4%2Fchaos%2Fpartition%2Frun_1%2Fartifacts.zip!%2Flogs%2F1.unredacted

@tbg
Copy link
Member Author

tbg commented Mar 24, 2023

That does seem similar, it's just a different error bubbling up. @aliher1911 I think your PR currently retries only circuit breaker errors, but we also always need to handle ambiguous results, right? They are a bit harder because now the operation may or may not have concluded, so there are questions about idempotency.

@aliher1911
Copy link
Contributor

I added handling of ambiguous errors there. I'll double check the side effects of retries there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
5 participants