server: KV writes in startup path sensitive to circuit breaker errors #74714

tbg · 2022-01-11T22:57:12Z

Is your feature request related to a problem? Please describe.

There are (potentially many) synchronous KV writes in the server start-up path.
Technically, they have always needed to be prepared to handle errors such as AmbiguousResultError, but in practice these rarely happen.

With the introduction of per-replica circuit breakers, these errors could bubble up (which is what they are supposed to do) more frequently when nodes are restarted while the cluster is under stress or at least partially unavailable, and could cause a node to fail to start. In a hypothetical extreme case, the failure to start a sufficient number of nodes could be the very reason for the outage, resulting in a situation that can only be resolved by disabling the circuit breakers altogether (which is possible via an env var; there's a cluster setting too, but it isn't going to work if the cluster is unavailable).

Describe the solution you'd like

Audit the start path and make sure that all KV uses can react appropriately to circuit breaker and ambiguous result errors.

Jira issue: CRDB-12227

gz#16003


tbg · 2022-06-08T19:43:18Z

Randomly saw one instance of this on #82109

```
* ERROR: ERROR: cockroach server exited with error: error recording initial status summaries: replica unavailable: (n3,s3):3 unable to serve request to r3:/System/{NodeLivenessMax-tsd} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1654716012.232856725,0 (2022-06-08 19:20:12); raft status: {"id":"3","term":44,"vote":"3","commit":52822,"lead":"3","raftState":"StateLeader","applied":52822,"progress":{"1":{"match":0,"next":47625,"state":"StateSnapshot"},"2":{"match":0,"next":52823,"state":"StateProbe"},"3":{"match":52963,"next":52964,"state":"StateReplicate"}},"leadtransferee":"0"}: have been waiting 62.20s for slow proposal RequestLease [/System/NodeLivenessMax,/Min)
*
```

aliher1911 · 2023-02-14T10:58:17Z

Easily reproducible with the following use case:

1. Start a 5-node cluster
2. Kill 2 nodes
3. Kill another node
4. Wait for 60+ seconds
5. Restart the last killed node

With a bit of luck, the node will fail to start. The failure can happen in different places. The reason is that, after the first two nodes are killed, the cluster has only 3 out of 5 replicas of the system ranges. When the third node is killed, those ranges lose quorum. Within 60 seconds, this can trip the circuit breaker on the remaining leaseholder for some range needed for startup. When the node is restarted, it tries to write to a range for which it is itself part of the quorum, but the circuit breaker immediately rejects the operation without giving the node a chance to participate in consensus.
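To make the quorum arithmetic in the steps above concrete, here is a minimal Go sketch. The `quorum` and `hasQuorum` helpers are hypothetical names for illustration only, not CockroachDB code, and the sketch assumes a plain majority quorum over 5 replicas.

```go
package main

import "fmt"

// quorum returns the number of live replicas a majority quorum needs
// (hypothetical helper, plain majority assumed).
func quorum(replicas int) int { return replicas/2 + 1 }

// hasQuorum reports whether the live replicas can still reach consensus.
func hasQuorum(replicas, live int) bool { return live >= quorum(replicas) }

func main() {
	const replicas = 5
	fmt.Println("after killing 2 nodes:", hasQuorum(replicas, 3))    // 3 of 5 live
	fmt.Println("after killing a 3rd node:", hasQuorum(replicas, 2)) // 2 of 5 live
	fmt.Println("once the 3rd node is back:", hasQuorum(replicas, 3))
}
```

The last case is the interesting one: the restarted node is the replica that would restore quorum, so rejecting its startup write up front keeps the range unavailable.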

erikgrinaker · 2023-02-14T11:24:46Z

I think we'll have to add retries with exponential backoff for all KV reads/writes on the node startup path that would otherwise cause the node to error out. We should get this fixed asap, since this issue exists in 22.1 onwards, and will be hit by any cluster that temporarily loses quorum on a system range.

aliher1911 · 2023-02-17T16:57:41Z

For posterity, here is a demonstration of how it fails, using a simple test:
https://gist.github.com/aliher1911/da4b20b18eb23e6e651008c4e9c82a00
You would expect the failure (and eventual success, once a retry is added) to occur when doing a liveness query after the node restart, but what you actually get is a failure to restart the server, because it writes its liveness record on startup.

nvanbenschoten · 2023-02-28T16:53:30Z

> I think we'll have to add retries with exponential backoff for all KV reads/writes on the node startup path that would otherwise cause the node to error out.

Would it be more straightforward to add an "ignore circuit breakers" flag to the KV write API, instead of introducing retry loops at each caller?

erikgrinaker · 2023-02-28T17:02:50Z

A lot of these accesses are SQL queries.

renatolabs · 2023-03-24T16:38:12Z

I saw a roachtest [1] fail with the following error:

E230324 01:38:54.202628 1 1@cli/clierror/check.go:35 ⋮ [-] 167 server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/1/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/1/0], [txn: f3417342], [can-forward-ts]› RPC error: grpc: ‹node waiting for init; /cockroach.roachpb.Internal/Batch not available› [code 14/Unavailable] [exhausted]

I think this is the same issue, but let me know and I can create a separate one.

Incidentally, should we make this a GA blocker?

[1] https://teamcity.cockroachdb.com/viewLog.html?buildId=9223255&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=artifacts#%2Ftpccbench%2Fnodes%3D9%2Fcpu%3D4%2Fchaos%2Fpartition%2Frun_1%2Fartifacts.zip!%2Flogs%2F1.unredacted

tbg · 2023-03-24T16:50:50Z

That does seem similar, it's just a different error bubbling up. @aliher1911 I think your PR currently retries only circuit breaker errors, but we also always need to handle ambiguous results, right? They are a bit harder because now the operation may or may not have concluded, so there are questions about idempotency.
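To illustrate the idempotency concern, here is a toy Go sketch using an in-memory store. All names (`store`, `errAmbiguous`, `putIdempotent`) are hypothetical and this is not how the linked PR handles it. It assumes the write is a blind put of a fixed value, so that reading the key back is enough to tell whether the ambiguous attempt actually applied.

```go
package main

import (
	"errors"
	"fmt"
)

// errAmbiguous is a hypothetical stand-in for an ambiguous result error.
var errAmbiguous = errors.New("result is ambiguous")

// store is a toy in-memory KV store. When ambiguousNext is set, the next
// Put applies the write but still reports an ambiguous result.
type store struct {
	data          map[string]string
	ambiguousNext bool
}

func (s *store) Put(k, v string) error {
	s.data[k] = v
	if s.ambiguousNext {
		s.ambiguousNext = false
		return errAmbiguous
	}
	return nil
}

func (s *store) Get(k string) (string, bool) {
	v, ok := s.data[k]
	return v, ok
}

// putIdempotent writes k=v. After an ambiguous result it reads the key back
// and re-issues the write only if the value is not already in place.
func putIdempotent(s *store, k, v string, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = s.Put(k, v)
		if err == nil || !errors.Is(err, errAmbiguous) {
			return attempt, err
		}
		if got, ok := s.Get(k); ok && got == v {
			return attempt, nil // the ambiguous write did apply
		}
	}
	return maxAttempts, err
}

func main() {
	s := &store{data: map[string]string{}, ambiguousNext: true}
	attempts, err := putIdempotent(s, "liveness/n3", "epoch=2", 3)
	fmt.Println(attempts, err)
}
```

A check like this is only safe when repeating the operation cannot change the outcome. Operations such as increments, or writes that other actors may overwrite in the meantime, need a different strategy.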

aliher1911 · 2023-03-29T16:34:47Z

I added handling of ambiguous errors there. I'll double check the side effects of retries there.

tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 11, 2022

blathers-crl bot added the T-kv KV Team label Jan 11, 2022

blathers-crl bot added the T-server-and-security DB Server & Security label Jan 11, 2022

tbg self-assigned this Jan 14, 2022

tbg mentioned this issue Jan 14, 2022

kvserver: circuit-break requests to unavailable ranges #33007

Closed

erikgrinaker added the T-kv-replication label May 31, 2022

tbg removed their assignment May 31, 2022

erikgrinaker removed T-kv KV Team T-server-and-security DB Server & Security labels Feb 14, 2023

aliher1911 self-assigned this Feb 14, 2023

erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Feb 14, 2023

aliher1911 mentioned this issue Feb 14, 2023

kv: acceptance/gossip/peerings flaked #96091

Closed

renatolabs mentioned this issue Feb 28, 2023

roachtest: tpcc/multiregion/survive=region/chaos=true failed #97185

Closed

This was referenced Mar 7, 2023

server: retry ambiguous kv/sql operations on node startup #97710

Merged

...server: perform replica retries on startup path #98129

Closed

This was referenced Mar 24, 2023

roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #99523

Closed

roachtest: tpccbench/nodes=3/cpu=16/enc=true failed #99568

Closed

roachtest: tpccbench/nodes=12/cpu=16/enc=true failed #99586

Closed

craig bot closed this as completed in db4c5ac Apr 3, 2023

blathers-crl bot mentioned this issue Apr 3, 2023

release-23.1: server: retry ambiguous kv/sql operations on node startup #100458

Merged

aliher1911 mentioned this issue Apr 3, 2023

release-22.2: server: retry ambiguous kv/sql operations on node startup #100463

Merged
