roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #99523
Instance of #74714. |
cc @cockroachdb/replication |
Yeah, this just seems like a startup race where some nodes got ready before others and failed to propose a write. This is a long-standing problem, and while it's related to #74714 in that nodes should retry errors during startup, it isn't an instance of the circuit breaker bug that causes #74714 to be a GA blocker (where the replica circuit breakers prevent nodes from restoring quorum on ranges and thus prevent the nodes from ever starting). Closing this out as a known bug. |
@erikgrinaker Correct, I was a little liberal when I said "instance of #74714"; I used that as an implicit umbrella issue that covers ambiguous errors during server startup as well. Are you saying you think that should not be a GA blocker? IMO, it should be; we have been seeing lots of occurrences of failures like this recently, much more frequently than before. |
Well, CockroachDB has never handled errors during startup, so we're not regressing in that sense at least. If we're seeing this more often, it must be because there's been a change in the operations that we're performing during startup that makes it more vulnerable to such failures. Could even be due to the backup schedules that we're injecting into node startup. I'll have a closer look at what these writes actually are. |
Both of the failed writes here are to the cockroach/pkg/server/server_sql.go Lines 1445 to 1478 in 80ef781
|
There's a hint in the last part of the error here:
"exhausted" implies that the DistSender tried sending to all known replicas and couldn't find one that was reachable. In a six-node cluster, I suppose it's plausible that the nodes with The obvious fix is to add retries here, as in #74714. However, I know we also do other writes during startup that don't retry, and it's unclear to me why we don't see these sorts of issues with those. Will have a closer look. |
Hm, both here and in #99568 the initial error was in fact "node waiting for init":
The connection reset by peer in the comment above happened because n3 then went on to crash while processing an Now, the error handling around these parts is a bit annoying because gRPC doesn't give us a good way to distinguish ambiguous and unambiguous errors, see: cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 2162 to 2211 in c9e4529
cockroach/pkg/util/grpcutil/grpc_util.go Lines 160 to 165 in 3a6e31c
However, we do know for sure that "node waiting for init" is unambiguous, and cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1950 to 1958 in c9e4529
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1704 to 1709 in c9e4529
This should go a long way towards mitigating this, since I believe other typical connection refused errors will be handled by the cockroach/pkg/util/grpcutil/grpc_util.go Line 150 in 3a6e31c
|
Submitted #100213, which I believe should address this in a more targeted fashion. |
roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 1f8024bf14433ca169e5a8c3768c5d223dc5018c:
Parameters: Same failure on other branches
|
Latest failure was:
Which I believe is tracked elsewhere. |
It's not; that error shouldn't come up in roachtests. Where do you see it? The error in the failure reported in the comment immediately before yours is the known workload-hanging issue introduced by #98689 (which has since been reverted).
|
Ah, that makes more sense -- I was expecting the workload issue. My bad, I misinterpreted a line from the SSH log while I was grepping for some stuff, and thought I'd seen a Slack message about it earlier today:
I think that's my cue to go to bed. :) |
roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 19a6b804d3aff74d74619a75cac3b52338c7aa02:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Same failure on other branches
Jira issue: CRDB-25911