roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #99523

Closed
cockroach-teamcity opened this issue Mar 24, 2023 · 13 comments · Fixed by #100213
Assignees
Labels
A-kv-server Relating to the KV-level RPC server branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 24, 2023

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 19a6b804d3aff74d74619a75cac3b52338c7aa02:

test artifacts and logs in: /artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9225780-1679635179-117-n7cpu16-geo/1679693178495484334?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_212618.495516165_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-25911

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 24, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Mar 24, 2023
@renatolabs
Contributor

E230324 21:26:22.610100 1 1@cli/clierror/check.go:35 ⋮ [-] 177 server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/1/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/1/0], [txn: 712b3389], [can-forward-ts]› RPC error: grpc: ‹error reading from server: read tcp 10.128.0.79:35780->10.128.0.81:26257: read: connection reset by peer› [code 14/Unavailable] [exhausted]

Instance of #74714.

@renatolabs renatolabs added T-kv KV Team and removed T-testeng TestEng Team labels Mar 24, 2023
@nvanbenschoten nvanbenschoten added T-kv-replication and removed T-kv KV Team labels Mar 27, 2023
@blathers-crl

blathers-crl bot commented Mar 27, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Yeah, this just seems like a startup race where some nodes got ready before others and failed to propose a write. This is a long-standing problem, and while it's related to #74714 in that nodes should retry errors during startup, it isn't an instance of the circuit breaker bug that causes #74714 to be a GA blocker (where the replica circuit breakers prevent nodes from restoring quorum on ranges and thus prevent the nodes from ever starting). Closing this out as a known bug.

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 28, 2023
@exalate-issue-sync exalate-issue-sync bot added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server labels Mar 28, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 28, 2023
@renatolabs
Contributor

@erikgrinaker Correct, I was a little liberal when I said "instance of #74714"; I used that as an implicit umbrella issue that covers ambiguous errors during server startup as well.

Are you saying you think that should not be a GA blocker? IMO, it should be; we have been seeing lots of occurrences of failures like this recently, much more frequently than before.

@erikgrinaker
Contributor

Are you saying you think that should not be a GA blocker? IMO, it should be; we have been seeing lots of occurrences of failures like this recently, much more frequently than before.

Well, CockroachDB has never handled errors during startup, so we're not regressing in that sense at least. If we're seeing this more often, it must be because there's been a change in the operations that we're performing during startup that makes it more vulnerable to such failures. Could even be due to the backup schedules that we're injecting into node startup. I'll have a closer look at what these writes actually are.

@erikgrinaker
Contributor

Both of the failed writes here are to the sql_instance table. Previously, we only wrote these for SQL pods in serverless, but in #95355 we began writing them on startup for all nodes regardless of deployment model. I believe this may be the first synchronous write to KV during node startup (other writes are async and generally retry). It's possibly happening too early, before KV is fully spun up, although I'm not sure why a connection refused error results in an ambiguous result rather than a retryable error here.

// If we have a nodeID, set our SQL instance ID to the node
// ID. Otherwise, allow our SQL instance ID to be generated by
// SQL.
nodeID, hasNodeID := s.sqlIDContainer.OptionalNodeID()
var instance sqlinstance.InstanceInfo
if hasNodeID {
	// Write/acquire our instance row.
	instance, err = s.sqlInstanceStorage.CreateNodeInstance(
		ctx,
		session.ID(),
		session.Expiration(),
		s.cfg.AdvertiseAddr,
		s.cfg.SQLAdvertiseAddr,
		s.distSQLServer.Locality,
		s.execCfg.Settings.Version.BinaryVersion(),
		nodeID,
	)
	if err != nil {
		return err
	}
} else {
	instance, err = s.sqlInstanceStorage.CreateInstance(
		ctx,
		session.ID(),
		session.Expiration(),
		s.cfg.AdvertiseAddr,
		s.cfg.SQLAdvertiseAddr,
		s.distSQLServer.Locality,
		s.execCfg.Settings.Version.BinaryVersion(),
	)
	if err != nil {
		return err
	}
}

@erikgrinaker
Contributor

There's a hint in the last part of the error here:

server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/1/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/1/0], [txn: 712b3389], [can-forward-ts]› RPC error: grpc: ‹error reading from server: read tcp 10.128.0.79:35780->10.128.0.81:26257: read: connection reset by peer› [code 14/Unavailable] [exhausted]

"exhausted" implies that the DistSender tried sending to all known replicas and couldn't find one that was reachable. In a six-node cluster, I suppose it's plausible that the nodes with sql_instance replicas haven't completed startup yet this early during cluster startup.

The obvious fix is to add retries here, as in #74714. However, I know we also do other writes during startup that don't retry, and it's unclear to me why we don't see these sorts of issues with those. Will have a closer look.
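
For illustration only, here is a minimal, self-contained Go sketch of what a retry loop around this startup write could look like. It is not the actual fix: createInstance is a hypothetical stand-in for the CreateNodeInstance/CreateInstance calls quoted above, and the backoff values are arbitrary. Blindly retrying an ambiguous commit error isn't safe in general, which is part of why the more targeted approach discussed below (marking the error as unambiguous so the DistSender retries) is preferable.

// Hypothetical sketch, not the actual CockroachDB change: retry the startup
// write with exponential backoff until it succeeds or the context expires.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createInstance stands in for the CreateNodeInstance/CreateInstance calls
// above; in the real server it would perform the KV write.
func createInstance(ctx context.Context) error {
	return errors.New("ba: ... [code 14/Unavailable] [exhausted]") // placeholder failure
}

// createInstanceWithRetry retries the write on the assumption that
// early-startup unavailability is transient.
func createInstanceWithRetry(ctx context.Context) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for {
		err := createInstance(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("startup write failed, retrying in %s: %v\n", backoff, err)
		select {
		case <-ctx.Done():
			return ctx.Err() // give up if startup is aborted
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(createInstanceWithRetry(ctx))
}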

@erikgrinaker
Contributor

erikgrinaker commented Mar 30, 2023

Hm, both here and in #99568 the initial error was in fact "node waiting for init":

server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/3/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/3/0], [txn: bb3ff1ac], [can-forward-ts]› RPC error: grpc: ‹node waiting for init; /cockroach.roachpb.Internal/Batch not available› [code 14/Unavailable] [exhausted]

The connection reset by peer in the comment above happened because n3 then went on to crash while processing an EndTxn from n1, causing n1 to crash.

Now, the error handling around these parts is a bit annoying because gRPC doesn't give us a good way to distinguish ambiguous and unambiguous errors, see:

// For most connection errors, we cannot tell whether or not the request
// may have succeeded on the remote server (exceptions are captured in the
// grpcutil.RequestDidNotStart function). We'll retry the request in order
// to attempt to eliminate the ambiguity; see below. If there's a commit
// in the batch, we track the ambiguity more explicitly by setting
// ambiguousError. This serves two purposes:
// 1) the higher-level retries in the DistSender will not forget the
// ambiguity, like they forget it for non-commit batches. This in turn
// will ensure that TxnCoordSender-level retries don't happen across
// commits; that'd be bad since requests are not idempotent across
// commits.
// TODO(andrei): This higher-level does things too bluntly, retrying only
// in case of sendError. It should also retry in case of
// AmbiguousRetryError as long as it makes sure to not forget about the
// ambiguity.
// 2) SQL recognizes AmbiguousResultErrors and gives them a special code
// (StatementCompletionUnknown).
// TODO(andrei): The use of this code is inconsistent because a) the
// DistSender tries to only return the code for commits, but it'll happily
// forward along AmbiguousResultErrors coming from the replica and b) we
// probably should be returning that code for non-commit statements too.
//
// We retry requests in order to avoid returning errors (in particular,
// AmbiguousResultError). Retrying the batch will either:
// a) succeed if the request had not been evaluated the first time.
// b) succeed if the request also succeeded the first time, but is
// idempotent (i.e. it is internal to a txn, without a commit in the
// batch).
// c) fail if it succeeded the first time and the request is not
// idempotent. In the case of EndTxn requests, this is ensured by the
// tombstone keys in the timestamp cache. The retry failing does not
// prove that the request did not succeed the first time around, so we
// can't claim success (and even if we could claim success, we still
// wouldn't have the complete result of the successful evaluation).
//
// Case a) is great - the retry made the request succeed. Case b) is also
// good; due to idempotency we managed to swallow a communication error.
// Case c) is not great - we'll end up returning an error even though the
// request might have succeeded (an AmbiguousResultError if withCommit is
// set).
//
// TODO(andrei): Case c) is broken for non-transactional requests: nothing
// prevents them from double evaluation. This can result in, for example,
// an increment applying twice, or more subtle problems like a blind write
// evaluating twice, overwriting another unrelated write that fell
// in-between.
//
if withCommit && !grpcutil.RequestDidNotStart(err) {
ambiguousError = err
}

// This is where you'd hope to treat some gRPC errors as unambiguous.
// Unfortunately, gRPC provides no good way to distinguish ambiguous from
// unambiguous failures.
//
// https://github.com/grpc/grpc-go/issues/1443
// https://github.com/cockroachdb/cockroach/issues/19708#issuecomment-343891640

However, we do know for sure that "node waiting for init" is unambiguous, and IsWaitingForInit(error) can tell us as much. If we mark that error as unambiguous then the DistSender will keep retrying for us:

if ambiguousErr != nil {
	return kvpb.NewAmbiguousResultErrorf("error=%s [exhausted]", ambiguousErr)
}
// TODO(bdarnell): The error from the last attempt is not necessarily the best
// one to return; we may want to remember the "best" error we've seen (for
// example, a NotLeaseHolderError conveys more information than a
// RangeNotFound).
return newSendError(fmt.Sprintf("sending to all replicas failed; last error: %s", lastAttemptErr))

case errors.HasType(err, sendError{}):
	// We've tried all the replicas without success. Either they're all
	// down, or we're using an out-of-date range descriptor. Evict from the
	// cache and try again with an updated descriptor. Re-sending the
	// request is ok even though it might have succeeded the first time
	// around because of idempotency.

This should go a long way towards mitigating this, since I believe other typical connection refused errors will be handled by the InitialHeartbeatFailedError condition:

if errors.HasType(err, (*netutil.InitialHeartbeatFailedError)(nil)) ||
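
To make that classification concrete, here is a small, self-contained Go sketch. It is not CockroachDB code: isWaitingForInit, requestDidNotStart, and classify are illustrative stand-ins for IsWaitingForInit, grpcutil.RequestDidNotStart, and the sendToReplicas logic quoted above. The point is that a commit batch is only marked ambiguous when we can't prove the request never started, so treating "node waiting for init" as did-not-start keeps the DistSender retrying instead of surfacing an AmbiguousResultError.

// Hypothetical sketch, not CockroachDB code: classify a failed commit batch as
// ambiguous or unambiguous based on whether we can prove the RPC never started.
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isWaitingForInit is a stand-in for the real IsWaitingForInit helper; here it
// simply matches on the error text for illustration.
func isWaitingForInit(err error) bool {
	return err != nil && strings.Contains(err.Error(), "node waiting for init")
}

// requestDidNotStart mimics the role of grpcutil.RequestDidNotStart: if the
// RPC provably never reached evaluation, retrying is safe even for commits.
func requestDidNotStart(err error) bool {
	return isWaitingForInit(err)
}

// classify mirrors the decision in the sendToReplicas excerpt above: a commit
// batch is only marked ambiguous when we can't rule out partial evaluation.
func classify(withCommit bool, err error) string {
	if withCommit && !requestDidNotStart(err) {
		return "ambiguous: becomes AmbiguousResultError once all replicas are exhausted"
	}
	return "unambiguous: the DistSender keeps retrying other replicas"
}

func main() {
	waiting := errors.New("node waiting for init; /cockroach.roachpb.Internal/Batch not available")
	reset := errors.New("read tcp 10.128.0.79:35780->10.128.0.81:26257: read: connection reset by peer")
	fmt.Println(classify(true, waiting)) // safe to retry
	fmt.Println(classify(true, reset))   // may have been evaluated; stays ambiguous
}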

@erikgrinaker
Contributor

Submitted #100213, which I believe should address this in a more targeted fashion.

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 1f8024bf14433ca169e5a8c3768c5d223dc5018c:

test artifacts and logs in: /artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: COMMAND_PROBLEM: ssh verbose log retained in ssh_192708.078678661_n7_cockroach-workload-r.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@erikgrinaker
Contributor

Latest failure was:

expected host to be part of teamcity-9329887-1680153692-187-n7cpu16-geo, but is `hostname`

Which I believe is tracked elsewhere.

@renatolabs
Contributor

Which I believe is tracked elsewhere.

It's not; that error shouldn't come up in roachtests. Where do you see it?

The error in the failure reported in the comment immediately before yours is the known workload-hanging issue introduced by #98689 (which has since been reverted).

Attempt to create load generator failed. It's been more than 1h30m0s since we started trying to create the load generator so we're giving up

@erikgrinaker
Contributor

Ah, that makes more sense -- I was expecting the workload issue. My bad, I misinterpreted a line from the SSH log while I was grepping for some stuff, and thought I'd seen a Slack message about it earlier today:

debug1: Sending command: 
if ! [[ `hostname` == 'teamcity-9329887-1680153692-187-n7cpu16-geo-0007' ]]; then
    echo "expected host to be part of teamcity-9329887-1680153692-187-n7cpu16-geo, but is `hostname`"
    exit 1

else
    export ROACHPROD=7 GOTRACEBACK=crash COCKROACH_TESTING_FORCE_RELEASE_BRANCH=true && bash -c "./cockroach workload run tpcc --warehouses=
debug2: channel 0: request exec confirm 1^M 

I think that's my cue to go to bed. :)
