roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #99523

Closed
cockroach-teamcity opened this issue Mar 24, 2023 · 13 comments · Fixed by #100213
Assignees
Labels
A-kv-server Relating to the KV-level RPC server branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 24, 2023

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 19a6b804d3aff74d74619a75cac3b52338c7aa02:

test artifacts and logs in: /artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9225780-1679635179-117-n7cpu16-geo/1679693178495484334?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_212618.495516165_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-25911

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 24, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Mar 24, 2023
@renatolabs
Contributor

E230324 21:26:22.610100 1 1@cli/clierror/check.go:35 ⋮ [-] 177 server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/1/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/1/0], [txn: 712b3389], [can-forward-ts]› RPC error: grpc: ‹error reading from server: read tcp 10.128.0.79:35780->10.128.0.81:26257: read: connection reset by peer› [code 14/Unavailable] [exhausted]

Instance of #74714.

@renatolabs renatolabs added T-kv KV Team and removed T-testeng TestEng Team labels Mar 24, 2023
@nvanbenschoten nvanbenschoten added T-kv-replication and removed T-kv KV Team labels Mar 27, 2023
@blathers-crl

blathers-crl bot commented Mar 27, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Yeah, this just seems like a startup race where some nodes got ready before others and failed to propose a write. This is a long-standing problem, and while it's related to #74714 in that nodes should retry errors during startup, it isn't an instance of the circuit breaker bug that causes #74714 to be a GA blocker (where the replica circuit breakers prevent nodes from restoring quorum on ranges and thus prevent the nodes from ever starting). Closing this out as a known bug.

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 28, 2023
@exalate-issue-sync exalate-issue-sync bot added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server labels Mar 28, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 28, 2023
@renatolabs
Contributor

@erikgrinaker Correct, I was a little liberal when I said "instance of #74714"; I used that as an implicit umbrella issue that covers ambiguous errors during server startup as well.

Are you saying you think that should not be a GA blocker? IMO, it should be; we have been seeing lots of occurrences of failures like this recently, much more frequently than before.

@erikgrinaker
Contributor

Are you saying you think that should not be a GA blocker? IMO, it should be; we have been seeing lots of occurrences of failures like this recently, much more frequently than before.

Well, CockroachDB has never handled errors during startup, so we're not regressing in that sense at least. If we're seeing this more often, it must be because there's been a change in the operations that we're performing during startup that makes it more vulnerable to such failures. Could even be due to the backup schedules that we're injecting into node startup. I'll have a closer look at what these writes actually are.

@erikgrinaker
Contributor

Both of the failed writes here are to the sql_instance table. Previously, we only wrote these for SQL pods in serverless, but in #95355 we began writing them on startup for all nodes regardless of deployment model. I believe this may be the first synchronous write to KV during node startup (other writes are async and generally retry). It's possibly happening too early, before KV is fully spun up, although I'm not sure why a connection refused error results in an ambiguous result rather than a retryable error here.

// If we have a nodeID, set our SQL instance ID to the node
// ID. Otherwise, allow our SQL instance ID to be generated by
// SQL.
nodeID, hasNodeID := s.sqlIDContainer.OptionalNodeID()
var instance sqlinstance.InstanceInfo
if hasNodeID {
	// Write/acquire our instance row.
	instance, err = s.sqlInstanceStorage.CreateNodeInstance(
		ctx,
		session.ID(),
		session.Expiration(),
		s.cfg.AdvertiseAddr,
		s.cfg.SQLAdvertiseAddr,
		s.distSQLServer.Locality,
		s.execCfg.Settings.Version.BinaryVersion(),
		nodeID,
	)
	if err != nil {
		return err
	}
} else {
	instance, err = s.sqlInstanceStorage.CreateInstance(
		ctx,
		session.ID(),
		session.Expiration(),
		s.cfg.AdvertiseAddr,
		s.cfg.SQLAdvertiseAddr,
		s.distSQLServer.Locality,
		s.execCfg.Settings.Version.BinaryVersion(),
	)
	if err != nil {
		return err
	}
}

@erikgrinaker
Contributor

There's a hint in the last part of the error here:

server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/1/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/1/0], [txn: 712b3389], [can-forward-ts]› RPC error: grpc: ‹error reading from server: read tcp 10.128.0.79:35780->10.128.0.81:26257: read: connection reset by peer› [code 14/Unavailable] [exhausted]

"exhausted" implies that the DistSender tried sending to all known replicas and couldn't find one that was reachable. In a six-node cluster, I suppose it's plausible that the nodes with sql_instance replicas haven't completed startup yet this early during cluster startup.

The obvious fix is to add retries here, as in #74714. However, I know we also do other writes during startup that don't retry, and it's unclear to me why we don't see these sorts of issues with those. Will have a closer look.
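
For illustration only, here is a minimal, self-contained Go sketch of what a retry loop around this startup write could look like. It is not the actual fix: createInstance is a hypothetical stand-in for the CreateNodeInstance/CreateInstance calls quoted above, and the backoff values are arbitrary. Blindly retrying an ambiguous commit error isn't safe in general, which is part of why the more targeted approach discussed below (marking the error as unambiguous so the DistSender retries) is preferable.

// Hypothetical sketch, not the actual CockroachDB change: retry the startup
// write with exponential backoff until it succeeds or the context expires.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createInstance stands in for the CreateNodeInstance/CreateInstance calls
// above; in the real server it would perform the KV write.
func createInstance(ctx context.Context) error {
	return errors.New("ba: ... [code 14/Unavailable] [exhausted]") // placeholder failure
}

// createInstanceWithRetry retries the write on the assumption that
// early-startup unavailability is transient.
func createInstanceWithRetry(ctx context.Context) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for {
		err := createInstance(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("startup write failed, retrying in %s: %v\n", backoff, err)
		select {
		case <-ctx.Done():
			return ctx.Err() // give up if startup is aborted
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(createInstanceWithRetry(ctx))
}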

@erikgrinaker
Contributor

erikgrinaker commented Mar 30, 2023

Hm, both here and in #99568 the initial error was in fact "node waiting for init":

server startup failed: cockroach server exited with error: result is ambiguous: error=ba: ‹Put [/Table/46/2/"\x80"/3/0,/Min), EndTxn(parallel commit) [/Table/46/2/"\x80"/3/0], [txn: bb3ff1ac], [can-forward-ts]› RPC error: grpc: ‹node waiting for init; /cockroach.roachpb.Internal/Batch not available› [code 14/Unavailable] [exhausted]

The connection reset by peer in the comment above happened because n3 then went on to crash while processing an EndTxn from n1, causing n1 to crash.

Now, the error handling around these parts is a bit annoying because gRPC doesn't give us a good way to distinguish ambiguous and unambiguous errors, see:

// For most connection errors, we cannot tell whether or not the request
// may have succeeded on the remote server (exceptions are captured in the
// grpcutil.RequestDidNotStart function). We'll retry the request in order
// to attempt to eliminate the ambiguity; see below. If there's a commit
// in the batch, we track the ambiguity more explicitly by setting
// ambiguousError. This serves two purposes:
// 1) the higher-level retries in the DistSender will not forget the
// ambiguity, like they forget it for non-commit batches. This in turn
// will ensure that TxnCoordSender-level retries don't happen across
// commits; that'd be bad since requests are not idempotent across
// commits.
// TODO(andrei): This higher-level does things too bluntly, retrying only
// in case of sendError. It should also retry in case of
// AmbiguousRetryError as long as it makes sure to not forget about the
// ambiguity.
// 2) SQL recognizes AmbiguousResultErrors and gives them a special code
// (StatementCompletionUnknown).
// TODO(andrei): The use of this code is inconsistent because a) the
// DistSender tries to only return the code for commits, but it'll happily
// forward along AmbiguousResultErrors coming from the replica and b) we
// probably should be returning that code for non-commit statements too.
//
// We retry requests in order to avoid returning errors (in particular,
// AmbiguousResultError). Retrying the batch will either:
// a) succeed if the request had not been evaluated the first time.
// b) succeed if the request also succeeded the first time, but is
// idempotent (i.e. it is internal to a txn, without a commit in the
// batch).
// c) fail if it succeeded the first time and the request is not
// idempotent. In the case of EndTxn requests, this is ensured by the
// tombstone keys in the timestamp cache. The retry failing does not
// prove that the request did not succeed the first time around, so we
// can't claim success (and even if we could claim success, we still
// wouldn't have the complete result of the successful evaluation).
//
// Case a) is great - the retry made the request succeed. Case b) is also
// good; due to idempotency we managed to swallow a communication error.
// Case c) is not great - we'll end up returning an error even though the
// request might have succeeded (an AmbiguousResultError if withCommit is
// set).
//
// TODO(andrei): Case c) is broken for non-transactional requests: nothing
// prevents them from double evaluation. This can result in, for example,
// an increment applying twice, or more subtle problems like a blind write
// evaluating twice, overwriting another unrelated write that fell
// in-between.
//
if withCommit && !grpcutil.RequestDidNotStart(err) {
ambiguousError = err
}

// This is where you'd hope to treat some gRPC errors as unambiguous.
// Unfortunately, gRPC provides no good way to distinguish ambiguous from
// unambiguous failures.
//
// https://github.com/grpc/grpc-go/issues/1443
// https://github.com/cockroachdb/cockroach/issues/19708#issuecomment-343891640

However, we do know for sure that "node waiting for init" is unambiguous, and IsWaitingForInit(error) can tell us as much. If we mark that error as unambiguous then the DistSender will keep retrying for us:

if ambiguousErr != nil {
	return kvpb.NewAmbiguousResultErrorf("error=%s [exhausted]", ambiguousErr)
}
// TODO(bdarnell): The error from the last attempt is not necessarily the best
// one to return; we may want to remember the "best" error we've seen (for
// example, a NotLeaseHolderError conveys more information than a
// RangeNotFound).
return newSendError(fmt.Sprintf("sending to all replicas failed; last error: %s", lastAttemptErr))

case errors.HasType(err, sendError{}):
	// We've tried all the replicas without success. Either they're all
	// down, or we're using an out-of-date range descriptor. Evict from the
	// cache and try again with an updated descriptor. Re-sending the
	// request is ok even though it might have succeeded the first time
	// around because of idempotency.

This should go a long way towards mitigating this, since I believe other typical connection refused errors will be handled by the InitialHeartbeatFailedError condition:

if errors.HasType(err, (*netutil.InitialHeartbeatFailedError)(nil)) ||
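
To make that classification concrete, here is a small, self-contained Go sketch. It is not CockroachDB code: isWaitingForInit, requestDidNotStart, and classify are illustrative stand-ins for IsWaitingForInit, grpcutil.RequestDidNotStart, and the sendToReplicas logic quoted above. The point is that a commit batch is only marked ambiguous when we can't prove the request never started, so treating "node waiting for init" as did-not-start keeps the DistSender retrying instead of surfacing an AmbiguousResultError.

// Hypothetical sketch, not CockroachDB code: classify a failed commit batch as
// ambiguous or unambiguous based on whether we can prove the RPC never started.
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isWaitingForInit is a stand-in for the real IsWaitingForInit helper; here it
// simply matches on the error text for illustration.
func isWaitingForInit(err error) bool {
	return err != nil && strings.Contains(err.Error(), "node waiting for init")
}

// requestDidNotStart mimics the role of grpcutil.RequestDidNotStart: if the
// RPC provably never reached evaluation, retrying is safe even for commits.
func requestDidNotStart(err error) bool {
	return isWaitingForInit(err)
}

// classify mirrors the decision in the sendToReplicas excerpt above: a commit
// batch is only marked ambiguous when we can't rule out partial evaluation.
func classify(withCommit bool, err error) string {
	if withCommit && !requestDidNotStart(err) {
		return "ambiguous: becomes AmbiguousResultError once all replicas are exhausted"
	}
	return "unambiguous: the DistSender keeps retrying other replicas"
}

func main() {
	waiting := errors.New("node waiting for init; /cockroach.roachpb.Internal/Batch not available")
	reset := errors.New("read tcp 10.128.0.79:35780->10.128.0.81:26257: read: connection reset by peer")
	fmt.Println(classify(true, waiting)) // safe to retry
	fmt.Println(classify(true, reset))   // may have been evaluated; stays ambiguous
}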

@erikgrinaker
Contributor

Submitted #100213, which I believe should address this in a more targeted fashion.

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=6/cpu=16/multi-az failed with artifacts on master @ 1f8024bf14433ca169e5a8c3768c5d223dc5018c:

test artifacts and logs in: /artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: COMMAND_PROBLEM: ssh verbose log retained in ssh_192708.078678661_n7_cockroach-workload-r.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@erikgrinaker
Contributor

Latest failure was:

expected host to be part of teamcity-9329887-1680153692-187-n7cpu16-geo, but is `hostname`

Which I believe is tracked elsewhere.

@renatolabs
Contributor

Which I believe is tracked elsewhere.

It's not; that error shouldn't come up in roachtests. Where do you see it?

The error in the failure reported in the comment immediately before yours is the known workload-hanging issue introduced by #98689 (which has since been reverted).

Attempt to create load generator failed. It's been more than 1h30m0s since we started trying to create the load generator so we're giving up

@erikgrinaker
Contributor

Ah, that makes more sense -- I was expecting the workload issue. My bad, I misinterpreted a line from the SSH log while I was grepping for some stuff, and thought I'd seen a Slack message about it earlier today:

debug1: Sending command: 
if ! [[ `hostname` == 'teamcity-9329887-1680153692-187-n7cpu16-geo-0007' ]]; then
    echo "expected host to be part of teamcity-9329887-1680153692-187-n7cpu16-geo, but is `hostname`"
    exit 1

else
    export ROACHPROD=7 GOTRACEBACK=crash COCKROACH_TESTING_FORCE_RELEASE_BRANCH=true && bash -c "./cockroach workload run tpcc --warehouses=
debug2: channel 0: request exec confirm 1^M 

I think that's my cue to go to bed. :)
