
roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #95257

Closed
cockroach-teamcity opened this issue Jan 14, 2023 · 7 comments · Fixed by #96529
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 14, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 8866ecadd99e5f6dc8db9fed5d09bc57cd90b03c:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 7)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-23426

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jan 14, 2023
@adityamaru
Contributor

The node was killed because of:

disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r4050/i31.t6› in 21.10 seconds

Now looking into the buildup to this fatal error.
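For context on this failure mode: Pebble treats a write that takes longer than a configured maximum as a disk stall and fatals the node rather than limping along, which is what the monitor then reports as "dead (exit status 7)". The sketch below only illustrates that watchdog shape; it is not Pebble's actual implementation, and the 20s threshold, package, and function name are assumptions.

```go
package stallsketch

import (
	"fmt"
	"os"
	"time"
)

// maxSyncDuration is an assumed stall threshold; the real limit is a
// storage/Pebble option, not a hard-coded constant.
const maxSyncDuration = 20 * time.Second

// writeWithStallDetection wraps a file write with a watchdog: if the write
// has not completed within maxSyncDuration, it logs a "disk stall detected"
// message and terminates the process.
func writeWithStallDetection(f *os.File, p []byte) error {
	done := make(chan error, 1)
	start := time.Now()
	go func() {
		_, err := f.Write(p) // the potentially stalled I/O
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		fmt.Fprintf(os.Stderr,
			"disk stall detected: unable to write to %s in %.2f seconds\n",
			f.Name(), time.Since(start).Seconds())
		os.Exit(7) // mirror the fatal exit status seen by the roachtest monitor
		return nil // unreachable
	}
}
```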

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 2ad8df3df3272110705984efc32f1453631ce602:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1079).teardownTest: test timed out (5h0m0s)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 78fe59d0536be35fcec68280142746fb91515551:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: read tcp 172.17.0.3:38496 -> 18.216.146.59:26257: read: connection reset by peer

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@adityamaru
Contributor

I'm going to focus first on the timeout failure mode in this issue - #95257 (comment).

The test started at 16:18:50 and timed out at 21:18:50. Loading the tsdump, we see that we stopped sending AddSSTable requests at ~18:30.

[Screenshot attached: tsdump graph, captured 2023-01-30 at 2:07:31 PM]

At the timestamp at which we stopped sending AddSSTable requests we had ingested:

diskusage: 18:49:43 restore.go:230: n#6: 405 GiB, n#4: 399 GiB, n#1: 380 GiB, n#5: 366 GiB, n#9: 364 GiB, n#7: 321 GiB, n#8: 302 GiB, n#2: 294 GiB, n#3: 270 GiB, n#10: 262 GiB

which, compared with another nightly run, is far behind what we usually ingest by this timestamp:

diskusage: 18:49:23 restore.go:230: n#3: 675 GiB, n#7: 654 GiB, n#1: 651 GiB, n#5: 642 GiB, n#6: 642 GiB, n#9: 630 GiB, n#8: 622 GiB, n#2: 613 GiB, n#10: 586 GiB, n#4: 566 GiB

This made me suspect a slow disk or a disk stall, but neither storage.disk-slow nor storage.disk-stall has a non-zero value while we were ingesting. The fact that there are zero AddSSTable requests makes me think there is a deadlock of sorts that has ground everything to a halt. I'll now look at the stacks to see what they reveal.

@adityamaru
Contributor

Digging some more, we see that all nodes have restore_data_processor stacks that are stuck waiting for entries from the upstream split-and-scatter processors.

In the newly added generative_split_and_scatter_processor we see a goroutine stuck waiting on the doneScatterCh for 298 minutes! This is basically the duration of the whole test, so how did the restore make any progress? My hunch is that, since this is a buffered channel, we pushed some work into it that powered the ingestion for the next two hours. Before generating all the spans, however, something happened in this goroutine at

`case doneScatterCh <- scatteredEntry:`

that caused us to stop pushing data into the channel. I also see no stack for

`g.GoCtx(func(ctx context.Context) error {`

or

`for worker := 0; worker < splitScatterWorkers; worker++ {`

which makes me think that the goroutine in g2 exited and we closed `importSpanChunkCh`, but we didn't tear down the other goroutines in the `g` wait group. My guess is we saw a context cancellation that didn't propagate everywhere it should have. I can see a few places that need fixing up, so I will send out a patch.
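To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the pattern; the names echo the stacks above, but the code is a simplification, not the actual processor.

```go
package main

import (
	"context"
	"time"
)

type scatteredEntry struct{ spanID int }

// producer mimics the suspected bug: a bare channel send with no ctx.Done()
// case, so once the consumer stops reading, the send blocks forever even
// though the context (which this goroutine ignores) has been canceled.
func producer(ctx context.Context, doneScatterCh chan scatteredEntry) {
	for i := 0; ; i++ {
		doneScatterCh <- scatteredEntry{spanID: i} // blocks once the buffer fills
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	doneScatterCh := make(chan scatteredEntry, 4) // buffered: early sends succeed

	go producer(ctx, doneScatterCh)
	<-doneScatterCh // the consumer reads a little work, powering some progress...
	cancel()
	// ...then exits on cancellation without draining the channel. The producer
	// is now stuck on its send, much like the 298-minute goroutine stack above.
	time.Sleep(100 * time.Millisecond)
}
```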

adityamaru added a commit to adityamaru/cockroach that referenced this issue Jan 31, 2023
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a
5 hour roachtest. The stacks indicated that we may have seen
a context cancellation that was not being respected by the
goroutine running `generateAndSendImportSpans`. This resulted
in the `generative_split_and_scatter_processor` getting stuck writing
to a channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516)
since the other goroutines
in the processor had seen the ctx cancellation and exited.
A side effect of the generative processor not shutting down
was that the downstream restore data processors would also
hang on their call to `input.Next()` as they would not receive
a row or a meta from the generative processor signalling them
to shut down. This fix adds a ctx cancellation check to the
goroutine described above, thereby allowing a
graceful teardown of the flow.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could
hang indefinitely in the face of a context cancellation, manifesting
as a stuck restore job.
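The shape of the fix is, roughly, to make the blocking send cancellation-aware. A minimal sketch with a hypothetical helper (not the actual patch):

```go
package restoresketch

import "context"

// scatteredEntry stands in for the unit of work pushed onto doneScatterCh.
type scatteredEntry struct{ spanID int }

// sendScattered is a cancellation-aware replacement for a bare
// `doneScatterCh <- entry`: the send races against ctx.Done(), so the
// goroutine unblocks and returns an error when the flow is canceled,
// letting the errgroup tear everything down gracefully.
func sendScattered(ctx context.Context, doneScatterCh chan<- scatteredEntry, e scatteredEntry) error {
	select {
	case doneScatterCh <- e:
		return nil
	case <-ctx.Done():
		return ctx.Err() // propagate the cancellation to the caller
	}
}
```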
adityamaru added a commit to adityamaru/cockroach that referenced this issue Feb 1, 2023
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a
5 hour roachtest. The stacks indicated that we may have seen
a context cancellation that was not being respected by the
goroutine running `generateAndSendImportSpans`. This resulted
in the `generative_split_and_scatter_processor` getting stuck writing
to a channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516)
since the other goroutines
in the processor had seen the ctx cancellation and exited.
A side effect of the generative processor not shutting down
was that the downstream restore data processors would also
hang on their call to `input.Next()` as they would not receive
a row or a meta from the generative processor signalling them
to shut down. This fix adds a ctx cancellation check to the
goroutine described above, thereby allowing a
graceful teardown of the flow.

This fix also adds the JobID to the generative processor spec
so that logs on remote nodes are correctly tagged with the JobID
making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could
hang indefinitely in the face of a context cancellation, manifesting
as a stuck restore job.
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ b59fa09d971e92b8b5bcc874437159ef252e0313:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1110).teardownTest: test timed out (5h0m0s)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

adityamaru added a commit to adityamaru/cockroach that referenced this issue Feb 2, 2023
craig bot pushed a commit that referenced this issue Feb 2, 2023
95904: roachtest: normalize versions in multitenant-upgrade roachtest r=ajstorm a=healthy-pod

If the multitenant-upgrade roachtest uses a mix
of release/non-release binaries, it may be using versions
that are technically the same but fail to confirm that, because
versions in test binaries are incremented by 1M.

This code change fixes the issue by normalizing versions before
comparing them.
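As an illustration of the normalization idea, here is a toy sketch; the real change operates on cluster versions, and which component carries the 1M dev offset is an assumption here.

```go
package versionsketch

// devOffset is the assumed 1M bump applied to versions in test/non-release
// binaries, per the description above.
const devOffset = 1_000_000

type version struct {
	Major, Minor int
}

// normalize strips the dev offset so that a test-binary version and the
// corresponding release version compare as equal.
func normalize(v version) version {
	if v.Major >= devOffset {
		v.Major -= devOffset
	}
	return v
}

// sameVersion reports whether two versions are "technically the same"
// once both are normalized.
func sameVersion(a, b version) bool {
	return normalize(a) == normalize(b)
}
```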

Closes #95648

Epic: none
Release note: None

95911: kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus r=arulajmani a=nvanbenschoten

This PR contains a sequence of three commits that combine to resolve #94728.

### check txn push marker on commit, not txn record creation

The first commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate successful `PushTxn(TIMESTAMP)` to a pushee with an existing record without rewriting its transaction record.

For details, see the changes to the state machine diagram attached to `Replica.CanCreateTxnRecord` for a visual depiction of this change.

### always promote PUSH_TIMESTAMP to PUSH_ABORT on failed staging record 

The second commit simplifies logic in PushTxnRequest that promoted a `PUSH_TIMESTAMP` to a `PUSH_ABORT` when it found a STAGING transaction record that it knew to be part of a failed parallel commit attempt.

The logic tried to be smart and minimize the cases where it needed to promote a `PUSH_TIMESTAMP` to a `PUSH_ABORT`. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where a `PUSH_TIMESTAMP` would succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.

### don't rewrite txn record on PushTxn(TIMESTAMP)

With the previous two commits, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. The final commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.
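As a conceptual toy only (not the real kvserver data structures or API), the change can be pictured as the pusher leaving a floor in a cache that the pushee consults at commit time, instead of rewriting the pushee's record through Raft:

```go
package pushsketch

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// tsCache is a toy model of the timestamp cache entries used to record a
// successful PushTxn(PUSH_TIMESTAMP), keyed here by transaction ID.
type tsCache struct {
	minCommitTS map[string]Timestamp
}

func newTSCache() *tsCache {
	return &tsCache{minCommitTS: map[string]Timestamp{}}
}

// pushTimestamp is the pusher's side: rather than rewriting the pushee's
// transaction record (which would require consensus), it records a floor
// on the pushee's commit timestamp.
func (c *tsCache) pushTimestamp(txnID string, ts Timestamp) {
	if ts > c.minCommitTS[txnID] {
		c.minCommitTS[txnID] = ts
	}
}

// commitTimestamp is the pushee's side at commit time: it forwards its
// provisional commit timestamp past any recorded push, detecting the push
// at commit rather than at record-creation time.
func (c *tsCache) commitTimestamp(txnID string, provisional Timestamp) Timestamp {
	if floor, ok := c.minCommitTS[txnID]; ok && floor > provisional {
		return floor
	}
	return provisional
}
```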

Release note: None

96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu

This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in declarative schema changer.

Supported constraints include Checks, FK, and UniqueWithoutIndex.

Dropping PK or Unique constraints will fall back to the legacy schema changer, which in turn spits out a "not supported yet" error.

Epic: None

96241: sql: use RLock in connExecutor.CancelQuery and connExecutor.CancelActiveQueries r=rafiss,yuzefovich a=ecwall

Fixes #95994

`connExecutor.CancelQuery` and `connExecutor.CancelActiveQueries` do not modify `mu.ActiveQueries` or the `*queryMetas` inside so they can safely use `RLock` instead of `Lock`.
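A minimal sketch of the locking pattern with hypothetical types (not the actual connExecutor code): cancellation only reads the map and invokes a stored cancel function, so a shared read lock is sufficient and concurrent cancellations no longer serialize on the mutex.

```go
package locksketch

import "sync"

// queryMeta is a hypothetical stand-in for the per-query metadata.
type queryMeta struct {
	cancel func() // cancels the query's context
}

// registry is a toy stand-in for the connExecutor state guarded by an RWMutex.
type registry struct {
	mu struct {
		sync.RWMutex
		ActiveQueries map[string]*queryMeta
	}
}

// CancelQuery only reads ActiveQueries and calls a stored cancel func, so a
// read lock (RLock) is enough; it never mutates the map or the *queryMeta.
func (r *registry) CancelQuery(id string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	q, ok := r.mu.ActiveQueries[id]
	if !ok {
		return false
	}
	q.cancel()
	return true
}
```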

Release note: None

96273: sql/schemachanger: forward fit compatibility changes for 22.2 rules r=fqazi a=fqazi

Informs: #95849

Previously, some constraint-related rules in the 22.2 set incorrectly used logic for 23.X. This patch addresses those to get compatibility back. Additionally, it includes some minor clean-up in rules-related helpers to ensure proper compatibility.

With this change, a manual diff shows both branches are now equal in terms of rules (outside of renames). A roachtest will be coming soon to assert this.

Epic: none
Release note: None

96302: backupccl: add missing context cancel checks to restore r=stevendanna a=adityamaru

In #95257 we saw a restore grind to a halt 2 hours into a 5 hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516) since the other goroutines
in the processor had seen the ctx cancellation and exited. A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()` as they would not receive a row or a meta from the generative processor signalling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow.

Informs: #95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.

Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: adityamaru <[email protected]>
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 8: dead (exit status 7)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 7, 2023
backupccl: add missing context cancel checks in gen split scatter processor

Add the rest of the missing context cancel checks in restore's
generativeSplitAndScatterProcessor. Add a red/green test to show that
runGenerativeSplitAndScatter is interrupted if its supplied context is
canceled.

Fixes: cockroachdb#95257

Release note: None
craig bot pushed a commit that referenced this issue Feb 8, 2023
96257: kvserver: log if lease applies with a delay r=erikgrinaker a=tbg

When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.

This warning makes it fairly obvious when it happens.

> W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165  lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]

Addresses #95991.

Epic: none
Release note: None


96529: backupccl: add missing context cancel checks in gen split scatter processor r=rhu713 a=rhu713

Add the rest of the missing context cancel checks in restore's generativeSplitAndScatterProcessor. Add a red/green test to show that runGenerativeSplitAndScatter is interrupted if its supplied context is canceled.

Fixes: #95257

Release note: None
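A sketch of what such a red/green test can look like (hypothetical names, not the actual test): against the unfixed code the goroutine never returns and the test times out ("red"); with the cancellation checks in place it returns promptly ("green").

```go
package restoresketch

import (
	"context"
	"errors"
	"testing"
	"time"
)

// runSketch stands in for runGenerativeSplitAndScatter: it loops doing work
// until its context is canceled. The real function and its arguments differ.
func runSketch(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond):
			// pretend to split/scatter one span
		}
	}
}

// TestCanceledContextInterruptsRun asserts that canceling the supplied
// context interrupts the run instead of leaving it hung.
func TestCanceledContextInterruptsRun(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	go func() { errCh <- runSketch(ctx) }()

	cancel()
	select {
	case err := <-errCh:
		if !errors.Is(err, context.Canceled) {
			t.Fatalf("expected context.Canceled, got %v", err)
		}
	case <-time.After(5 * time.Second):
		t.Fatal("run did not return after context cancellation")
	}
}
```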

96796: sql: remove sql obs V22_2 gates r=maryliag a=maryliag

With the min version bumped to 22.2, it's safe
to delete version gates for 22.2.

Fixes #96758

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: maryliag <[email protected]>
craig bot closed this as completed in d91fc4b Feb 8, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023