
roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #95257

Closed
cockroach-teamcity opened this issue Jan 14, 2023 · 7 comments · Fixed by #96529
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 14, 2023

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 8866ecadd99e5f6dc8db9fed5d09bc57cd90b03c:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 7)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-23426

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jan 14, 2023
@adityamaru
Contributor

The node was killed because of:

disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r4050/i31.t6› in 21.10 seconds

Now looking into the buildup to this fatal error.
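For context on this failure mode: Pebble treats a write that takes longer than a configured maximum as a disk stall and fatals the node rather than limping along, which is what the monitor then reports as "dead (exit status 7)". The sketch below only illustrates that watchdog shape; it is not Pebble's actual implementation, and the 20s threshold, package, and function name are assumptions.

```go
package stallsketch

import (
	"fmt"
	"os"
	"time"
)

// maxSyncDuration is an assumed stall threshold; the real limit is a
// storage/Pebble option, not a hard-coded constant.
const maxSyncDuration = 20 * time.Second

// writeWithStallDetection wraps a file write with a watchdog: if the write
// has not completed within maxSyncDuration, it logs a "disk stall detected"
// message and terminates the process.
func writeWithStallDetection(f *os.File, p []byte) error {
	done := make(chan error, 1)
	start := time.Now()
	go func() {
		_, err := f.Write(p) // the potentially stalled I/O
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		fmt.Fprintf(os.Stderr,
			"disk stall detected: unable to write to %s in %.2f seconds\n",
			f.Name(), time.Since(start).Seconds())
		os.Exit(7) // mirror the fatal exit status seen by the roachtest monitor
		return nil // unreachable
	}
}
```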

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 2ad8df3df3272110705984efc32f1453631ce602:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1079).teardownTest: test timed out (5h0m0s)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 78fe59d0536be35fcec68280142746fb91515551:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: read tcp 172.17.0.3:38496 -> 18.216.146.59:26257: read: connection reset by peer

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@adityamaru
Contributor

I'm going to focus first on the timeout failure mode in this issue - #95257 (comment).

The test started at 16:18:50 and timed out at 21:18:50. Loading the tsdump, we see that we stopped sending AddSSTable requests at ~18:30.

[Screenshot attached: tsdump graph, captured 2023-01-30 at 2:07:31 PM]

At the timestamp at which we stopped sending AddSSTable requests we had ingested:

diskusage: 18:49:43 restore.go:230: n#6: 405 GiB, n#4: 399 GiB, n#1: 380 GiB, n#5: 366 GiB, n#9: 364 GiB, n#7: 321 GiB, n#8: 302 GiB, n#2: 294 GiB, n#3: 270 GiB, n#10: 262 GiB

which, compared with another nightly run, is far behind what we usually ingest by this timestamp:

diskusage: 18:49:23 restore.go:230: n#3: 675 GiB, n#7: 654 GiB, n#1: 651 GiB, n#5: 642 GiB, n#6: 642 GiB, n#9: 630 GiB, n#8: 622 GiB, n#2: 613 GiB, n#10: 586 GiB, n#4: 566 GiB

This made me suspect a slow disk or a disk stall, but neither storage.disk-slow nor storage.disk-stall has a non-zero value while we were ingesting. The fact that there are zero AddSSTable requests makes me think there is a deadlock of sorts that has ground everything to a halt. I'll now look at the stacks to see what they reveal.

@adityamaru
Contributor

Digging some more, we see that all nodes have restore_data_processor stacks that are stuck waiting for entries from the upstream split-and-scatter processors.

In the newly added generative_split_and_scatter_processor we see a goroutine stuck waiting on the doneScatterCh for 298 minutes! This is basically the duration of the whole test, so how did the restore make any progress? My hunch is that, since this is a buffered channel, we pushed some work into it that powered the ingestion for the next two hours. Before generating all the spans, however, something happened in this goroutine at

`case doneScatterCh <- scatteredEntry:`

that caused us to stop pushing data into the channel. I also see no stack for

`g.GoCtx(func(ctx context.Context) error {`

or

`for worker := 0; worker < splitScatterWorkers; worker++ {`

which makes me think that the goroutine in g2 exited and we closed `importSpanChunkCh`, but we didn't tear down the other goroutines in the `g` wait group. My guess is we saw a context cancellation that didn't propagate everywhere it should have. I can see a few places that need fixing up, so I will send out a patch.
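To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the pattern; the names echo the stacks above, but the code is a simplification, not the actual processor.

```go
package main

import (
	"context"
	"time"
)

type scatteredEntry struct{ spanID int }

// producer mimics the suspected bug: a bare channel send with no ctx.Done()
// case, so once the consumer stops reading, the send blocks forever even
// though the context (which this goroutine ignores) has been canceled.
func producer(ctx context.Context, doneScatterCh chan scatteredEntry) {
	for i := 0; ; i++ {
		doneScatterCh <- scatteredEntry{spanID: i} // blocks once the buffer fills
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	doneScatterCh := make(chan scatteredEntry, 4) // buffered: early sends succeed

	go producer(ctx, doneScatterCh)
	<-doneScatterCh // the consumer reads a little work, powering some progress...
	cancel()
	// ...then exits on cancellation without draining the channel. The producer
	// is now stuck on its send, much like the 298-minute goroutine stack above.
	time.Sleep(100 * time.Millisecond)
}
```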

adityamaru added a commit to adityamaru/cockroach that referenced this issue Jan 31, 2023
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a
5 hour roachtest. The stacks indicated that we may have seen
a context cancellation that was not being respected by the
goroutine running `generateAndSendImportSpans`. This resulted
in the `generative_split_and_scatter_processor` getting stuck writing
to a channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516)
since the other goroutines
in the processor had seen the ctx cancellation and exited.
A side effect of the generative processor not shutting down
was that the downstream restore data processors would also
hang on their call to `input.Next()` as they would not receive
a row or a meta from the generative processor signalling them
to shut down. This fix adds a ctx cancellation check to the
goroutine described above, thereby allowing a
graceful teardown of the flow.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could
hang indefinitely in the face of a context cancellation, manifesting
as a stuck restore job.
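The shape of the fix is, roughly, to make the blocking send cancellation-aware. A minimal sketch with a hypothetical helper (not the actual patch):

```go
package restoresketch

import "context"

// scatteredEntry stands in for the unit of work pushed onto doneScatterCh.
type scatteredEntry struct{ spanID int }

// sendScattered is a cancellation-aware replacement for a bare
// `doneScatterCh <- entry`: the send races against ctx.Done(), so the
// goroutine unblocks and returns an error when the flow is canceled,
// letting the errgroup tear everything down gracefully.
func sendScattered(ctx context.Context, doneScatterCh chan<- scatteredEntry, e scatteredEntry) error {
	select {
	case doneScatterCh <- e:
		return nil
	case <-ctx.Done():
		return ctx.Err() // propagate the cancellation to the caller
	}
}
```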
adityamaru added a commit to adityamaru/cockroach that referenced this issue Feb 1, 2023
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a
5 hour roachtest. The stacks indicated that we may have seen
a context cancellation that was not being respected by the
goroutine running `generateAndSendImportSpans`. This resulted
in the `generative_split_and_scatter_processor` getting stuck writing
to a channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516)
since the other goroutines
in the processor had seen the ctx cancellation and exited.
A side effect of the generative processor not shutting down
was that the downstream restore data processors would also
hang on their call to `input.Next()` as they would not receive
a row or a meta from the generative processor signalling them
to shut down. This fix adds a ctx cancellation check to the
goroutine described above, thereby allowing a
graceful teardown of the flow.

This fix also adds the JobID to the generative processor spec
so that logs on remote nodes are correctly tagged with the JobID
making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could
hang indefinitely in the face of a context cancellation, manifesting
as a stuck restore job.
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ b59fa09d971e92b8b5bcc874437159ef252e0313:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(test_runner.go:1110).teardownTest: test timed out (5h0m0s)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

adityamaru added a commit to adityamaru/cockroach that referenced this issue Feb 2, 2023
craig bot pushed a commit that referenced this issue Feb 2, 2023
95904: roachtest: normalize versions in multitenant-upgrade roachtest r=ajstorm a=healthy-pod

If the multitenant-upgrade roachtest uses a mix
of release/non-release binaries, it may be using versions
that are technically the same but fail to confirm that, because
versions in test binaries are incremented by 1M.

This code change fixes the issue by normalizing versions before
comparing them.
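As an illustration of the normalization idea, here is a toy sketch; the real change operates on cluster versions, and which component carries the 1M dev offset is an assumption here.

```go
package versionsketch

// devOffset is the assumed 1M bump applied to versions in test/non-release
// binaries, per the description above.
const devOffset = 1_000_000

type version struct {
	Major, Minor int
}

// normalize strips the dev offset so that a test-binary version and the
// corresponding release version compare as equal.
func normalize(v version) version {
	if v.Major >= devOffset {
		v.Major -= devOffset
	}
	return v
}

// sameVersion reports whether two versions are "technically the same"
// once both are normalized.
func sameVersion(a, b version) bool {
	return normalize(a) == normalize(b)
}
```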

Closes #95648

Epic: none
Release note: None

95911: kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus r=arulajmani a=nvanbenschoten

This PR contains a sequence of three commits that combine to resolve #94728.

### check txn push marker on commit, not txn record creation

The first commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate successful `PushTxn(TIMESTAMP)` to a pushee with an existing record without rewriting its transaction record.

For details, see the changes to the state machine diagram attached to `Replica.CanCreateTxnRecord` for a visual depiction of this change.

### always promote PUSH_TIMESTAMP to PUSH_ABORT on failed staging record 

The second commit simplifies logic in PushTxnRequest that promoted a `PUSH_TIMESTAMP` to a `PUSH_ABORT` when it found a STAGING transaction record that it knew to be part of a failed parallel commit attempt.

The logic tried to be smart and minimize the cases where it needed to promote a `PUSH_TIMESTAMP` to a `PUSH_ABORT`. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where a `PUSH_TIMESTAMP` would succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.

### don't rewrite txn record on PushTxn(TIMESTAMP)

With the previous two commits, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. The final commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.
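As a conceptual toy only (not the real kvserver data structures or API), the change can be pictured as the pusher leaving a floor in a cache that the pushee consults at commit time, instead of rewriting the pushee's record through Raft:

```go
package pushsketch

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// tsCache is a toy model of the timestamp cache entries used to record a
// successful PushTxn(PUSH_TIMESTAMP), keyed here by transaction ID.
type tsCache struct {
	minCommitTS map[string]Timestamp
}

func newTSCache() *tsCache {
	return &tsCache{minCommitTS: map[string]Timestamp{}}
}

// pushTimestamp is the pusher's side: rather than rewriting the pushee's
// transaction record (which would require consensus), it records a floor
// on the pushee's commit timestamp.
func (c *tsCache) pushTimestamp(txnID string, ts Timestamp) {
	if ts > c.minCommitTS[txnID] {
		c.minCommitTS[txnID] = ts
	}
}

// commitTimestamp is the pushee's side at commit time: it forwards its
// provisional commit timestamp past any recorded push, detecting the push
// at commit rather than at record-creation time.
func (c *tsCache) commitTimestamp(txnID string, provisional Timestamp) Timestamp {
	if floor, ok := c.minCommitTS[txnID]; ok && floor > provisional {
		return floor
	}
	return provisional
}
```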

Release note: None

96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu

This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in declarative schema changer.

Supported constraints include Checks, FK, and UniqueWithoutIndex.

Dropping PK or Unique constraints will fall back to the legacy schema changer, which in turn spits out a "not supported yet" error.

Epic: None

96241: sql: use RLock in connExecutor.CancelQuery and connExecutor.CancelActiveQueries r=rafiss,yuzefovich a=ecwall

Fixes #95994

`connExecutor.CancelQuery` and `connExecutor.CancelActiveQueries` do not modify `mu.ActiveQueries` or the `*queryMetas` inside so they can safely use `RLock` instead of `Lock`.
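A minimal sketch of the locking pattern with hypothetical types (not the actual connExecutor code): cancellation only reads the map and invokes a stored cancel function, so a shared read lock is sufficient and concurrent cancellations no longer serialize on the mutex.

```go
package locksketch

import "sync"

// queryMeta is a hypothetical stand-in for the per-query metadata.
type queryMeta struct {
	cancel func() // cancels the query's context
}

// registry is a toy stand-in for the connExecutor state guarded by an RWMutex.
type registry struct {
	mu struct {
		sync.RWMutex
		ActiveQueries map[string]*queryMeta
	}
}

// CancelQuery only reads ActiveQueries and calls a stored cancel func, so a
// read lock (RLock) is enough; it never mutates the map or the *queryMeta.
func (r *registry) CancelQuery(id string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	q, ok := r.mu.ActiveQueries[id]
	if !ok {
		return false
	}
	q.cancel()
	return true
}
```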

Release note: None

96273: sql/schemachanger: forward fit compatibility changes for 22.2 rules r=fqazi a=fqazi

Informs: #95849

Previously, some constraint-related rules in the 22.2 set incorrectly used logic for 23.X. This patch addresses those to get compatibility back. Additionally, it includes some minor clean-up in rules-related helpers to ensure proper compatibility.

With this change, a manual diff shows both branches are now equal in terms of rules (outside of renames). A roachtest will be coming soon to assert this.

Epic: none
Release note: None

96302: backupccl: add missing context cancel checks to restore r=stevendanna a=adityamaru

In #95257 we saw a restore grind to a halt 2 hours into a 5 hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516) since the other goroutines
in the processor had seen the ctx cancellation and exited. A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()` as they would not receive a row or a meta from the generative processor signalling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow.

Informs: #95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.

Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: adityamaru <[email protected]>
@cockroach-teamcity
Member Author

roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:

test artifacts and logs in: /artifacts/restore/tpce/8TB/aws/nodes=10/cpus=8/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 8: dead (exit status 7)

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 7, 2023
backupccl: add missing context cancel checks in gen split scatter processor

Add the rest of the missing context cancel checks in restore's
generativeSplitAndScatterProcessor. Add a red/green test to show that
runGenerativeSplitAndScatter is interrupted if its supplied context is
canceled.

Fixes: cockroachdb#95257

Release note: None
craig bot pushed a commit that referenced this issue Feb 8, 2023
96257: kvserver: log if lease applies with a delay r=erikgrinaker a=tbg

When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.

This warning makes it fairly obvious when it happens.

> W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165  lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]

Addresses #95991.

Epic: none
Release note: None


96529: backupccl: add missing context cancel checks in gen split scatter processor r=rhu713 a=rhu713

Add the rest of the missing context cancel checks in restore's generativeSplitAndScatterProcessor. Add a red/green test to show that runGenerativeSplitAndScatter is interrupted if its supplied context is canceled.

Fixes: #95257

Release note: None
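A sketch of what such a red/green test can look like (hypothetical names, not the actual test): against the unfixed code the goroutine never returns and the test times out ("red"); with the cancellation checks in place it returns promptly ("green").

```go
package restoresketch

import (
	"context"
	"errors"
	"testing"
	"time"
)

// runSketch stands in for runGenerativeSplitAndScatter: it loops doing work
// until its context is canceled. The real function and its arguments differ.
func runSketch(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Millisecond):
			// pretend to split/scatter one span
		}
	}
}

// TestCanceledContextInterruptsRun asserts that canceling the supplied
// context interrupts the run instead of leaving it hung.
func TestCanceledContextInterruptsRun(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	go func() { errCh <- runSketch(ctx) }()

	cancel()
	select {
	case err := <-errCh:
		if !errors.Is(err, context.Canceled) {
			t.Fatalf("expected context.Canceled, got %v", err)
		}
	case <-time.After(5 * time.Second):
		t.Fatal("run did not return after context cancellation")
	}
}
```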

96796: sql: remove sql obs V22_2 gates r=maryliag a=maryliag

With the min version bumped to 22.2, it's safe
to delete version gates for 22.2.

Fixes #96758

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: maryliag <[email protected]>
craig bot closed this as completed in d91fc4b Feb 8, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 15, 2023