roachtest: restore/tpce/8TB/aws/nodes=10/cpus=8 failed #95257
Comments
The node was killed because of: |
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 2ad8df3df3272110705984efc32f1453631ce602:
Parameters: |
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 78fe59d0536be35fcec68280142746fb91515551:
Parameters: |
I'm going to first focus on the timeout mode of failure in this issue - #95257 (comment). The test started at 16:18:50 and timed out at 21:18:50. Loading the tsdump, we see that we stopped sending AddSSTable requests at ~18:30. At the timestamp we stopped sending AddSSTable requests we had ingested: This made me suspect a slow disk or a disk stall, but none of |
Digging some more, we see all nodes have In the newly added
g2 exited, and we closed `importSpanChunkCh`, but we didn't tear down the other goroutines in the `g` wait group. My guess is we saw a context cancellation that didn't propagate everywhere it should have. I can see a few places that need fixing up, so I will send out a patch.
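A minimal, self-contained Go sketch of the failure mode described above (illustrative names only, not the actual restore code): a producer goroutine blocks forever on a channel send once its consumer has exited, unless the send also selects on `ctx.Done()`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// producer stands in for the goroutine feeding a span-chunk channel: without
// the ctx.Done() case, the send blocks forever once the consumer has exited.
func producer(ctx context.Context, out chan<- int) error {
	for i := 0; ; i++ {
		select {
		case out <- i: // the consumer may already be gone
		case <-ctx.Done(): // the check that was missing in the bug described above
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	ch := make(chan int) // unbuffered hand-off channel

	errCh := make(chan error, 1)
	go func() { errCh <- producer(ctx, ch) }()

	// The consumer reads a few values, then reacts to the cancellation and
	// stops reading, like the other goroutines in the processor's wait group.
	for i := 0; i < 3; i++ {
		fmt.Println("consumed", <-ch)
	}
	cancel()

	select {
	case err := <-errCh:
		fmt.Println("producer exited:", err) // graceful teardown
	case <-time.After(time.Second):
		fmt.Println("producer is stuck") // what the stacks in this issue showed
	}
}
```

If the `ctx.Done()` case is removed from the producer's select, the final select prints "producer is stuck", which is essentially what the goroutine stacks showed.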
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a 5-hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516), since the other goroutines in the processor had seen the ctx cancellation and exited.

A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would not receive a row or a meta from the generative processor signalling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.
In cockroachdb#95257 we saw a restore grind to a halt 2 hours into a 5-hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516), since the other goroutines in the processor had seen the ctx cancellation and exited.

A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would not receive a row or a meta from the generative processor signalling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow. This fix also adds the JobID to the generative processor spec so that logs on remote nodes are correctly tagged with the JobID, making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.
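For context, here is a hedged sketch of the teardown structure the commit message describes, using `golang.org/x/sync/errgroup` (roughly analogous to CockroachDB's `ctxgroup` helper); `runFlow` and `spanCh` are made-up names, not the real processor code. Because every channel operation also selects on the group's `ctx.Done()`, cancelling the flow unwinds the generator and all workers instead of leaving the generator blocked on a send.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// runFlow wires up one generator goroutine (standing in for
// generateAndSendImportSpans) and a few worker goroutines, all of which exit
// once the shared context is canceled.
func runFlow(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	spanCh := make(chan string)

	// Generator: produces "spans" until the context is canceled.
	g.Go(func() error {
		defer close(spanCh)
		for i := 0; ; i++ {
			span := fmt.Sprintf("span-%d", i)
			select {
			case spanCh <- span:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	})

	// Workers: consume spans, but also watch for cancellation.
	for w := 0; w < 3; w++ {
		g.Go(func() error {
			for {
				select {
				case s, ok := <-spanCh:
					if !ok {
						return nil
					}
					_ = s // process the span
				case <-ctx.Done():
					return ctx.Err()
				}
			}
		})
	}

	return g.Wait()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the job being cancelled
	fmt.Println("flow exited with:", runFlow(ctx))
}
```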
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ b59fa09d971e92b8b5bcc874437159ef252e0313:
Parameters: |
95904: roachtest: normalize versions in multitenant-upgrade roachtest r=ajstorm a=healthy-pod

If the multitenant-upgrade roachtest uses a mix of release/non-release binaries, it may be using versions that are technically the same but fail to confirm that because versions in test binaries are incremented by 1M. This code change fixes the issue by normalizing versions before comparing them.

Closes #95648
Epic: none
Release note: None

95911: kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus r=arulajmani a=nvanbenschoten

This PR contains a sequence of three commits that combine to resolve #94728.

### check txn push marker on commit, not txn record creation

The first commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate a successful `PushTxn(TIMESTAMP)` to a pushee with an existing record without rewriting its transaction record. For details, see the changes to the state machine diagram attached to `Replica.CanCreateTxnRecord` for a visual depiction of this change.

### always promote PUSH_TIMESTAMP to PUSH_ABORT on failed staging record

The second commit simplifies logic in PushTxnRequest that promoted a `PUSH_TIMESTAMP` to a `PUSH_ABORT` when it found a STAGING transaction record that it knew to be part of a failed parallel commit attempt. The logic tried to be smart and minimize the cases where it needed to promote a `PUSH_TIMESTAMP` to a `PUSH_ABORT`. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where a `PUSH_TIMESTAMP` would succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.

### don't rewrite txn record on PushTxn(TIMESTAMP)

With the previous two commits, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. The final commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.

Release note: None

96115: schemachanger: Implement `DROP CONSTRAINT` in declarative schema changer r=Xiang-Gu a=Xiang-Gu

This PR implements `ALTER TABLE t DROP CONSTRAINT cons_name` in the declarative schema changer. Supported constraints include Checks, FK, and UniqueWithoutIndex. Dropping PK or Unique constraints will fall back to the legacy schema changer, which in turn spits out a "not supported yet" error.

Epic: None

96241: sql: use RLock in connExecutor.CancelQuery and connExecutor.CancelActiveQueries r=rafiss,yuzefovich a=ecwall

Fixes #95994

`connExecutor.CancelQuery` and `connExecutor.CancelActiveQueries` do not modify `mu.ActiveQueries` or the `*queryMetas` inside, so they can safely use `RLock` instead of `Lock`.

Release note: None

96273: sql/schemachanger: forward fit compatibility changes for 22.2 rules r=fqazi a=fqazi

Informs: #95849

Previously, some constraint-related rules in the 22.2 set incorrectly used logic for 23.X. This patch addresses those to get compatibility back. Additionally, some minor clean-up in rules-related helpers to ensure proper compatibility. With this change, a manual diff shows both branches are now equal in terms of rules (outside of renames). A roachtest will be coming soon to assert this.

Epic: none
Release note: None

96302: backupccl: add missing context cancel checks to restore r=stevendanna a=adityamaru

In #95257 we saw a restore grind to a halt 2 hours into a 5-hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516), since the other goroutines in the processor had seen the ctx cancellation and exited. A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their call to `input.Next()`, as they would not receive a row or a meta from the generative processor signalling them to shut down. This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow.

Informs: #95257

Release note (bug fix): fixes a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.

Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: adityamaru <[email protected]>
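PR 96241 above swaps an exclusive `Lock` for a shared `RLock` on read-only accesses. A minimal sketch of that pattern with hypothetical names (not the actual `connExecutor` fields):

```go
package main

import (
	"fmt"
	"sync"
)

// queryMeta stands in for the per-query metadata guarded by the mutex.
type queryMeta struct {
	sql string
}

type connExecutor struct {
	mu struct {
		sync.RWMutex
		activeQueries map[uint64]*queryMeta
	}
}

// cancelQuery only reads the map, so a shared read lock is enough; concurrent
// cancellation lookups no longer serialize behind one another.
func (ex *connExecutor) cancelQuery(id uint64) bool {
	ex.mu.RLock()
	defer ex.mu.RUnlock()
	_, ok := ex.mu.activeQueries[id]
	return ok
}

// registerQuery mutates the map and therefore still needs the write lock.
func (ex *connExecutor) registerQuery(id uint64, sql string) {
	ex.mu.Lock()
	defer ex.mu.Unlock()
	ex.mu.activeQueries[id] = &queryMeta{sql: sql}
}

func main() {
	ex := &connExecutor{}
	ex.mu.activeQueries = make(map[uint64]*queryMeta)
	ex.registerQuery(1, "SELECT 1")
	fmt.Println(ex.cancelQuery(1), ex.cancelQuery(2)) // true false
}
```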
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:
Parameters: |
backupccl: add missing context cancel checks in gen split scatter processor

Add the rest of the missing context cancel checks in restore's generativeSplitAndScatterProcessor. Add a red/green test to show that runGenerativeSplitAndScatter is interrupted if its supplied context is canceled.

Fixes: cockroachdb#95257
Release note: None
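A generic sketch of the red/green test shape described above, with a hypothetical `runProcessor` standing in for `runGenerativeSplitAndScatter` (this is not the actual backupccl test):

```go
package cancelcheck

import (
	"context"
	"errors"
	"testing"
	"time"
)

// runProcessor is a stand-in for the processor under test: it must return
// promptly once ctx is canceled instead of blocking on its output channel.
func runProcessor(ctx context.Context, out chan<- int) error {
	for i := 0; ; i++ {
		select {
		case out <- i:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// TestProcessorRespectsCancellation is the "green" direction: with the ctx
// check in place, the processor exits within the timeout. Removing the
// ctx.Done() case above turns it "red" (the test times out waiting on done).
func TestProcessorRespectsCancellation(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // nobody ever reads from it, as in the hang

	done := make(chan error, 1)
	go func() { done <- runProcessor(ctx, out) }()

	cancel()
	select {
	case err := <-done:
		if !errors.Is(err, context.Canceled) {
			t.Fatalf("expected context.Canceled, got %v", err)
		}
	case <-time.After(5 * time.Second):
		t.Fatal("processor did not exit after context cancellation")
	}
}
```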
96257: kvserver: log if lease applies with a delay r=erikgrinaker a=tbg

When we transfer a lease to a lagging follower, there's often a latency blip that we get asked to investigate. This is time consuming; it's often very subtle to even figure out that it happened. We try to be better about not doing it, but at least on 22.1 we know it's possible, and we can't backport the rather involved fixes. This warning makes it fairly obvious when it happens.

> W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165 lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]

Addresses #95991.
Epic: none
Release note: None

96529: backupccl: add missing context cancel checks in gen split scatter processor r=rhu713 a=rhu713

Add the rest of the missing context cancel checks in restore's generativeSplitAndScatterProcessor. Add a red/green test to show that runGenerativeSplitAndScatter is interrupted if its supplied context is canceled.

Fixes: #95257
Release note: None

96796: sql: remove sql obs V22_2 gates r=maryliag a=maryliag

With the min version bumped to 22.2, it's safe to delete version gates for 22.2.

Fixes #96758
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: maryliag <[email protected]>
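As a rough illustration of the idea in PR 96257 (using only the standard-library `log` package; the threshold, function name, and fields are made up, not CockroachDB's actual logging code):

```go
package main

import (
	"log"
	"time"
)

// maybeWarnOnSlowLeaseApplication mirrors the idea in the commit: if the time
// between when a lease transfer was proposed and when it applied on this
// replica exceeds a threshold, emit a warning so latency blips are easy to
// attribute after the fact.
func maybeWarnOnSlowLeaseApplication(rangeID int64, proposedAt, appliedAt time.Time) {
	const threshold = 500 * time.Millisecond // hypothetical cutoff
	if lag := appliedAt.Sub(proposedAt); lag > threshold {
		log.Printf(
			"WARN: r%d lease applied after replication lag of ~%.2fs; foreground traffic may have been impacted",
			rangeID, lag.Seconds(),
		)
	}
}

func main() {
	proposed := time.Now().Add(-580 * time.Millisecond)
	maybeWarnOnSlowLeaseApplication(23, proposed, time.Now())
}
```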
roachtest.restore/tpce/8TB/aws/nodes=10/cpus=8 failed with artifacts on master @ 8866ecadd99e5f6dc8db9fed5d09bc57cd90b03c:
Parameters:
ROACHTEST_cloud=aws
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Jira issue: CRDB-23426