roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695
@andreimatei looks like #34341. Does the previous log line: Should we keep this issue open or fold it into #34341? |
SHA: https://github.com/cockroachdb/cockroach/commits/4e2e450ecf5c3f73c7de46e48696528468803db7 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1128035&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/70cd0458348655fb74f4e2cb89aa202b5dd89ed0 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1129807&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/40e403544d60b1a44b8b1ed961a817c77d67aa31 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1131411&tab=buildLog
|
Seems to be happening with some regularity now @andreimatei. |
looking |
I've been able to reproduce the |
So here's what I do know: The I've gotten a repro before #34387 too. In that case, what we see is that a leaf sends a
and seemingly afterwards the root prints this:
I don't understand how the "client rejected" message happens after the leaf has already run a flow... To be continued. |
SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog
|
Like clockwork.
|
I think I got the sucker. I believe the crash happens when there's a race between two retriable errors, out of which exactly one is a Now figuring out what to do about it. |
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
|
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, well, unfortunately we are not well equipped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since cockroachdb#35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and I believe that has fixed the majority of crashes that we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also quite possible.)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105, but this patch makes it more convincing.)

Release note: None
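To make the failure mode concrete, here is a minimal, hedged Go sketch of the guard the commit message describes. The names (TxnStatus, MetaRecord, coordinator, ingestMeta) are illustrative stand-ins, not the actual TxnCoordSender API: the coordinator simply refuses to ingest a record whose proto is ABORTED, so it never ends up holding an aborted proto while its heartbeat loop is still running.

```go
package main

import "fmt"

// TxnStatus mirrors the idea of a transaction status in the proto.
// These names are illustrative, not the actual CockroachDB definitions.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Aborted
)

// MetaRecord stands in for the trailing TxnCoordMeta that a remote
// DistSQL flow sends back to the root transaction coordinator.
type MetaRecord struct {
	Status    TxnStatus
	ReadSpans []string
}

// coordinator is a toy stand-in for the root TxnCoordSender state.
type coordinator struct {
	status    TxnStatus
	readSpans []string
}

// ingestMeta sketches the behavior described above: refuse to ingest a
// record carrying an ABORTED proto, because doing so would leave the
// coordinator holding an aborted proto while its heartbeat loop is still
// running (the inconsistent state the assertions guard against).
func (c *coordinator) ingestMeta(m MetaRecord) {
	if m.Status == Aborted {
		// Drop the record; the coordinator learns about aborts through a
		// TxnAbortedError or its heartbeat loop instead.
		return
	}
	c.readSpans = append(c.readSpans, m.ReadSpans...)
}

func main() {
	c := &coordinator{status: Pending}
	c.ingestMeta(MetaRecord{Status: Pending, ReadSpans: []string{"a-b"}})
	c.ingestMeta(MetaRecord{Status: Aborted, ReadSpans: []string{"c-d"}}) // ignored
	fmt.Println(c.readSpans) // [a-b]
}
```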
SHA: https://github.com/cockroachdb/cockroach/commits/d888b76df319571afe4d5816f1a1f0f53905653f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1159641&tab=buildLog
|
Hmm I had expected the failures to go away after #35105. Will look at what happened. |
I believe the last failure is #34241 |
@andreimatei I'm trying to repro #34241 but I'm immediately getting this:
on top of this commit:
|
and again, with stack
|
@petermattis should this be fetching core dumps? Because it isn't (likely due to using |
I've probably repro'ed that 10 times out of 40 runs. @andreimatei could you prioritize this? Looks like the bug is still around (or I'm epic failing on my choice of binary, but I've triple-checked). |
We'd have to replace the |
SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168752&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9d058d53c8a82fceb2205f1827c26f1bf36c32ba Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1172386&tab=buildLog
|
Still failing with:
|
SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183678&tab=buildLog
|
^^^ Same as above. |
SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog
|
Same as the nodes=3 flavor:
@justinj is already on the case. |
Hmm, the augmentMetaLocked error doesn't repro as readily as I'd like. It took around 8 days (= 8 runs) between the last two iterations; we'll see when it shows up for me. |
This commit fixes a problem where the constraint logic could infer that a column was constant, but our method of determining what that constant value was was incomplete, so we ended up with incomplete information. We now correctly infer the value for IS expressions, and we also skip over cases where this happens again so we degrade more gracefully (don't run a zigzag join, as opposed to panicking). A more correct fix here would extract the values directly from the constraints, but I'd like to get this fixed on master to unblock cockroachdb#34695.

Release note: None
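As a rough illustration of the graceful-degradation idea above (decline to plan the zigzag join when a constant value can't be determined, rather than panicking), here is a hedged Go sketch; the constraint type and tryPlanZigzagJoin are hypothetical, not the optimizer's real data structures:

```go
package main

import "fmt"

// constraint is a toy stand-in for an optimizer constraint on a column.
// value == nil models "we know the column is constant but not what the value is".
type constraint struct {
	col   string
	value *string
}

// tryPlanZigzagJoin sketches the graceful-degradation idea: if we cannot
// determine the constant value for some column, decline to plan a zigzag
// join and let the caller fall back to another plan instead of panicking.
func tryPlanZigzagJoin(cs []constraint) (string, bool) {
	for _, c := range cs {
		if c.value == nil {
			return "", false // degrade gracefully
		}
	}
	return fmt.Sprintf("zigzag join over %d fixed columns", len(cs)), true
}

func main() {
	v := "42"
	cs := []constraint{{col: "a", value: &v}, {col: "b", value: nil}}
	if p, ok := tryPlanZigzagJoin(cs); ok {
		fmt.Println(p)
	} else {
		fmt.Println("zigzag join not planned; falling back to another plan")
	}
}
```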
Got myself a repro:
This seems to explain things, though not fully to me. Things happen in this block: cockroach/pkg/sql/distsql_running.go, lines 474 to 506 (at 1352687).
@andreimatei, WDYT? |
Great, that's exactly what #35249 theorized. And now we know what such a rogue error racing with the Btw, a lot of this crap about ingesting errors is pretty dirty. I'm also attempting to improve things in #35224. |
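For intuition, here is a hedged Go sketch of the "first error wins" behavior being discussed: a toy receiver keeps only the first error pushed into it, so a TxnAbortedError that races in behind, say, an RPC error never reaches the transaction coordinator. The receiver type and pushError are hypothetical stand-ins for the logic in pkg/sql/distsql_running.go, which also inspects whether an error is retriable before deciding what to keep.

```go
package main

import (
	"errors"
	"fmt"
)

// receiver is a toy stand-in for the DistSQLReceiver: it remembers only the
// first error it sees.
type receiver struct {
	err error
}

// pushError sketches the "first error wins" behavior: once an error has been
// recorded, any later error is dropped on the floor.
func (r *receiver) pushError(err error) {
	if r.err != nil {
		return // a concurrent error already got here; this one is ignored
	}
	r.err = err
}

func main() {
	r := &receiver{}
	r.pushError(errors.New("rpc error: connection refused")) // arrives first
	r.pushError(errors.New("TransactionAbortedError"))       // races in second, dropped
	fmt.Println(r.err) // rpc error: connection refused
}
```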
35655: exec: add multi column equality to vectorized mergejoiner r=georgeutsin a=georgeutsin

Added the ability to join on multiple columns in the vectorized merge joiner. This was done by inverting the previous method of generating groups: instead of assuming no matches and discovering matches group by group, we assume that everything is a match and filter out the rows that don't match, which lets us use multiple columns. This PR also includes some work on output shaping, to ensure that we still have decently shaped outputs, even when a batch ends on a small run. Also fixed the benchmarks in this PR.

Updated benchmarks for the mergejoiner:

```
BenchmarkMergeJoiner/rows=1024-8                      20000      67231 ns/op     974.77 MB/s
BenchmarkMergeJoiner/rows=4096-8                       5000     257487 ns/op    1018.08 MB/s
BenchmarkMergeJoiner/rows=16384-8                      2000     995437 ns/op    1053.38 MB/s
BenchmarkMergeJoiner/rows=1048576-8                      20   62822638 ns/op    1068.23 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=1024-8        20000      67343 ns/op     973.16 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=4096-8         5000     256988 ns/op    1020.06 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=16384-8        2000     997690 ns/op    1051.00 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=1048576-8        20   62184741 ns/op    1079.19 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=1024-8      20000      68672 ns/op     954.32 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=4096-8       1000    1336630 ns/op     196.12 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=16384-8       100   18478767 ns/op      56.74 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=32768-8        20   80351272 ns/op      26.10 MB/s
```

Note that the performance degradation in the 'bothSidesRepeat' cases is actually due to incorrect benchmarking in the first place. Regardless, this is still an area of investigation.

Release note: None

35989: opt: fix zigzag joins with IS condition r=justinj a=justinj

This commit fixes a problem where the constraint logic could infer that a column was constant, but our method of determining what that constant value was was incomplete, so we ended up with incomplete information. We now correctly infer the value for IS expressions, and we also skip over cases where this happens again so we degrade more gracefully (don't run a zigzag join, as opposed to panicking). A more correct fix here would extract the values directly from the constraints, but I'd like to get this fixed on master to unblock #34695.

Release note: None

35994: sqlsmith: add x IN (SELECT ...) r=justinj a=justinj

I added this in an attempt to repro a specific bug. I was unsuccessful, but figured I should PR this since I already did the work. This is technically just another instance of a comparison op; it could probably be refactored to be lumped in with those once there's more structure around how those are generated.

Release note: None

35995: sql: fix panic when searching for equivalent renders r=jordanlewis a=jordanlewis

Previously, certain kinds of window functions could generate a panic when trying to reuse equivalent renders when those renders were tuples. This is fixed by comparing the types using the Equivalent method instead of ==.

Release note (bug fix): fix panics caused by certain window functions that operate on tuples

Co-authored-by: George Utsin <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
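For the merge-joiner change above, here is a hedged, row-oriented Go sketch of the inverted approach (assume the whole group matches on the first column, then filter out pairs whose remaining equality columns differ). The real implementation is vectorized and batch-based; these names are illustrative only.

```go
package main

import "fmt"

// row is a toy tuple; the real vectorized merge joiner works over columnar
// batches, not row slices.
type row []int

// equalOnCols reports whether two rows agree on the extra equality columns
// beyond the first one that formed the group.
func equalOnCols(l, r row, cols []int) bool {
	for _, c := range cols {
		if l[c] != r[c] {
			return false
		}
	}
	return true
}

// joinGroup cross-joins a left and right group (already matched on the first
// column) and keeps only pairs that also agree on the extra columns: assume
// everything matches, then filter out the rows that don't.
func joinGroup(left, right []row, extraCols []int) [][2]row {
	var out [][2]row
	for _, l := range left {
		for _, r := range right {
			if equalOnCols(l, r, extraCols) {
				out = append(out, [2]row{l, r})
			}
		}
	}
	return out
}

func main() {
	left := []row{{1, 10}, {1, 20}}
	right := []row{{1, 10}, {1, 30}}
	fmt.Println(joinGroup(left, right, []int{1})) // only the (1,10)-(1,10) pair survives
}
```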
SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog
|
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, well, unfortunately we are not well equipped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since #35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and that has fixed some of the crashes that we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and lead to crashes - in particular, we've seen RPC errors.)

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/66f13c1d9a12c31e18a198da4ff5ac0bbe2db781
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1127849&tab=buildLog