roachtest: scaledata/distributed_semaphore/nodes=3 failed #33698
Hmm, I can't really read these logs.
is the highest retry count I can find (i.e. there isn't anything that retries 8 times). Also, the log covers approximately 10 minutes, which is how long the workload is supposed to run. So what happened to the other 10 minutes? The latest sqlapp log entry is
which is just about when it was supposed to shut down. It's also when the chaos runner stops. But then the test just hangs for another ~10 minutes. Looking at the code, this must be caused by sqlapp not returning (see the sketch below). It's unclear to me why; the cluster is left intact by the chaos runner (i.e. all nodes are up). The node logs are innocuous; the only thing I don't understand is this error that pops up a small number of times:
Not sure what to make of this. I did poke at the
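To make the hang concrete, here is a minimal Go sketch of the shape of the problem (not the actual roachtest code; `runChaos`/`runSqlapp` are made-up stand-ins): a harness that waits on both the chaos runner and the workload blocks past the intended runtime if the workload never returns, even with all nodes back up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Made-up stand-ins for the chaos runner and the sqlapp workload.
func runChaos(d time.Duration)  { time.Sleep(d) }              // respects its deadline
func runSqlapp(d time.Duration) { time.Sleep(24 * time.Hour) } // overshoots its deadline

func main() {
	const duration = 10 * time.Minute
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); runChaos(duration) }()
	go func() { defer wg.Done(); runSqlapp(duration) }()
	// The chaos runner stops on schedule, but the harness keeps blocking
	// here because the workload never returns, which is what an extra
	// ~10min hang after the chaos runner stopped would look like.
	wg.Wait()
	fmt.Println("done")
}
```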
Actually, maybe this log means something bad:
It shouldn't, but I'm just going out on a limb. We should edit the source code of this generator to emit better logging that would help in investigating this particular problem.
SHA: https://github.com/cockroachdb/cockroach/commits/82026117d83262e87873aad52b8eca2dd0bea335
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126329&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/b9242838f18107318f53c4e22266fd12223f7651
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126617&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/4e2e450ecf5c3f73c7de46e48696528468803db7
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1128035&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135549&tab=buildLog
Same. cc @andreimatei
#34695 is the failure for the 6-node version of this test. They're probably related. @tbg you say "same", but the fatal you pasted is "unexpected non-pending txn in augmentMetaLocked" which is a new one, isn't it? Although I'm sure it's related to the "txn committed or aborted but heartbeat loop hasn't been signaled to stop" condition we've seen before.
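For reference, here is a toy Go sketch of the invariant that fatal is guarding. None of these types or fields are the real cockroach ones (the only thing taken from the source is the error message): the idea is that once the txn proto is no longer PENDING, the heartbeat loop should already have been told to stop, so merging an ABORTED proto from a remote flow while the loop is still running trips the assertion.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model only; the types and fields below are invented for illustration.
type txnStatus int

const (
	pending txnStatus = iota
	aborted
)

type coordSender struct {
	status           txnStatus
	heartbeatRunning bool
}

// augmentMeta models ingesting a remote TxnCoordMeta. Merging a non-PENDING
// (e.g. ABORTED) proto while the heartbeat loop is still running produces
// exactly the state the real assertion refuses to accept.
func (c *coordSender) augmentMeta(remote txnStatus) error {
	if remote != pending && c.heartbeatRunning {
		return errors.New("unexpected non-pending txn in augmentMetaLocked")
	}
	c.status = remote
	return nil
}

func main() {
	c := &coordSender{status: pending, heartbeatRunning: true}
	// A remote DistSQL flow reports an ABORTED proto before the root has
	// rolled back and stopped heartbeating: the invariant trips.
	fmt.Println(c.augmentMeta(aborted))
}
```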
I was eyeballing; it's possible they're not the same.
I've been making progress in the meantime.
SHA: https://github.com/cockroachdb/cockroach/commits/4fc4d63ddddeb507564bfe53c6cd79549ff2fd27
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1141650&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/763e90b043b5d732856d3ecf1d7b0d6aa33e3b26
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1147715&tab=buildLog
^- same as usual
SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog
@nvanbenschoten would you mind taking a look at the last failure here? It's different from the others (and also I think I had fixed the others). This one looks like the semaphores app crashed because it couldn't connect to node 2, and node 2 was dead because of chaos at that time. Do we have sources for these "apps"? I see the test downloads some binaries.
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn (TxnCoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, unfortunately, we are not well equipped at the moment to find out about an aborted txn this way. If the Root was running along happily and all of a sudden ingests one of these ABORTED protos, it would be put in an inconsistent state: an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedError; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with the current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, there's a case where the TxnAbortedError is not passed to the TCS: when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since cockroachdb#35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and I believe that has fixed the majority of the crashes we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError should also be entirely possible.)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698
(I believe all the issues above were really fixed by cockroachdb#35105, but this patch makes it more convincing.)

Release note: None
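In code terms, the change described above amounts to an early-out when the incoming proto is not PENDING. A hedged toy sketch follows (not the actual diff; the identifiers below are simplified stand-ins, not cockroach's real ones):

```go
package main

import "fmt"

// Simplified stand-ins for the real types.
type txnStatus int

const (
	statusPending txnStatus = iota
	statusAborted
)

type txnCoordMeta struct {
	status    txnStatus
	readSpans []string
}

type rootSender struct {
	readSpans []string
}

// ingestMeta models the patched behavior: trailing metadata carrying a
// non-PENDING (e.g. ABORTED) txn proto is dropped instead of merged, so the
// "aborted proto + running heartbeat loop" state can no longer be created
// through this path. The root still learns about aborts via TxnAbortedError
// or its heartbeat loop, as the commit message above describes.
func (r *rootSender) ingestMeta(meta txnCoordMeta) {
	if meta.status != statusPending {
		return // ignore ABORTED (and any other non-pending) metadata
	}
	r.readSpans = append(r.readSpans, meta.readSpans...)
}

func main() {
	root := &rootSender{}
	root.ingestMeta(txnCoordMeta{status: statusAborted, readSpans: []string{"a-b"}})
	fmt.Println(len(root.readSpans)) // 0: the aborted metadata was ignored
}
```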
SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog
Looks like this tickled a zigzag bug:
cc @justinj |
SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn (TxnCoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, unfortunately, we are not well equipped at the moment to find out about an aborted txn this way. If the Root was running along happily and all of a sudden ingests one of these ABORTED protos, it would be put in an inconsistent state: an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedError; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with the current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, there's a case where the TxnAbortedError is not passed to the TCS: when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since #35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and that has fixed some of the crashes we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and lead to crashes; in particular, we've seen RPC errors.)

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/766a53defc27940d41ae8987792ecd79e76d2b65
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1090948&tab=buildLog