-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: rank the errors received by the DistSQLReceiver #35105
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)
pkg/roachpb/errors.proto, line 405 at r1 (raw file):
// TransactionAbortedError. This indicates that the client will continue with // a completely new transaction, not the old transaction at a different epoch. optional bool prev_txn_aborted = 4 [(gogoproto.nullable) = false];
How does this differ from txn_id != transaction.id
? I see some discussion about "escaping from some inner transaction", but is that really a concern anymore? I'd push to avoid denormalized state on these errors. If we need to, we can add an accessor instead.
pkg/sql/distsql_running.go, line 265 at r1 (raw file):
} type errorScore int
Give this a comment. What does it mean? How should it be used? Why are these the four variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @nvanbenschoten)
pkg/sql/distsql_plan_csv.go, line 127 at r1 (raw file):
} // OverwriteError is part of the rowResultWriter interface.
Strange to have two methods that do exactly the same thing. I know they differ in terms of documentation, but there's nothing depending on or enforcing those rules. I'd either merge them, or add a if b.err != nil { panic }
to SetError
pkg/sql/distsql_running.go, line 265 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this a comment. What does it mean? How should it be used? Why are these the four variants?
[nit] maybe "precedence" instead of "score" (or "class" or "priority")
pkg/sql/distsql_running.go, line 584 at r1 (raw file):
} func errScore(err error) errorScore {
[nit] since we are only using this with the pattern errScore(e1) > errScore(e2)
, we could make a function that takes two errors and hides all the score stuff inside
d604ebb
to
aed8ab0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test too.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @nvanbenschoten, and @RaduBerinde)
pkg/roachpb/errors.proto, line 405 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How does this differ from
txn_id != transaction.id
? I see some discussion about "escaping from some inner transaction", but is that really a concern anymore? I'd push to avoid denormalized state on these errors. If we need to, we can add an accessor instead.
went with an accessor, and also used it in the client.Txn
code where we need to decide whether to create a new TxnCoordSender
.
Btw in upcoming changes I'm going to make this error not be a proto any more (it never should have been, but the damn Sender interface...).
pkg/sql/distsql_running.go, line 265 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] maybe "precedence" instead of "score" (or "class" or "priority")
went with errorPriority
and added a comment
pkg/sql/distsql_running.go, line 584 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] since we are only using this with the pattern
errScore(e1) > errScore(e2)
, we could make a function that takes two errors and hides all the score stuff inside
meh. I think this is a fine building block and I also have plans to use it elsewhere
pkg/sql/distsql_plan_csv.go, line 127 at r1 (raw file):
Previously, RaduBerinde wrote…
Strange to have two methods that do exactly the same thing. I know they differ in terms of documentation, but there's nothing depending on or enforcing those rules. I'd either merge them, or add a
if b.err != nil { panic }
toSetError
Well the rule was enforced in the implementation that matters - the pgwire one.
But you know what, I got rid of OverwriteErr
and document SetErr
to overwrite. I was on the fence already.
A DistSQL flow can potentially return many errors; different sub-flows from different nodes, and different processors within a flow, can all generate different errors. Before this patch, the first one to make it to the receiver was the one presented to the client. This patch adds more smarts be chosing the "best" error. The ranking is as follows, from high precedence to low: - non-retriable error - TxnAbortedError - other retriable errors Release note: None
aed8ab0
to
574e805
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @nvanbenschoten, and @RaduBerinde)
35105: sql: rank the errors received by the DistSQLReceiver r=andreimatei a=andreimatei A DistSQL flow can potentially return many errors; different sub-flows from different nodes, and different processors within a flow, can all generate different errors. Before this patch, the first one to make it to the receiver was the one presented to the client. This patch adds more smarts be chosing the "best" error. The ranking is as follows, from high precedence to low: - non-retriable error - TxnAbortedError - other retriable errors Release note: None Co-authored-by: Andrei Matei <[email protected]>
Build succeeded |
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto. Why not? Because, well, unfortunately we are not well equiped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases. In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, injesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary. (since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored and I believe that has fixed the majority of crashes that we've seen because of this inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError should also be well possible) Fixes cockroachdb#34695 Fixes cockroachdb#34341 Fixes cockroachdb#33698 (I believe all the issues above were really fixed by cockroachdb#35105 but this patch makes it more convincing) Release note: None
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto. Why not? Because, well, unfortunately we are not well equiped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases. In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, injesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary. (since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored and that has fixed the some crashes that we've seen because of this inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and leading to crashes - in particular, we've seen RPC errors). Fixes cockroachdb#34695 Fixes cockroachdb#34341 Fixes cockroachdb#33698 Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto. Why not? Because, well, unfortunately we are not well equiped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases. In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, injesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary. (since #35105 not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored and that has fixed the some crashes that we've seen because of this inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and leading to crashes - in particular, we've seen RPC errors). Fixes #34695 Fixes #34341 Fixes #33698 Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors. Co-authored-by: Andrei Matei <[email protected]>
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto. Why not? Because, well, unfortunately we are not well equiped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases. In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, injesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary. (since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored and that has fixed the some crashes that we've seen because of this inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and leading to crashes - in particular, we've seen RPC errors). Fixes cockroachdb#34695 Fixes cockroachdb#34341 Fixes cockroachdb#33698 Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.
A DistSQL flow can potentially return many errors; different sub-flows
from different nodes, and different processors within a flow, can all
generate different errors. Before this patch, the first one to make it
to the receiver was the one presented to the client. This patch adds
more smarts be chosing the "best" error. The ranking is as follows, from
high precedence to low:
Release note: None