sql: rank the errors received by the DistSQLReceiver #35105

andreimatei · 2019-02-20T23:48:58Z

A DistSQL flow can potentially return many errors: different sub-flows
from different nodes, and different processors within a flow, can all
generate different errors. Before this patch, the first error to reach
the receiver was the one presented to the client. This patch makes the
receiver smarter by choosing the "best" error. The ranking is as
follows, from high precedence to low:

- non-retriable error
- TxnAbortedError
- other retriable errors

Release note: None

cockroach-teamcity · 2019-02-20T23:49:06Z


nvanbenschoten

Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/roachpb/errors.proto, line 405 at r1 (raw file):

  // TransactionAbortedError. This indicates that the client will continue with
  // a completely new transaction, not the old transaction at a different epoch.
  optional bool prev_txn_aborted = 4 [(gogoproto.nullable) = false];

How does this differ from txn_id != transaction.id? I see some discussion about "escaping from some inner transaction", but is that really a concern anymore? I'd push to avoid denormalized state on these errors. If we need to, we can add an accessor instead.

pkg/sql/distsql_running.go, line 265 at r1 (raw file):

}

type errorScore int

Give this a comment. What does it mean? How should it be used? Why are these the four variants?

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @nvanbenschoten)

pkg/sql/distsql_plan_csv.go, line 127 at r1 (raw file):

}

// OverwriteError is part of the rowResultWriter interface.

Strange to have two methods that do exactly the same thing. I know they differ in terms of documentation, but nothing depends on or enforces those rules. I'd either merge them, or add an if b.err != nil { panic } to SetError.

pkg/sql/distsql_running.go, line 265 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment. What does it mean? How should it be used? Why are these the four variants?

[nit] maybe "precedence" instead of "score" (or "class" or "priority")

pkg/sql/distsql_running.go, line 584 at r1 (raw file):

}

func errScore(err error) errorScore {

[nit] since we are only using this with the pattern errScore(e1) > errScore(e2), we could make a function that takes two errors and hides all the score stuff inside

andreimatei

I've added a test too.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @nvanbenschoten, and @RaduBerinde)

pkg/roachpb/errors.proto, line 405 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this differ from txn_id != transaction.id? I see some discussion about "escaping from some inner transaction", but is that really a concern anymore? I'd push to avoid denormalized state on these errors. If we need to, we can add an accessor instead.

went with an accessor, and also used it in the client.Txn code where we need to decide whether to create a new TxnCoordSender.
Btw in upcoming changes I'm going to make this error not be a proto any more (it never should have been, but the damn Sender interface...).
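
A rough sketch of what such an accessor could look like, assuming the error carries both the ID of the transaction that hit the error and the transaction to continue with. The `retryError`, `txn`, and `txnID` types below are simplified, hypothetical stand-ins, not the roachpb definitions.

```go
package main

import "fmt"

// txnID, txn and retryError are simplified, hypothetical stand-ins.
type txnID string

type txn struct{ ID txnID }

type retryError struct {
	TxnID       txnID // transaction that encountered the error
	Transaction txn   // transaction the client should continue with
}

// PrevTxnAborted reports whether the client must continue with a brand
// new transaction rather than the old one at a different epoch. It is
// derived from the two IDs instead of being stored as a separate field.
func (e *retryError) PrevTxnAborted() bool {
	return e.TxnID != e.Transaction.ID
}

func main() {
	sameTxn := &retryError{TxnID: "a", Transaction: txn{ID: "a"}}
	newTxn := &retryError{TxnID: "a", Transaction: txn{ID: "b"}}
	fmt.Println(sameTxn.PrevTxnAborted(), newTxn.PrevTxnAborted()) // false true
}
```

Deriving the answer from existing fields avoids the denormalized state raised in the review: the flag can never disagree with the IDs.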

pkg/sql/distsql_running.go, line 265 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] maybe "precedence" instead of "score" (or "class" or "priority")

went with errorPriority and added a comment

pkg/sql/distsql_running.go, line 584 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] since we are only using this with the pattern errScore(e1) > errScore(e2), we could make a function that takes two errors and hides all the score stuff inside

meh. I think this is a fine building block and I also have plans to use it elsewhere

pkg/sql/distsql_plan_csv.go, line 127 at r1 (raw file):

Previously, RaduBerinde wrote…

Strange to have two methods that do exactly the same thing. I know they differ in terms of documentation, but nothing depends on or enforces those rules. I'd either merge them, or add an if b.err != nil { panic } to SetError.

Well, the rule was enforced in the implementation that matters: the pgwire one.
But you know what, I got rid of OverwriteError and documented SetError as overwriting any previous error. I was on the fence already.

andreimatei

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @nvanbenschoten, and @RaduBerinde)

35105: sql: rank the errors received by the DistSQLReceiver r=andreimatei a=andreimatei

A DistSQL flow can potentially return many errors: different sub-flows from different nodes, and different processors within a flow, can all generate different errors. Before this patch, the first error to reach the receiver was the one presented to the client. This patch makes the receiver smarter by choosing the "best" error. The ranking is as follows, from high precedence to low:

- non-retriable error
- TxnAbortedError
- other retriable errors

Release note: None

Co-authored-by: Andrei Matei <[email protected]>

craig · 2019-02-22T20:56:09Z

Build succeeded

GitHub CI (Cockroach)

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, unfortunately, we are not well equipped at the moment for finding out about an aborted txn this way. If the Root was running along happily and all of a sudden ingested one of these ABORTED protos, it would end up in an inconsistent state: with an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it.

The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way, through a remote TxnAbortedError; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with the current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary.

(Since cockroachdb#35105, not all preceding errors cause the TxnAbortedError to be ignored: other retriable errors no longer cause it to be ignored, and I believe that has fixed the majority of the crashes that we've seen because of the inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError should also be quite possible.)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105 but this patch makes it more convincing)

Release note: None

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, unfortunately, we are not well equipped at the moment for finding out about an aborted txn this way. If the Root was running along happily and all of a sudden ingested one of these ABORTED protos, it would end up in an inconsistent state: with an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it.

The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way, through a remote TxnAbortedError; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with the current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary.

(Since cockroachdb#35105, not all preceding errors cause the TxnAbortedError to be ignored: other retriable errors no longer cause it to be ignored, and that has fixed some of the crashes that we've seen because of the inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and lead to crashes; in particular, we've seen RPC errors.)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.
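
The guard described in this commit message can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the kv code: `coordMeta`, `coordSender`, `augmentMeta`, and the two-value `txnStatus` enum are hypothetical stand-ins for TxnCoordMeta, the TxnCoordSender, and its metadata-ingestion path.

```go
package main

import "fmt"

type txnStatus int

const (
	pending txnStatus = iota
	aborted
)

// coordMeta and coordSender are hypothetical stand-ins for TxnCoordMeta
// and the TxnCoordSender (TCS).
type coordMeta struct {
	Status    txnStatus
	ReadSpans []string
}

type coordSender struct {
	status    txnStatus
	readSpans []string
}

// augmentMeta merges trailing metadata from a remote flow. Records whose
// txn proto is ABORTED are skipped, so the sender never ends up holding
// an aborted proto while its heartbeat loop is still running. It reports
// whether the record was ingested.
func (c *coordSender) augmentMeta(m coordMeta) bool {
	if m.Status == aborted {
		return false
	}
	c.readSpans = append(c.readSpans, m.ReadSpans...)
	return true
}

func main() {
	tcs := &coordSender{}
	fmt.Println(tcs.augmentMeta(coordMeta{Status: pending, ReadSpans: []string{"a-b"}}))
	fmt.Println(tcs.augmentMeta(coordMeta{Status: aborted, ReadSpans: []string{"c-d"}}))
	fmt.Println(tcs.status == pending, tcs.readSpans)
}
```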

andreimatei requested review from nvanbenschoten, asubiotto and a team February 20, 2019 23:48

nvanbenschoten approved these changes Feb 21, 2019

RaduBerinde approved these changes Feb 21, 2019

andreimatei requested a review from a team February 21, 2019 17:08

andreimatei force-pushed the distsql.score-errors branch from d604ebb to aed8ab0 Compare February 21, 2019 19:47

andreimatei commented Feb 21, 2019

andreimatei force-pushed the distsql.score-errors branch from aed8ab0 to 574e805 Compare February 21, 2019 19:52

andreimatei commented Feb 22, 2019

craig bot merged commit 574e805 into cockroachdb:master Feb 22, 2019

andreimatei deleted the distsql.score-errors branch February 22, 2019 22:31

This was referenced Feb 27, 2019

kv: don't ingest aborted TxnCoordMeta #35249

Merged

roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695

Closed

andreimatei mentioned this pull request Mar 21, 2019

release-19.1: kv: don't ingest aborted TxnCoordMeta #36041

Merged
