Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: rank the errors received by the DistSQLReceiver #35105

Merged
merged 1 commit into from
Feb 22, 2019

Conversation

andreimatei
Copy link
Contributor

A DistSQL flow can potentially return many errors; different sub-flows
from different nodes, and different processors within a flow, can all
generate different errors. Before this patch, the first one to make it
to the receiver was the one presented to the client. This patch adds
more smarts be chosing the "best" error. The ranking is as follows, from
high precedence to low:

  • non-retriable error
  • TxnAbortedError
  • other retriable errors

Release note: None

@andreimatei andreimatei requested review from nvanbenschoten, asubiotto and a team February 20, 2019 23:48
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/roachpb/errors.proto, line 405 at r1 (raw file):

  // TransactionAbortedError. This indicates that the client will continue with
  // a completely new transaction, not the old transaction at a different epoch.
  optional bool prev_txn_aborted = 4 [(gogoproto.nullable) = false];

How does this differ from txn_id != transaction.id? I see some discussion about "escaping from some inner transaction", but is that really a concern anymore? I'd push to avoid denormalized state on these errors. If we need to, we can add an accessor instead.


pkg/sql/distsql_running.go, line 265 at r1 (raw file):

}

type errorScore int

Give this a comment. What does it mean? How should it be used? Why are these the four variants?

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @nvanbenschoten)


pkg/sql/distsql_plan_csv.go, line 127 at r1 (raw file):

}

// OverwriteError is part of the rowResultWriter interface.

Strange to have two methods that do exactly the same thing. I know they differ in terms of documentation, but there's nothing depending on or enforcing those rules. I'd either merge them, or add a if b.err != nil { panic } to SetError


pkg/sql/distsql_running.go, line 265 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment. What does it mean? How should it be used? Why are these the four variants?

[nit] maybe "precedence" instead of "score" (or "class" or "priority")


pkg/sql/distsql_running.go, line 584 at r1 (raw file):

}

func errScore(err error) errorScore {

[nit] since we are only using this with the pattern errScore(e1) > errScore(e2), we could make a function that takes two errors and hides all the score stuff inside

@andreimatei andreimatei requested a review from a team February 21, 2019 17:08
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test too.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @nvanbenschoten, and @RaduBerinde)


pkg/roachpb/errors.proto, line 405 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this differ from txn_id != transaction.id? I see some discussion about "escaping from some inner transaction", but is that really a concern anymore? I'd push to avoid denormalized state on these errors. If we need to, we can add an accessor instead.

went with an accessor, and also used it in the client.Txn code where we need to decide whether to create a new TxnCoordSender.
Btw in upcoming changes I'm going to make this error not be a proto any more (it never should have been, but the damn Sender interface...).


pkg/sql/distsql_running.go, line 265 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] maybe "precedence" instead of "score" (or "class" or "priority")

went with errorPriority and added a comment


pkg/sql/distsql_running.go, line 584 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] since we are only using this with the pattern errScore(e1) > errScore(e2), we could make a function that takes two errors and hides all the score stuff inside

meh. I think this is a fine building block and I also have plans to use it elsewhere


pkg/sql/distsql_plan_csv.go, line 127 at r1 (raw file):

Previously, RaduBerinde wrote…

Strange to have two methods that do exactly the same thing. I know they differ in terms of documentation, but there's nothing depending on or enforcing those rules. I'd either merge them, or add a if b.err != nil { panic } to SetError

Well the rule was enforced in the implementation that matters - the pgwire one.
But you know what, I got rid of OverwriteErr and document SetErr to overwrite. I was on the fence already.

A DistSQL flow can potentially return many errors; different sub-flows
from different nodes, and different processors within a flow, can all
generate different errors. Before this patch, the first one to make it
to the receiver was the one presented to the client. This patch adds
more smarts be chosing the "best" error. The ranking is as follows, from
high precedence to low:
- non-retriable error
- TxnAbortedError
- other retriable errors

Release note: None
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @nvanbenschoten, and @RaduBerinde)

craig bot pushed a commit that referenced this pull request Feb 22, 2019
35105: sql: rank the errors received by the DistSQLReceiver r=andreimatei a=andreimatei

A DistSQL flow can potentially return many errors; different sub-flows
from different nodes, and different processors within a flow, can all
generate different errors. Before this patch, the first one to make it
to the receiver was the one presented to the client. This patch adds
more smarts be chosing the "best" error. The ranking is as follows, from
high precedence to low:
- non-retriable error
- TxnAbortedError
- other retriable errors

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 22, 2019

Build succeeded

@craig craig bot merged commit 574e805 into cockroachdb:master Feb 22, 2019
@andreimatei andreimatei deleted the distsql.score-errors branch February 22, 2019 22:31
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Feb 27, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and I
believe that has fixed the majority of crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError should also be well
possible)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105 but this
patch makes it more convincing)

Release note: None
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Mar 21, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.
craig bot pushed a commit that referenced this pull request Mar 21, 2019
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since #35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Mar 21, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants