Skip to content

Commit

Permalink
Merge #35249
Browse files Browse the repository at this point in the history
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since #35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
  • Loading branch information
craig[bot] and andreimatei committed Mar 21, 2019
2 parents b9e9de5 + 52e1c01 commit a200cea
Show file tree
Hide file tree
Showing 2 changed files with 37 additions and 0 deletions.
17 changes: 17 additions & 0 deletions pkg/kv/txn_coord_sender.go
Original file line number Diff line number Diff line change
Expand Up @@ -631,6 +631,23 @@ func (tc *TxnCoordSender) AugmentMeta(ctx context.Context, meta roachpb.TxnCoord
if tc.mu.txn.ID != meta.Txn.ID {
return
}

// If the TxnCoordMeta is telling us the transaction has been aborted, it's
// better if we don't ingest it. Ingesting it would possibly put us in an
// inconsistent state, with an ABORTED proto but with the heartbeat loop still
// running. When this TxnCoordMeta was received from DistSQL, it presumably
// followed a TxnAbortedError that was also received. If that error was also
// passed to us, then we've already aborted the txn and ingesting the meta
// would be OK. However, as it stands, if the TxnAbortedError followed a
// non-retriable error, than we don't get the aborted error (in fact, we don't
// get either of the errors; the client is responsible for rolling back).
// TODO(andrei): A better design would be to abort the txn as soon as any
// error is received from DistSQL, which would eliminate qualms about what
// error comes first.
if meta.Txn.Status != roachpb.PENDING {
return
}

tc.augmentMetaLocked(ctx, meta)
}

Expand Down
20 changes: 20 additions & 0 deletions pkg/kv/txn_coord_sender_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -2384,3 +2384,23 @@ func TestLeafTxnClientRejectError(t *testing.T) {
t.Fatalf("expected UnhandledRetryableError(TransactionAbortedError), got: (%T) %v", err, err)
}
}

// Check that ingesting an Aborted txn record is a no-op. The TxnCoordSender is
// supposed to reject such updates because they risk putting it into an
// inconsistent state. See comments in TxnCoordSender.AugmentMeta().
func TestAugmentTxnCoordMetaAbortedTxn(t *testing.T) {
defer leaktest.AfterTest(t)()
s := createTestDBWithContextAndKnobs(t, client.DefaultDBContext(), nil /* knobs */)
defer s.Stop()
ctx := context.Background()

txn := client.NewTxn(ctx, s.DB, 0 /* gatewayNodeID */, client.RootTxn)
txnProto := txn.Serialize()
txnProto.Status = roachpb.ABORTED
txn.AugmentTxnCoordMeta(ctx, roachpb.TxnCoordMeta{Txn: *txnProto})
// Check that the transaction was not updated.
txnProto = txn.Serialize()
if txnProto.Status != roachpb.PENDING {
t.Fatalf("expected PENDING txn, got: %s", txnProto.Status)
}
}

0 comments on commit a200cea

Please sign in to comment.