Commit
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, well, unfortunately we are not well equipped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingested one of these ABORTED protos, it would end up in an inconsistent state: with an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS and that work should make this unfortunate patch unnecessary.

(Since #35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and that has fixed some of the crashes that we've seen because of the inconsistent state that this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and lead to crashes - in particular, we've seen RPC errors.)

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
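
For illustration only, the guard described in this message can be sketched as follows. This is not the actual patch: the types `TxnStatus`, `TxnCoordMeta` and `rootCoord` and the method `augmentMeta` are simplified, hypothetical stand-ins, and the real coordinator tracks far more state. The sketch assumes the only thing ingested is a list of read spans.

```go
package main

import "fmt"

// TxnStatus is a hypothetical stand-in for the txn proto's status.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// TxnCoordMeta is a simplified, hypothetical version of the record that
// remote flows send back to the root as trailing metadata.
type TxnCoordMeta struct {
	Status    TxnStatus
	ReadSpans []string
}

// rootCoord models only the parts of the root coordinator's state that
// matter for this sketch (an assumption, not the real struct).
type rootCoord struct {
	status           TxnStatus
	heartbeatRunning bool
	readSpans        []string
}

// augmentMeta ingests remote metadata into the root's state. Records that
// carry an ABORTED proto are skipped, so the root can never end up with an
// ABORTED proto while its heartbeat loop is still running. It reports
// whether the record was ingested.
func (c *rootCoord) augmentMeta(meta TxnCoordMeta) bool {
	if meta.Status == Aborted {
		// The root is expected to learn about the abort through an error
		// or through its heartbeat loop, not through this path.
		return false
	}
	c.readSpans = append(c.readSpans, meta.ReadSpans...)
	return true
}

func main() {
	root := &rootCoord{status: Pending, heartbeatRunning: true}

	ok := root.augmentMeta(TxnCoordMeta{Status: Pending, ReadSpans: []string{"a-b"}})
	fmt.Println("pending record ingested:", ok)

	ok = root.augmentMeta(TxnCoordMeta{Status: Aborted, ReadSpans: []string{"c-d"}})
	fmt.Println("aborted record ingested:", ok)

	fmt.Println("root still pending:", root.status == Pending, "spans:", root.readSpans)
}
```

In this sketch the aborted record is dropped whole, so its read spans are not merged either; the root keeps its PENDING status and learns about the abort through one of the other paths described above.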