Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: kv, storage: fix handling of ambiguous failures on commit #111017

Conversation

AlexTalks
Copy link
Contributor

@AlexTalks AlexTalks commented Sep 21, 2023

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release


Release justification: Backporting fix for outstanding bug described in #103817.

@AlexTalks AlexTalks changed the title [draft] release-23.1: kv, storage: fix handling of ambiguous failures on commit #110757 [draft] release-23.1: kv, storage: fix handling of ambiguous failures on commit Sep 21, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

This change introduces `MVCCWriteOptions`, a structure for bundling
parameters for `MVCCPut`, `MVCCDelete`, and their many variants, and
refactors usages of these functions across the code base in order to
move the existing function arguments into this structure. In addition to
allowing the code to eliminate specifying default values in many
callers, this enables the ability to pass new flags to write operations
such as the replay protection needed to address cockroachdb#103817.

Part of: cockroachdb#103817

Release note: None
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and
seen in cockroachdb#67765, which currently is a bug in our implementation of the
parallel commit protocol that results in the assertion failure known as
`transaction unexpectedly committed`. The test currently validates the
incorrect behavior of the known issue, though it is inded to be used to
validate the potential fixes as proposed in cockroachdb#103817.

Release note: None

Part of: cockroachdb#103817
In cockroachdb#107323, testing for the ambiguous write case that leads to the
"transaction unexpectedly committed" bug were introduced, however to
increase test coverage of the fix, multiple schedules of operations need
to be tested. This change simply refactors the framework of the existing
test in order to enable the addition of muliple subtests. The subtests
are included in a separate patch.

Part of: cockroachdb#103817

Release note: None
This change builds on the testing introduced in cockroachdb#107323 with additional
subtests evaluating the behavior of different race outcomes with
contended transactions where the first transaction experiences an RPC
failure (i.e. an ambiguous write).

Part of: cockroachdb#103817

Release note: None
While previously, RPC failures were assumed to be retryable, as write
operations (with the notable exception of `EndTxn`) were assumed to be
idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that
RPC failures on write operations that occur in parallel with a commit
(i.e. a partial batch where `withCommit==true`), it is not always
possible to assume idempotency and retry the "ambiguous" writes.
This is due to the fact that the retried write RPC could result in the
transaction's `WriteTimestamp` being bumped, changing the commit
timestamp of the transaction that may in fact already be implicitly
committed if the initial "ambiguous" write actually succeeded.

This change modifies the protocol of the DistSender to flag in
subsequent retries that a batch with a commit has previously
experienced ambiguity, as well as the handling of the retried write in
the MVCC layer to detect this previous ambiguity and reject retries
that change the write timestamp as a non-idempotent replay. The flag
allows subsequent retries to "remember" the earlier ambiguous write and
evaluate accordingly.

The flag allows us to properly handle RPC failures (i.e. ambiguous
writes) that occur on commit, as a transaction that is implicitly
committed is eligible to be marked as explicitly committed by contending
transactions via the `RecoverTxn` operation, resulting in a race between
retries by the transaction coordinator and recovery by contending
transactions that could result in either incorrectly reporting a
transaction as having failed with a `RETRY_SERIALIZABLE` error (despite
the possibility that it already was or could be recovered and
successfully committed), or in attempting to explicitly commit an
already-recovered and committed transaction, resulting in seeing an
assertion failure due to `transaction unexpectedly committed`.

The replay protection introduced here allows us to avoid both of these
situations by detecting a replay that should be considered
non-idempotent and returning an error, causing the original RPC error
remembered by the DistSender to be propagated as an
`AmbiguousResultError`. As such, this can be handled by application code
by validating the success/failure of a transaction when receiving this
error.

Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001

Fixes: cockroachdb#103817

Release note (bug fix): Properly handles RPC failures on writes using
the parallel commit protocol that execute in parallel to the commit
operation, avoiding incorrect retryable failures and `transaction
unexpectedly committed` assertions by detecting when writes cannot
be retried idempotently, instead returning an `AmbiguousResultError`.
The `TestTransactionUnexpectedlyCommitted/recovery_after_transfer_lease`
test, introduced to test cockroachdb#107658, has been flaky (particularly under
deadlock builds) due to a race condition between a retry of a write and
intent resolution. While both orderings in this test result in a correct
`AmbiguousResultError` for the client, when intent resolution wins the
race, the retried write will attempt to push away the current
lockholder; since it is illegal for a committed transaction to perform a
push, this results in a different secondary error attached to the
`AmbiguousResultError`. This change ensures a predefined ordering of
these operations so that the secondary error is consistent across runs
of the test.

Fixes: cockroachdb#110187

Release note: None
Prior to cockroachdb#102808 and cockroachdb#102809, `WriteTooOldError`s were deferred for
client-side handling, allowing intent writes to proceed. This means some
changes to the fix introduced in cockroachdb#107658 are required, in order to
handle retried blind writes that encounter a `WriteTooOld`, even if
deferred, with the RPC returning as a success. The change thus catches
the `WriteTooOld` flag on a successful batch response, propagating the
original RPC failure from the on-commit batch as an
`AmbiguousResultError`. This does have the side-effect of leaving behind
an intent written on the retry of a blind write (after the original
intents were cleaned up), though they would be resolved by any future
readers.

This change also adapts some of the MVCC history tests to the earlier
formatting used in 23.1.

Part of: cockroachdb#103817

Release note: None
@AlexTalks AlexTalks force-pushed the alt_backport23.1_ambiguous_commit_fix branch from 7196c36 to 54a8c4c Compare September 27, 2023 20:20
@AlexTalks AlexTalks marked this pull request as ready for review September 27, 2023 20:23
@AlexTalks AlexTalks requested review from a team as code owners September 27, 2023 20:23
@AlexTalks AlexTalks requested a review from a team September 27, 2023 20:23
@AlexTalks AlexTalks requested review from a team as code owners September 27, 2023 20:23
@AlexTalks AlexTalks requested a review from itsbilal September 27, 2023 20:23
@AlexTalks AlexTalks changed the title [draft] release-23.1: kv, storage: fix handling of ambiguous failures on commit release-23.1: kv, storage: fix handling of ambiguous failures on commit Sep 27, 2023
@AlexTalks
Copy link
Contributor Author

Split up into #111870, #111871, and #111876.

After some discussion, we decided against backporting changes to the WriteTooOld behavior, so that specific subcase - when intents have already been cleaned up by the time an ambiguous write is retried - will only be fixed on v23.2 and onwards.

@AlexTalks AlexTalks closed this Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants