-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.1: kv, storage: fix handling of ambiguous failures on commit #111017
Closed
AlexTalks
wants to merge
7
commits into
cockroachdb:release-23.1
from
AlexTalks:alt_backport23.1_ambiguous_commit_fix
Closed
release-23.1: kv, storage: fix handling of ambiguous failures on commit #111017
AlexTalks
wants to merge
7
commits into
cockroachdb:release-23.1
from
AlexTalks:alt_backport23.1_ambiguous_commit_fix
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This change introduces `MVCCWriteOptions`, a structure for bundling parameters for `MVCCPut`, `MVCCDelete`, and their many variants, and refactors usages of these functions across the code base in order to move the existing function arguments into this structure. In addition to allowing the code to eliminate specifying default values in many callers, this enables the ability to pass new flags to write operations such as the replay protection needed to address cockroachdb#103817. Part of: cockroachdb#103817 Release note: None
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817
In cockroachdb#107323, testing for the ambiguous write case that leads to the "transaction unexpectedly committed" bug were introduced, however to increase test coverage of the fix, multiple schedules of operations need to be tested. This change simply refactors the framework of the existing test in order to enable the addition of muliple subtests. The subtests are included in a separate patch. Part of: cockroachdb#103817 Release note: None
This change builds on the testing introduced in cockroachdb#107323 with additional subtests evaluating the behavior of different race outcomes with contended transactions where the first transaction experiences an RPC failure (i.e. an ambiguous write). Part of: cockroachdb#103817 Release note: None
While previously, RPC failures were assumed to be retryable, as write operations (with the notable exception of `EndTxn`) were assumed to be idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that RPC failures on write operations that occur in parallel with a commit (i.e. a partial batch where `withCommit==true`), it is not always possible to assume idempotency and retry the "ambiguous" writes. This is due to the fact that the retried write RPC could result in the transaction's `WriteTimestamp` being bumped, changing the commit timestamp of the transaction that may in fact already be implicitly committed if the initial "ambiguous" write actually succeeded. This change modifies the protocol of the DistSender to flag in subsequent retries that a batch with a commit has previously experienced ambiguity, as well as the handling of the retried write in the MVCC layer to detect this previous ambiguity and reject retries that change the write timestamp as a non-idempotent replay. The flag allows subsequent retries to "remember" the earlier ambiguous write and evaluate accordingly. The flag allows us to properly handle RPC failures (i.e. ambiguous writes) that occur on commit, as a transaction that is implicitly committed is eligible to be marked as explicitly committed by contending transactions via the `RecoverTxn` operation, resulting in a race between retries by the transaction coordinator and recovery by contending transactions that could result in either incorrectly reporting a transaction as having failed with a `RETRY_SERIALIZABLE` error (despite the possibility that it already was or could be recovered and successfully committed), or in attempting to explicitly commit an already-recovered and committed transaction, resulting in seeing an assertion failure due to `transaction unexpectedly committed`. The replay protection introduced here allows us to avoid both of these situations by detecting a replay that should be considered non-idempotent and returning an error, causing the original RPC error remembered by the DistSender to be propagated as an `AmbiguousResultError`. As such, this can be handled by application code by validating the success/failure of a transaction when receiving this error. Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001 Fixes: cockroachdb#103817 Release note (bug fix): Properly handles RPC failures on writes using the parallel commit protocol that execute in parallel to the commit operation, avoiding incorrect retryable failures and `transaction unexpectedly committed` assertions by detecting when writes cannot be retried idempotently, instead returning an `AmbiguousResultError`.
The `TestTransactionUnexpectedlyCommitted/recovery_after_transfer_lease` test, introduced to test cockroachdb#107658, has been flaky (particularly under deadlock builds) due to a race condition between a retry of a write and intent resolution. While both orderings in this test result in a correct `AmbiguousResultError` for the client, when intent resolution wins the race, the retried write will attempt to push away the current lockholder; since it is illegal for a committed transaction to perform a push, this results in a different secondary error attached to the `AmbiguousResultError`. This change ensures a predefined ordering of these operations so that the secondary error is consistent across runs of the test. Fixes: cockroachdb#110187 Release note: None
Prior to cockroachdb#102808 and cockroachdb#102809, `WriteTooOldError`s were deferred for client-side handling, allowing intent writes to proceed. This means some changes to the fix introduced in cockroachdb#107658 are required, in order to handle retried blind writes that encounter a `WriteTooOld`, even if deferred, with the RPC returning as a success. The change thus catches the `WriteTooOld` flag on a successful batch response, propagating the original RPC failure from the on-commit batch as an `AmbiguousResultError`. This does have the side-effect of leaving behind an intent written on the retry of a blind write (after the original intents were cleaned up), though they would be resolved by any future readers. This change also adapts some of the MVCC history tests to the earlier formatting used in 23.1. Part of: cockroachdb#103817 Release note: None
7196c36
to
54a8c4c
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport:
TestTransactionUnexpectedlyCommitted
#111090Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: Backporting fix for outstanding bug described in #103817.