release-21.1: kv: commit-wait before running commit triggers and resolving intents #64304

nvanbenschoten · 2021-04-27T23:10:35Z

Backport 3/3 commits from #63971.

/cc @cockroachdb/release

Fixes a serious bug revealed by #63747.

This commit fixes a bug revealed by kvnemesis where a range-merge watcher on the
right-hand side of a range merge could incorrectly determine that a range merge
txn had succeeded, when in reality, it had failed. The watcher would then put
the RHS leaseholder replica into a stalled state by setting r.mu.destroyStatus
to destroyReasonMergePending, effectively stalling any operation on the range
indefinitely.

The setup for this bug was that a range was operating with a global_reads zone
configuration attribute, so it was pushing all writers into the future. The
range was split and then rapidly merged back together. During the merge txn, a
range-merge watcher (see maybeWatchForMergeLocked) began monitoring the state
of the range merge txn. The problem was that at the time that the range merge
txn committed, neither the meta descriptor version written by the merge or even
the meta descriptor version written by the split were visible to the watcher's
follow-up query. Because the watcher read below the split txn's descriptor, it
came to the wrong conclusion about the merge.

It is interesting to think about what is going wrong here, because it's not
immediately obvious who is at fault. If a transaction has a commit timestamp in
the future of present time, it will need to commit-wait before acknowledging the
client. Typically, this is performed in the TxnCoordSender after the transaction
has committed and resolved its intents (see TxnCoordSender.maybeCommitWait). It
is safe to wait after a future-time transaction has committed and resolved
intents without compromising linearizability because the uncertainty interval of
concurrent and later readers ensures atomic visibility of the effects of the
transaction. In other words, all of the transaction's intents will become
visible and will remain visible at once, which is sometimes called "monotonic
reads". This is true even if the resolved intents are at a high enough timestamp
such that they are not visible to concurrent readers immediately after they are
resolved, but only become visible sometime during the writer's commit-wait
sleep. This property is central to the correctness of non-blocking transactions.
See: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200811_non_blocking_txns.md

However, if a transaction has a commit trigger, the side-effects of the trigger
will go into effect immediately upon applying the corresponding Raft log entry.
This poses a problem, because we do not want part of a transaction's effects
(e.g. its commit trigger) to become visible to onlookers before the rest of its
effects do (e.g. its intent writes).

To avoid this problem, this commit adds special server-side logic to perform the
commit-wait stage of a transaction with a commit trigger early, before its
EndTxn evaluates and its commit trigger fires. This results in the transaction
waiting longer to commit, run its commit trigger, and resolve its intents, but
it is otherwise safe and effective.

Interestingly, this is quite similar to how Spanner handles its commit-wait rule:

Before allowing any coordinator replica to apply the commit record, the
coordinator leader waits until TT.after(s), so as to obey the commit-wait rule
described in Section 4.1.2. Because the coordinator leader chose s based on
TT.now().latest, and now waits until that timestamp is guaranteed to be in the
past, the expected wait is at least 2 ∗ . This wait is typically overlapped with
Paxos communication. After commit wait, the coordinator sends the commit
timestamp to the client and all other participant leaders. Each participant
leader logs the transaction’s outcome through Paxos. All participants apply at
the same timestamp and then release locks.

Of course, the whole point of non-blocking transactions is that we release locks
early and use clocks (through uncertainty intervals + a reader-side commit-wait
rule) to enforce consistency, so we don't want to make this change for standard
transactions.

Before this change, I could hit the bug in about 5 minutes of stressing
kvnemesis on a roachprod cluster. After this change, I've been able to run
kvnemesis for a few hours without issue.

Release note (bug fix): Fixed a rare bug present in betas where rapid range
splits and merges on a GLOBAL table could lead to a stuck leaseholder replica.
The situation is no longer possible.

cc. @cockroachdb/kv

This commit adds support to `(*hlc.Clock).SleepUntil` to listen to context cancellation and terminate its sleep early if the context that is passed to it is canceled.

Fixes a serious bug revealed by cockroachdb#63747. This commit fixes a bug revealed by kvnemesis where a range-merge watcher on the right-hand side of a range merge could incorrectly determine that a range merge txn had succeeded, when in reality, it had failed. The watcher would then put the RHS leaseholder replica into a stalled state by setting `r.mu.destroyStatus` to `destroyReasonMergePending`, effectively stalling any operation on the range indefinitely. The setup for this bug was that a range was operating with a `global_reads` zone configuration attribute, so it was pushing all writers into the future. The range was split and then rapidly merged back together. During the merge txn, a range-merge watcher (see `maybeWatchForMergeLocked`) began monitoring the state of the range merge txn. The problem was that at the time that the range merge txn committed, neither the meta descriptor version written by the merge or even the meta descriptor version written by the split were visible to the watcher's follow-up query. Because the watcher read below the split txn's descriptor, it came to the wrong conclusion about the merge. It is interesting to think about what is going wrong here, because it's not immediately obvious who is at fault. If a transaction has a commit timestamp in the future of present time, it will need to commit-wait before acknowledging the client. Typically, this is performed in the TxnCoordSender after the transaction has committed and resolved its intents (see TxnCoordSender.maybeCommitWait). It is safe to wait after a future-time transaction has committed and resolved intents without compromising linearizability because the uncertainty interval of concurrent and later readers ensures atomic visibility of the effects of the transaction. In other words, all of the transaction's intents will become visible and will remain visible at once, which is sometimes called "monotonic reads". This is true even if the resolved intents are at a high enough timestamp such that they are not visible to concurrent readers immediately after they are resolved, but only become visible sometime during the writer's commit-wait sleep. This property is central to the correctness of non-blocking transactions. See: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200811_non_blocking_txns.md However, if a transaction has a commit trigger, the side-effects of the trigger will go into effect immediately upon applying to corresponding Raft log entry. This poses a problem, because we do not want part of a transaction's effects (e.g. its commit trigger) to become visible to onlookers before the rest of its effects do (e.g. its intent writes). To avoid this problem, this commit adds special server-side logic to perform the commit-wait stage of a transaction with a commit trigger early, before its EndTxn evaluates and its commit trigger fires. This results in the transaction waiting longer to commit, run its commit trigger, and resolve its intents, but it is otherwise safe and effective. Interestingly, this is quite similar to how Spanner handles its commit-wait rule: > Before allowing any coordinator replica to apply the commit record, the > coordinator leader waits until TT.after(s), so as to obey the commit-wait rule > described in Section 4.1.2. Because the coordinator leader chose s based on > TT.now().latest, and now waits until that timestamp is guaranteed to be in the > past, the expected wait is at least 2 ∗ . This wait is typically overlapped with > Paxos communication. After commit wait, the coordinator sends the commit > timestamp to the client and all other participant leaders. Each participant > leader logs the transaction’s outcome through Paxos. All participants apply at > the same timestamp and then release locks. Of course, the whole point of non-blocking transactions is that we release locks early and use clocks (through uncertainty intervals + a reader-side commit-wait rule) to enforce consistency, so we don't want to make this change for standard transactions. Before this change, I could hit the bug in about 5 minutes of stressing kvnemesis on a roachprod cluster. After this change, I've been able to run kvnemesis for a few hours without issue. Release note (bug fix): Fixed a rare bug present in betas where rapid range splits and merges on a GLOBAL table could lead to a stuck leaseholder replica. The situation is no longer possible.

Allowing this is a major risk, as it would break node liveness updates, which perform a 1PC transaction with a commit trigger and can not tolerate being pushed into the future. We could try to block this at zone config update time, but due to the complexities of zone config inheritance, we would either need to be overly restrictive or risk missing cases which could wedge a cluster. This commit opts to disable the issue more directly.

cockroach-teamcity · 2021-04-27T23:10:44Z

This change is

andreimatei · 2021-04-27T23:22:12Z

LGTM

…

On Tue, Apr 27, 2021 at 7:11 PM cockroach-teamcity ***@***.***> wrote: This change is [image: Reviewable] <https://reviewable.io/reviews/cockroachdb/cockroach/64304> — You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub <#64304 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAC4C4M4KWMVYVELXKT7WIDTK5AALANCNFSM43V5ONRA> .

nvanbenschoten added 3 commits April 27, 2021 19:10

hlc: support context cancellation in Clock.SleepUntil

412f248

This commit adds support to `(*hlc.Clock).SleepUntil` to listen to context cancellation and terminate its sleep early if the context that is passed to it is canceled.

nvanbenschoten requested a review from andreimatei April 27, 2021 23:10

nvanbenschoten requested a review from a team as a code owner April 27, 2021 23:10

nvanbenschoten merged commit 2bf7515 into cockroachdb:release-21.1 Apr 28, 2021

nvanbenschoten deleted the backport21.1-63971 branch April 28, 2021 03:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-21.1: kv: commit-wait before running commit triggers and resolving intents #64304

release-21.1: kv: commit-wait before running commit triggers and resolving intents #64304

nvanbenschoten commented Apr 27, 2021

cockroach-teamcity commented Apr 27, 2021

andreimatei commented Apr 27, 2021 via email

release-21.1: kv: commit-wait before running commit triggers and resolving intents #64304

release-21.1: kv: commit-wait before running commit triggers and resolving intents #64304

Conversation

nvanbenschoten commented Apr 27, 2021

cockroach-teamcity commented Apr 27, 2021

andreimatei commented Apr 27, 2021 via email