kv: stale read because intent from our past gets committed in our certain future? #36431
Comments
I'm too distant from the code to know whether this is or is not a problem, but that list of steps looks almost like a test. It seems like this could be transformed into a test so you can verify whether there is a problem, or whether there is a mechanism we're forgetting about that prevents it.
I think you're right; I can't think of anything that would prevent this stale read.
Node A's clock I think is not pushed to 105 by this ResolveIntent.
This may be a problem in its own right.
I thought the refresh on the original txn would push node A's clock forward to ts 105, which should be the low water mark for any subsequent txns coordinated by that node. Any node RPC coming in should forward the recipient node's clock.
On Tue, Apr 2, 2019 at 3:46 PM Ben Darnell wrote:

> I think you're right; I can't think of anything that would prevent this stale read.
> Node A's clock I think is not pushed to 105 by this ResolveIntent.
> This may be a problem in its own right.
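As a rough illustration of the clock-forwarding behavior described in this comment, here is a minimal sketch of a hybrid logical clock that never regresses and is forwarded by timestamps carried on incoming RPCs. The type and method names are simplified stand-ins, not CockroachDB's actual pkg/util/hlc API.

```go
package main

import (
	"fmt"
	"sync"
)

// hlc is a simplified hybrid logical clock: it never regresses, and it can be
// forwarded by timestamps observed on incoming RPCs. Illustrative only.
type hlc struct {
	mu   sync.Mutex
	wall int64 // highest timestamp handed out or observed so far
}

// Now returns a reading that is at least as large as any previously returned
// or observed timestamp.
func (c *hlc) Now() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.wall++ // local tick
	return c.wall
}

// Update forwards the clock to a remote timestamp carried by an incoming RPC.
// This is the mechanism the comment above relies on: any RPC that reaches a
// node should leave that node's clock at or above the sender's timestamp.
func (c *hlc) Update(remote int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remote > c.wall {
		c.wall = remote
	}
}

func main() {
	nodeA := &hlc{wall: 95}
	// An RPC stamped at 105 (e.g. from the txn coordinator) arrives at node A.
	nodeA.Update(105)
	fmt.Println(nodeA.Now() >= 105) // true: subsequent readings are >= 105
}
```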
But we don't refresh blind writes, only reads. If we did send an RPC during the refresh to each leaseholder that holds at least one of the transaction's intents, in order to sync clocks, then I think we would be ok. Also,
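A minimal sketch of the idea suggested here: during a refresh, visit the leaseholder of every range holding one of the transaction's intents and forward its clock to the refreshed timestamp. All names are hypothetical; this is an illustration of the proposal, not an implementation.

```go
package txnrefresh

import "context"

// Leaseholder abstracts "a node we can send an RPC to". Hypothetical type
// used only for this sketch.
type Leaseholder interface {
	// Ping delivers ts to the node, whose HLC clock is forwarded to at least
	// ts on receipt (see the clock sketch above).
	Ping(ctx context.Context, ts int64) error
}

// syncIntentLeaseholderClocks forwards the clock of every leaseholder that
// holds one of the transaction's intents to the refreshed timestamp, so that
// those clocks lead the eventual commit timestamp.
func syncIntentLeaseholderClocks(ctx context.Context, refreshedTS int64, intentLeaseholders []Leaseholder) error {
	for _, lh := range intentLeaseholders {
		if err := lh.Ping(ctx, refreshedTS); err != nil {
			return err
		}
	}
	return nil
}
```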
So the simplest fix I can think of for this is to add the |
To formalize this a bit, I think the invariant we need to hold for these observed timestamps to be safe is that "a transaction can only commit if any node that serves as the leaseholder for one of its writes, as an intent or as a committed value, at any point in the future has an HLC clock with an equal or higher timestamp than the commit timestamp".

We can simplify this a little bit. First, we assume that HLC clocks never regress, so a clock reading taken before intent resolution will always be smaller than a clock reading taken after intent resolution. This allows us to avoid talking about committed values directly. Second, all forms of lease acquisition ensure that the new leaseholder's clock is larger than the clock of the old leaseholder at the time that the old leaseholder stopped serving traffic. This allows us to avoid talking about future leaseholders directly.

So we can refine this invariant to: "at the time of commit, any leaseholder for a Range that contains an intent for a transaction must have an HLC clock with an equal or higher timestamp than the transaction's commit timestamp."
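Stated as code, the refined invariant might look like the following sketch, using plain integer timestamps and hypothetical names: a commit at commitTS is only safe if every leaseholder still holding one of the transaction's intents has a clock reading at or above commitTS.

```go
package observedts

// checkObservedTimestampInvariant expresses the refined invariant above as a
// boolean check. intentLeaseholderClocks holds a clock reading from each
// leaseholder that still stores one of the transaction's intents at commit
// time. Illustrative only; not CockroachDB code.
func checkObservedTimestampInvariant(commitTS int64, intentLeaseholderClocks []int64) bool {
	for _, clock := range intentLeaseholderClocks {
		if clock < commitTS {
			// A later reader could capture an observed timestamp below
			// commitTS from this leaseholder and then miss the resolved
			// intent: the stale read described in this issue.
			return false
		}
	}
	return true
}
```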
Good find. I don't have much to add; Nathan's final invariant makes sense to me.
You mean Node A here, right? The sentence earlier mentions that 105 was taken off Node B's clock.
Yes, thanks for catching that. I updated the issue.
Is this a release blocker? Please add to #35554 if needed.
It's not a release blocker because this behavior is not new. But it is bad and we have to do something. I personally still need to put thought into this, but others seem to be ahead of me.
+1 to what @andreimatei said. It's not a release blocker because we've had this issue since 2.0, and maybe always with snapshot transactions (back when we had those). I've been trying to demonstrate that this can cause a single-key linearizability violation with cockroachdb/jepsen#19. Once I'm confident that we have randomized testing that would catch this class of error, I'll go in and fix the actual issue.
36554: roachpb: formalize observed timestamp preconditions, talk about leaseholders r=nvanbenschoten a=nvanbenschoten

This PR is an initial reaction to #36431 and to `limitTxnMaxTimestamp` being broken for follower reads (PR for that coming soon).

To start, we were missing any real documentation on what makes the observed timestamp mechanism safe and what invariants need to hold in order for a clock reading on a node to imply an absence of causality on any values later read at a higher timestamp on that node. The PR addresses this by discussing the invariants necessary for observed timestamps to be used to bound a transaction's uncertainty window.

In doing so, the PR also attempts to address a second issue, which is that the documentation conflated a node serving reads and writes with leaseholders serving reads and writes. At the time that this was all written, these were equivalent. Now that we have follower reads, the omission of a discussion about leaseholders made this all hard to think about (e.g. "whose observed timestamp do we care about when performing a follower read?"). The PR addresses this by adjusting the wording around nodes and leaseholders. This would have come in handy in reasoning about #23749.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
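The PR text above describes observed timestamps being used to bound a transaction's uncertainty window. As a rough illustration, the sketch below (plain int64 timestamps, hypothetical names; not the actual `limitTxnMaxTimestamp` code) shows how an observed timestamp captured from a node earlier in the transaction can tighten the uncertainty limit used when reading from that node, while never dropping below the read timestamp.

```go
package uncertainty

// localUncertaintyLimit computes the effective upper bound of the uncertainty
// interval to use when reading from a node for which the transaction holds an
// observed timestamp. Illustrative sketch only.
func localUncertaintyLimit(readTS, globalLimit, observedTS int64, haveObservation bool) int64 {
	limit := globalLimit // readTS + maximum tolerated clock offset
	if haveObservation && observedTS < limit {
		// Values written on this node above the observed timestamp could not
		// have been served before the observation was captured, so they need
		// not be treated as uncertain. This is exactly the guarantee that the
		// invariants discussed in this PR are meant to uphold.
		limit = observedTS
	}
	if limit < readTS {
		// The uncertainty interval never shrinks below the read timestamp.
		limit = readTS
	}
	return limit
}
```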
@tbg how do you feel about reintroducing
SGTM. I would suggest changing the name, but once the hlc updates are hoisted into the interceptors it is going to be a true NoopRequest, and the comment on the request should explain why it exists instead.
Fixes cockroachdb#58459. Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional requests did not ensure single-key linearizability even if they deferred their timestamp allocation to the leaseholder of their (single) range. They still don't entirely, because of cockroachdb#36431, but this change brings us one step closer to the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty intervals. This ensures that they also guarantee single-key linearizability even with only loose (but bounded) clock synchronization. Non-transactional requests that use a client-provided timestamp do not have uncertainty intervals and do not make real-time ordering guarantees. Unlike transactions, which establish an uncertainty interval on their coordinator node during initialization, non-transactional requests receive uncertainty intervals from their target leaseholder, using a clock reading from the leaseholder's local HLC as the local limit and this clock reading + the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty intervals — after all, they receive their timestamp from the leaseholder of the only range that they talk to, so isn't every value with a commit timestamp above their read timestamp certainly concurrent? The answer is surprisingly "no", for the following reasons, so they cannot forgo the use of uncertainty intervals:

1. The request timestamp is allocated before consulting the replica's lease. This means that there are times when the replica is not the leaseholder at the point of timestamp allocation, and only becomes the leaseholder later. In such cases, the timestamp assigned to the request is not guaranteed to be greater than the written_timestamp of all writes served by the range at the time of allocation. This is true despite invariants 1 & 2 presented in `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the timestamp is not yet the leaseholder. In cases where the replica that assigned the non-transactional request's timestamp takes over as the leaseholder after the timestamp allocation, we expect minimumLocalLimitForLeaseholder to forward the local uncertainty limit above TimestampFromServerClock, to the lease start time. For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as the limit of its uncertainty interval to ensure that it observes k = v1 in its uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1.

2. Even if the replica's lease is stable and the timestamp is assigned to the non-transactional request by the leaseholder, the assigned clock reading only reflects the written_timestamp of all of the writes served by the leaseholder (and previous leaseholders) thus far. This clock reading is not guaranteed to lead the commit timestamp of all of these writes, especially if they are committed remotely and resolved after the request has received its clock reading but before the request begins evaluating.

As a result, the non-transactional request needs an uncertainty interval with a global uncertainty limit far enough in advance of the leaseholder's local HLC clock to ensure that it considers any value uncertain if that value was part of a transaction which could have committed before the request was received by the leaseholder. Concretely, the non-transactional request needs to consider values of the following form to be uncertain:

`written_timestamp < local_limit && commit_timestamp < global_limit`

The value that the non-transactional request is observing may have been written on the local leaseholder at time 10, its transaction may have been committed remotely at time 20 and acknowledged, then the non-transactional request may have begun and received a timestamp of 15 from the local leaseholder, and finally the value may have been resolved asynchronously and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20). The failure of the non-transactional request to observe this value would be a stale read. For example, consider the following series of events:

- client A begins a txn and is assigned provisional commit timestamp = 95
- client A's txn performs a Put(k, v1)
- leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
- client A's txn performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to 100
- client A's txn refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of cockroachdb#36431), so that leaseholder's clock is not updated
- client A's txn commits remotely and client A receives the acknowledgment
- client B initiates a non-txn read of k
- leaseholder assigns read timestamp ts = 97
- asynchronous intent resolution resolves the txn's intent at k, moving v1 to ts = 100 in the process
- the non-txn request must use an uncertainty interval that extends past 100 to ensure that it observes k = v1 in its uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read.

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional requests to benefit from our incoming fix for that issue. Second, it unblocks some of the clock refactors proposed in cockroachdb#72121 (comment), and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make this change. We know from cockroachdb#36431 that invariant 1 from [`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131) doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send` masks the bugs in this area in many cases. Removing that clock update (I don't actually plan to remove it, but I plan to disconnect it entirely from operation timestamps) without first giving non-transactional requests uncertainty intervals seems like it may reveal these bugs in ways we haven't seen in the past. So I'd like to land this fix before making that change.

----

Release note: None
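To make the interval construction and the quoted predicate concrete, here is a minimal sketch under the same simplifications (plain int64 timestamps, hypothetical names; not the actual pkg/kv/kvserver/uncertainty code).

```go
package uncertainty

// uncertaintyInterval sketches the interval a non-transactional request
// receives from its target leaseholder.
type uncertaintyInterval struct {
	localLimit  int64 // leaseholder's clock reading when the request arrived
	globalLimit int64 // localLimit + the cluster's maximum clock offset
}

// intervalFromLeaseholderClock builds the interval from a leaseholder clock
// reading, mirroring the construction described in the commit message above.
func intervalFromLeaseholderClock(clockNow, maxOffset int64) uncertaintyInterval {
	return uncertaintyInterval{localLimit: clockNow, globalLimit: clockNow + maxOffset}
}

// isUncertain reports whether a value with the given written (provisional)
// and commit timestamps must be treated as uncertain by a request reading at
// readTS, forcing a server-side retry at a higher timestamp. This is the
// predicate written_timestamp < local_limit && commit_timestamp < global_limit.
func (u uncertaintyInterval) isUncertain(readTS, writtenTS, commitTS int64) bool {
	if commitTS <= readTS {
		return false // already visible to the read; returned, not retried
	}
	return writtenTS < u.localLimit && commitTS < u.globalLimit
}
```

For the second example above, a value with written_timestamp 95 and commit_timestamp 100 read at ts 97 is uncertain as long as the global limit extends past 100, which is exactly what forces the server-side retry.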
Fixes cockroachdb#36431. Fixes cockroachdb#49360. Replaces cockroachdb#72121. Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the "remember when intents were written" approach described in cockroachdb#36431 (comment) and later expanded on in cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we [introduced](cockroachdb/jepsen#19) the `multi-register` test to the Jepsen test suite, and have since observed the test fail when combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few issues linked to that one). This commit stabilizes that test.

### Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[^1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. The reconciliation that moves the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and happens after the commit has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200.

To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make. This would appear to undermine observed timestamps to the point where they cannot be used.

However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction[^2] of transactions commit at a different timestamp than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp at which a committed value was originally written (its provisional timestamp if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real-time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real-time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and can therefore consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart. For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the key encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.

### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the MVCCValue's encoding is identical to that of roachpb.Value. This provides backwards compatibility and ensures that the MVCCValue optimizes away in the common case. If the value's header is not empty, it is prepended to the roachpb.Value encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which is either the roachpb.Value tag (which has many possible values) or a sentinel tag not used by the roachpb.Value encoding which indicates the extended encoding scheme. Care was taken to ensure that encoding and decoding routines for the "simple" encoding are fast by avoiding heap allocations, memory copies, and function calls by exploiting mid-stack inlining.

### Future improvements

As noted in cockroachdb#72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables. The synthetic bit's first role of providing dynamic typing for `ClockTimestamps` is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock. The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establishes a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two `ClockTimestamps`, the version's local timestamp and the reader's observed timestamp.

### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate. However, the commit does add a new test called `TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads. TODO(nvanbenschoten):

- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
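As a rough illustration of the fix described above, the sketch below (hypothetical types, plain int64 timestamps; the real logic lives in pkg/kv/kvserver/uncertainty and the MVCC layer) shows how the uncertainty check changes once each version carries a local timestamp: the reader's observed timestamp is compared against the timestamp at which the value was originally written rather than the commit timestamp it was later moved to.

```go
package uncertainty

// mvccVersion pairs a version (commit) timestamp, which governs visibility,
// with the new local timestamp: the leaseholder's clock reading when the
// value was originally written. Hypothetical type for illustration.
type mvccVersion struct {
	versionTS int64 // commit timestamp the value was (possibly) moved to
	localTS   int64 // leaseholder clock when the value was originally written
}

// valueIsUncertain reports whether a reader at readTS, with the given global
// uncertainty limit and an observed timestamp captured from the key's
// leaseholder, must treat the version as uncertain and restart.
func valueIsUncertain(readTS, globalLimit, observedTS int64, v mvccVersion) bool {
	if v.versionTS <= readTS {
		return false // plainly visible to the read
	}
	if v.versionTS > globalLimit {
		return false // beyond the maximum clock offset; cannot precede the read
	}
	// The observed timestamp is compared against the local timestamp, not the
	// version timestamp. If the value was already written on the leaseholder
	// when the observation was captured (localTS <= observedTS), its
	// transaction may have committed and been acknowledged before the reader
	// began, so the reader must restart. If the observation predates the
	// write (localTS > observedTS), the write was concurrent with the read
	// and no restart is needed.
	return v.localTS <= observedTS
}
```

In the scenario from this issue, the resolved intent ends up with versionTS = 100 but localTS = 95, so a reader whose observed timestamp is 96 still treats it as uncertain and restarts instead of serving a stale read.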
80706: kv/storage: introduce local timestamps for MVCC versions in MVCCValue r=sumeerbhola a=nvanbenschoten

Fixes #36431. Fixes #49360. Replaces #72121. Replaces #77342.

**NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.**

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment). This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read.

Upon the discovery of the bug, we [introduced](cockroachdb/jepsen#19) the `multi-register` test to the Jepsen test suite, and have since observed the test fail when combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

### Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[^1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. The reconciliation that moves the intent write's timestamp up to its transaction's commit timestamp is asynchronous with the transaction commit (and happens after the commit has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make. This would appear to undermine observed timestamps to the point where they cannot be used.

However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction[^2] of transactions commit at a different timestamp than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp at which a committed value was originally written (its provisional timestamp if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real-time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real-time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written, and can therefore consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart. For more, see the updates made in this commit to `pkg/kv/kvserver/observedts/doc.go`.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.
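To make the new check concrete, here is a minimal sketch of the comparison described above (plain Go with simplified integer timestamps; `Version`, `isUncertain`, and the parameter names are invented for illustration and are not the actual CockroachDB API):

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an HLC timestamp (wall time only).
type Timestamp int64

// Version is a simplified MVCC version carrying both the commit (version)
// timestamp, which may have been pulled up on commit, and the local timestamp
// taken from the leaseholder's clock when the value was originally written.
type Version struct {
	VersionTS Timestamp
	LocalTS   Timestamp
}

// isUncertain reports whether a reader must treat the version as possibly
// having been written before the reader began, forcing an uncertainty restart.
// readTS is the reader's timestamp, globalLimit its uncertainty upper bound,
// and observedTS the timestamp the reader previously observed on the
// version's leaseholder.
func isUncertain(v Version, readTS, globalLimit, observedTS Timestamp) bool {
	if v.VersionTS <= readTS || v.VersionTS > globalLimit {
		return false // already visible, or entirely beyond the uncertainty window
	}
	// Key change: the observed timestamp is compared against the version's
	// local timestamp, not its version timestamp. Only if the reader captured
	// its clock observation strictly before the value was written on the
	// leaseholder can the restart be skipped.
	return observedTS >= v.LocalTS
}

func main() {
	// Scenario from the explanation above: the intent was written while the
	// leaseholder's clock read 100, but the transaction later committed at 200
	// and the intent was resolved to 200 asynchronously.
	v := Version{VersionTS: 200, LocalTS: 100}

	// A later reader starts at 150 with an observed timestamp of 150 from the
	// value's leaseholder, whose clock may still trail the commit timestamp.
	fmt.Println(isUncertain(v, 150, 650, 150)) // true: restart rather than a stale read
}
```

Under the old rule, which compared the observed timestamp of 150 against the version timestamp of 200, the same reader would have skipped the restart and missed the committed value.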
### MVCCValue

```go
type MVCCValue struct {
	MVCCValueHeader
	Value roachpb.Value
}
```

To store the local timestamp, the commit introduces a new `MVCCValue` type to parallel the `MVCCKey` type. `MVCCValue` wraps a `roachpb.Value` and extends it with MVCC-level metadata which is stored in an `enginepb.MVCCValueHeader` protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the `roachpb.Value` encoding.

To further avoid any performance hit, `MVCCValue` has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the MVCC value's encoding is identical to that of `roachpb.Value`. This provides backwards compatibility and ensures that the `MVCCValue` optimizes away in the common case. If the value's header is not empty, it is prepended to the `roachpb.Value` encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):
  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):
  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding variants are distinguished using the 5th byte, which is either the `roachpb.Value` tag (which has many possible values) or a sentinel tag not used by the `roachpb.Value` encoding, which indicates the extended encoding scheme. Care was taken to ensure that the encoding and decoding routines for the "simple" encoding are fast, avoiding heap allocations, memory copies, and function calls by exploiting mid-stack inlining. See microbenchmarks below.
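As a rough illustration of how the two variants can be told apart, the sketch below peeks at the 5th byte and splits off the prepended header. The sentinel constant, the function name, and the exact layout of the length prefix are assumptions made for this sketch, not the actual pkg/storage routines:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// extendedEncodingSentinel stands in for the tag byte reserved to mark the
// extended encoding; the real value is simply one that can never appear as a
// valid roachpb.Value tag. The constant chosen here is hypothetical.
const extendedEncodingSentinel byte = 0xe5

// splitMVCCRawValue distinguishes the "simple" encoding (a bare roachpb.Value)
// from the "extended" encoding (an MVCC value header prepended to the
// roachpb.Value bytes). It returns the raw header bytes (nil for the simple
// encoding) and the raw roachpb.Value bytes.
func splitMVCCRawValue(buf []byte) (header, value []byte, err error) {
	if len(buf) == 0 {
		return nil, nil, nil // empty value (e.g. a tombstone) with an empty header
	}
	if len(buf) < 5 {
		return nil, nil, errors.New("MVCC value too short")
	}
	if buf[4] != extendedEncodingSentinel {
		// Simple encoding: <4-byte-checksum><1-byte-tag><encoded-data>.
		return nil, buf, nil
	}
	// Extended encoding:
	// <4-byte-header-len><1-byte-sentinel><mvcc-header><checksum><tag><data>.
	// This sketch assumes the length prefix counts only the <mvcc-header> bytes.
	headerLen := int(binary.BigEndian.Uint32(buf[:4]))
	if len(buf) < 5+headerLen {
		return nil, nil, errors.New("malformed extended MVCC value")
	}
	return buf[5 : 5+headerLen], buf[5+headerLen:], nil
}

func main() {
	// A "simple" value: 4-byte checksum, a roachpb.Value tag byte, then data.
	simple := []byte{0, 0, 0, 0, 0x03, 'h', 'i'}
	header, value, _ := splitMVCCRawValue(simple)
	fmt.Println(len(header), len(value)) // 0 7
}
```

Because the common case never hits the extended branch, the simple-encoding path stays a near no-op, which is what keeps the new value wrapper essentially free.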
### Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables. The synthetic bit's first role of providing dynamic typing for `ClockTimestamps` is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock. The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establishes a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two `ClockTimestamps`: the version's local timestamp and the reader's observed timestamp.

### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate. However, the commit does add a new test called `TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).
<details>
<summary><b>MVCCValue Encoding and Decoding Microbenchmarks</b></summary>

```
name                                                                    time/op
EncodeMVCCValue/header=empty/value=tombstone-10                         3.11ns ± 0%
EncodeMVCCValue/header=empty/value=short-10                             3.11ns ± 0%
EncodeMVCCValue/header=empty/value=long-10                              3.10ns ± 0%
EncodeMVCCValue/header=local_walltime/value=tombstone-10                38.9ns ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    42.1ns ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     533ns ± 3%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        40.5ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            42.9ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             541ns ± 4%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            7.81ns ± 1%
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.93ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=false-10                8.39ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=true-10                 1.55ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=false-10                 8.38ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=true-10                  1.55ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        22.7ns ± 0%

name                                                                    alloc/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00B
EncodeMVCCValue/header=empty/value=short-10                             0.00B
EncodeMVCCValue/header=empty/value=long-10                              0.00B
EncodeMVCCValue/header=local_walltime/value=tombstone-10                24.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    32.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     4.86kB ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        24.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            32.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             4.86kB ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00B
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00B
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00B
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00B

name                                                                    allocs/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00
EncodeMVCCValue/header=empty/value=short-10                             0.00
EncodeMVCCValue/header=empty/value=long-10                              0.00
EncodeMVCCValue/header=local_walltime/value=tombstone-10                1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             1.00 ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00
```
</details>

<details>
<summary><b>pkg/sql/tests End-To-End Microbenchmarks</b></summary>

```
name                                old time/op    new time/op    delta
KV/Delete/Native/rows=1-30          106µs ± 2%     104µs ± 2%     -1.38% (p=0.012 n=8+10)
KV/Insert/SQL/rows=100-30           1.26ms ± 2%    1.24ms ± 2%    -1.25% (p=0.029 n=10+10)
KV/Scan/SQL/rows=1-30               281µs ± 2%     277µs ± 2%     -1.17% (p=0.009 n=10+10)
KV/Insert/Native/rows=1-30          107µs ± 2%     107µs ± 2%     ~ (p=0.353 n=10+10)
KV/Insert/Native/rows=10-30         156µs ± 2%     155µs ± 4%     ~ (p=0.481 n=10+10)
KV/Insert/Native/rows=100-30        570µs ± 1%     576µs ± 2%     ~ (p=0.075 n=10+10)
KV/Insert/Native/rows=1000-30       4.18ms ± 2%    4.23ms ± 2%    ~ (p=0.143 n=10+10)
KV/Insert/Native/rows=10000-30      43.9ms ± 2%    44.1ms ± 2%    ~ (p=0.393 n=10+10)
KV/Insert/SQL/rows=1-30             412µs ± 2%     410µs ± 2%     ~ (p=0.280 n=10+10)
KV/Insert/SQL/rows=10-30            511µs ± 2%     508µs ± 1%     ~ (p=0.436 n=10+10)
KV/Insert/SQL/rows=1000-30          8.18ms ± 1%    8.11ms ± 2%    ~ (p=0.165 n=10+10)
KV/Insert/SQL/rows=10000-30         85.2ms ± 2%    84.7ms ± 2%    ~ (p=0.481 n=10+10)
KV/Update/Native/rows=1-30          163µs ± 2%     162µs ± 2%     ~ (p=0.436 n=10+10)
KV/Update/Native/rows=10-30         365µs ± 1%     365µs ± 1%     ~ (p=0.739 n=10+10)
KV/Update/Native/rows=10000-30      143ms ± 1%     144ms ± 2%     ~ (p=0.075 n=10+10)
KV/Update/SQL/rows=1-30             525µs ± 2%     522µs ± 3%     ~ (p=0.190 n=10+10)
KV/Update/SQL/rows=10-30            815µs ± 1%     810µs ± 1%     ~ (p=0.190 n=10+10)
KV/Update/SQL/rows=100-30           2.47ms ± 1%    2.48ms ± 1%    ~ (p=0.356 n=9+10)
KV/Update/SQL/rows=1000-30          16.6ms ± 1%    16.7ms ± 1%    ~ (p=0.065 n=9+10)
KV/Update/SQL/rows=10000-30         193ms ± 2%     195ms ± 3%     ~ (p=0.315 n=10+10)
KV/Delete/Native/rows=10-30         173µs ± 2%     171µs ± 3%     ~ (p=0.247 n=10+10)
KV/Delete/Native/rows=100-30        677µs ± 1%     674µs ± 0%     ~ (p=0.475 n=10+7)
KV/Delete/Native/rows=1000-30       5.20ms ± 2%    5.19ms ± 1%    ~ (p=0.853 n=10+10)
KV/Delete/Native/rows=10000-30      53.7ms ± 1%    53.9ms ± 1%    ~ (p=0.684 n=10+10)
KV/Delete/SQL/rows=1-30             377µs ± 3%     375µs ± 1%     ~ (p=0.305 n=10+9)
KV/Delete/SQL/rows=10-30            503µs ± 1%     500µs ± 2%     ~ (p=0.123 n=10+10)
KV/Delete/SQL/rows=100-30           1.52ms ± 4%    1.51ms ± 2%    ~ (p=0.529 n=10+10)
KV/Delete/SQL/rows=1000-30          2.53ms ± 3%    2.53ms ± 3%    ~ (p=1.000 n=10+10)
KV/Delete/SQL/rows=10000-30         21.9ms ± 1%    21.8ms ± 1%    ~ (p=0.315 n=9+10)
KV/Scan/Native/rows=1-30            29.6µs ± 2%    29.8µs ± 1%    ~ (p=0.143 n=10+10)
KV/Scan/Native/rows=100-30          54.6µs ± 1%    55.0µs ± 2%    ~ (p=0.052 n=10+10)
KV/Scan/SQL/rows=10-30              292µs ± 1%     290µs ± 1%     ~ (p=0.190 n=10+10)
KV/Scan/SQL/rows=100-30             364µs ± 2%     363µs ± 1%     ~ (p=0.684 n=10+10)
KV/Scan/SQL/rows=10000-30           5.34ms ± 2%    5.32ms ± 1%    ~ (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30  2.59ms ± 1%    2.59ms ± 2%    ~ (p=0.842 n=9+10)
KVAndStorageUsingSQL/versions=4-30  2.57ms ± 3%    2.56ms ± 2%    ~ (p=0.684 n=10+10)
KVAndStorageUsingSQL/versions=8-30  2.60ms ± 3%    2.59ms ± 2%    ~ (p=0.315 n=10+10)
KV/Update/Native/rows=100-30        2.02ms ± 1%    2.04ms ± 1%    +0.95% (p=0.015 n=10+10)
KV/Update/Native/rows=1000-30       16.2ms ± 2%    16.5ms ± 2%    +1.30% (p=0.001 n=10+9)
KV/Scan/Native/rows=10-30           32.6µs ± 2%    33.0µs ± 3%    +1.39% (p=0.019 n=10+10)
KV/Scan/Native/rows=1000-30         266µs ± 1%     270µs ± 1%     +1.51% (p=0.000 n=9+10)
KV/Scan/SQL/rows=1000-30            982µs ± 1%     997µs ± 1%     +1.60% (p=0.000 n=10+10)
KV/Scan/Native/rows=10000-30        2.18ms ± 1%    2.23ms ± 1%    +2.55% (p=0.000 n=10+10)

name                                old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30          15.6kB ± 0%    15.6kB ± 0%    ~ (p=0.631 n=10+10)
KV/Insert/Native/rows=10000-30      34.7MB ± 2%    35.0MB ± 2%    ~ (p=0.089 n=10+10)
KV/Insert/SQL/rows=1-30             41.3kB ± 1%    41.4kB ± 1%    ~ (p=0.353 n=10+10)
KV/Insert/SQL/rows=10-30            90.8kB ± 1%    90.8kB ± 0%    ~ (p=0.945 n=10+8)
KV/Insert/SQL/rows=1000-30          5.69MB ± 1%    5.71MB ± 1%    ~ (p=0.436 n=10+10)
KV/Insert/SQL/rows=10000-30         69.6MB ± 1%    69.6MB ± 1%    ~ (p=0.853 n=10+10)
KV/Update/Native/rows=1-30          21.4kB ± 1%    21.4kB ± 0%    ~ (p=0.915 n=9+9)
KV/Update/Native/rows=10000-30      67.8MB ± 1%    68.1MB ± 2%    ~ (p=0.315 n=10+10)
KV/Update/SQL/rows=1-30             47.2kB ± 1%    47.3kB ± 1%    ~ (p=0.280 n=10+10)
KV/Update/SQL/rows=10-30            116kB ± 1%     117kB ± 1%     ~ (p=0.353 n=10+10)
KV/Update/SQL/rows=1000-30          5.20MB ± 2%    5.18MB ± 1%    ~ (p=0.278 n=10+9)
KV/Delete/Native/rows=10000-30      30.5MB ± 2%    30.5MB ± 1%    ~ (p=0.631 n=10+10)
KV/Delete/SQL/rows=1-30             42.9kB ± 0%    43.0kB ± 1%    ~ (p=0.393 n=10+10)
KV/Delete/SQL/rows=10-30            66.9kB ± 3%    66.5kB ± 1%    ~ (p=0.853 n=10+10)
KV/Delete/SQL/rows=1000-30          1.19MB ± 0%    1.19MB ± 0%    ~ (p=0.447 n=9+10)
KV/Delete/SQL/rows=10000-30         14.3MB ± 1%    14.3MB ± 0%    ~ (p=0.353 n=10+10)
KV/Scan/Native/rows=1-30            7.17kB ± 0%    7.18kB ± 0%    ~ (p=0.061 n=10+9)
KV/Scan/Native/rows=10-30           8.59kB ± 0%    8.59kB ± 0%    ~ (p=0.926 n=10+10)
KV/Scan/Native/rows=100-30          21.2kB ± 0%    21.2kB ± 0%    ~ (p=0.781 n=10+10)
KV/Scan/Native/rows=1000-30         172kB ± 0%     172kB ± 0%     ~ (p=0.264 n=10+8)
KV/Scan/Native/rows=10000-30        1.51MB ± 0%    1.51MB ± 0%    ~ (p=0.968 n=9+10)
KV/Scan/SQL/rows=1-30               19.3kB ± 0%    19.3kB ± 1%    ~ (p=0.968 n=9+10)
KV/Scan/SQL/rows=10-30              20.6kB ± 1%    20.7kB ± 0%    ~ (p=0.897 n=10+8)
KV/Scan/SQL/rows=100-30             32.8kB ± 1%    32.9kB ± 0%    ~ (p=0.400 n=10+9)
KV/Scan/SQL/rows=1000-30            185kB ± 0%     185kB ± 0%     ~ (p=0.247 n=10+10)
KV/Scan/SQL/rows=10000-30           1.10MB ± 0%    1.10MB ± 0%    ~ (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30  793kB ± 3%     793kB ± 2%     ~ (p=0.796 n=10+10)
KVAndStorageUsingSQL/versions=4-30  788kB ± 3%     783kB ± 3%     ~ (p=0.353 n=10+10)
KVAndStorageUsingSQL/versions=8-30  787kB ± 3%     785kB ± 2%     ~ (p=0.853 n=10+10)
KV/Delete/Native/rows=1-30          15.1kB ± 0%    15.2kB ± 0%    +0.26% (p=0.029 n=10+10)
KV/Update/Native/rows=10-30         66.8kB ± 0%    67.0kB ± 0%    +0.38% (p=0.029 n=10+10)
KV/Insert/SQL/rows=100-30           550kB ± 1%     553kB ± 1%     +0.44% (p=0.043 n=10+10)
KV/Update/SQL/rows=100-30           578kB ± 1%     580kB ± 1%     +0.46% (p=0.019 n=10+10)
KV/Update/Native/rows=100-30        503kB ± 0%     506kB ± 0%     +0.59% (p=0.000 n=10+10)
KV/Update/SQL/rows=10000-30         153MB ± 1%     154MB ± 1%     +0.62% (p=0.006 n=9+10)
KV/Delete/SQL/rows=100-30           306kB ± 0%     308kB ± 0%     +0.67% (p=0.000 n=10+8)
KV/Delete/Native/rows=10-30         36.1kB ± 1%    36.4kB ± 1%    +0.82% (p=0.001 n=10+9)
KV/Update/Native/rows=1000-30       4.75MB ± 1%    4.79MB ± 1%    +0.83% (p=0.002 n=10+10)
KV/Insert/Native/rows=10-30         39.7kB ± 1%    40.1kB ± 1%    +0.83% (p=0.023 n=10+10)
KV/Insert/Native/rows=1000-30       2.56MB ± 1%    2.58MB ± 1%    +1.00% (p=0.000 n=9+10)
KV/Insert/Native/rows=100-30        279kB ± 1%     282kB ± 1%     +1.02% (p=0.007 n=10+10)
KV/Delete/Native/rows=100-30        243kB ± 1%     246kB ± 1%     +1.07% (p=0.000 n=10+10)
KV/Delete/Native/rows=1000-30       2.23MB ± 1%    2.26MB ± 1%    +1.33% (p=0.000 n=10+9)

name                                old allocs/op  new allocs/op  delta
KV/Update/SQL/rows=1000-30          58.6k ± 0%     58.4k ± 0%     -0.28% (p=0.000 n=9+9)
KV/Scan/SQL/rows=10-30              277 ± 0%       276 ± 0%       -0.25% (p=0.001 n=8+10)
KV/Update/SQL/rows=10000-30         740k ± 0%      739k ± 0%      -0.12% (p=0.040 n=8+7)
KV/Insert/Native/rows=1-30          129 ± 0%       129 ± 0%       ~ (all equal)
KV/Insert/Native/rows=10-30         272 ± 0%       272 ± 0%       ~ (all equal)
KV/Insert/Native/rows=1000-30       14.3k ± 0%     14.3k ± 0%     ~ (p=0.221 n=10+10)
KV/Insert/Native/rows=10000-30      141k ± 0%      141k ± 0%      ~ (p=0.356 n=10+9)
KV/Insert/SQL/rows=1-30             349 ± 0%       348 ± 0%       ~ (p=0.776 n=9+10)
KV/Insert/SQL/rows=10-30            625 ± 0%       625 ± 0%       ~ (p=0.068 n=9+8)
KV/Insert/SQL/rows=100-30           3.12k ± 0%     3.12k ± 0%     ~ (p=0.351 n=10+9)
KV/Insert/SQL/rows=1000-30          29.3k ± 0%     29.3k ± 0%     ~ (p=0.644 n=9+10)
KV/Insert/SQL/rows=10000-30         294k ± 0%      293k ± 0%      ~ (p=0.134 n=9+7)
KV/Update/Native/rows=1-30          181 ± 0%       181 ± 0%       ~ (all equal)
KV/Update/Native/rows=10-30         458 ± 0%       458 ± 0%       ~ (all equal)
KV/Update/Native/rows=1000-30       26.9k ± 0%     27.0k ± 0%     ~ (p=0.232 n=9+10)
KV/Update/Native/rows=10000-30      273k ± 0%      274k ± 0%      ~ (p=0.315 n=9+10)
KV/Update/SQL/rows=1-30             503 ± 0%       503 ± 0%       ~ (p=1.000 n=10+10)
KV/Update/SQL/rows=10-30            904 ± 1%       905 ± 0%       ~ (p=0.223 n=10+10)
KV/Update/SQL/rows=100-30           6.79k ± 0%     6.79k ± 0%     ~ (p=0.669 n=10+10)
KV/Delete/Native/rows=1-30          128 ± 0%       128 ± 0%       ~ (all equal)
KV/Delete/Native/rows=10-30         248 ± 0%       248 ± 0%       ~ (all equal)
KV/Delete/Native/rows=10000-30      112k ± 0%      112k ± 0%      ~ (p=0.825 n=10+9)
KV/Delete/SQL/rows=1-30             331 ± 0%       331 ± 0%       ~ (all equal)
KV/Delete/SQL/rows=10-30            512 ± 0%       512 ± 0%       ~ (p=0.406 n=8+10)
KV/Delete/SQL/rows=100-30           2.01k ± 0%     2.01k ± 0%     ~ (p=0.947 n=10+8)
KV/Delete/SQL/rows=1000-30          7.51k ± 0%     7.51k ± 0%     ~ (p=0.204 n=9+10)
KV/Delete/SQL/rows=10000-30         71.2k ± 0%     71.2k ± 0%     ~ (p=0.986 n=9+10)
KV/Scan/Native/rows=1-30            54.0 ± 0%      54.0 ± 0%      ~ (all equal)
KV/Scan/Native/rows=10-30           58.0 ± 0%      58.0 ± 0%      ~ (all equal)
KV/Scan/Native/rows=100-30          62.3 ± 1%      62.3 ± 1%      ~ (p=1.000 n=10+10)
KV/Scan/Native/rows=1000-30         72.3 ± 1%      72.0 ± 0%      ~ (p=0.137 n=10+8)
KV/Scan/Native/rows=10000-30        115 ± 1%       115 ± 1%       ~ (p=0.193 n=9+10)
KV/Scan/SQL/rows=1-30               242 ± 0%       242 ± 0%       ~ (p=0.828 n=9+10)
KV/Scan/SQL/rows=100-30             648 ± 0%       648 ± 0%       ~ (all equal)
KV/Scan/SQL/rows=1000-30            5.04k ± 0%     5.04k ± 0%     ~ (p=1.000 n=10+10)
KV/Scan/SQL/rows=10000-30           50.3k ± 0%     50.3k ± 0%     ~ (p=0.208 n=10+10)
KVAndStorageUsingSQL/versions=2-30  7.61k ± 0%     7.61k ± 0%     ~ (p=0.223 n=8+10)
KVAndStorageUsingSQL/versions=4-30  7.61k ± 0%     7.60k ± 0%     ~ (p=0.643 n=10+10)
KVAndStorageUsingSQL/versions=8-30  7.61k ± 0%     7.61k ± 0%     ~ (p=0.888 n=10+9)
KV/Update/Native/rows=100-30        2.76k ± 0%     2.76k ± 0%     +0.03% (p=0.044 n=10+10)
KV/Delete/Native/rows=1000-30       11.3k ± 0%     11.3k ± 0%     +0.03% (p=0.006 n=10+9)
KV/Insert/Native/rows=100-30        1.57k ± 0%     1.57k ± 0%     +0.06% (p=0.000 n=8+7)
KV/Delete/Native/rows=100-30        1.27k ± 0%     1.28k ± 0%     +0.08% (p=0.000 n=8+8)
```
</details>

<details>
<summary><b>YCSB benchmark suite</b></summary>

```
name                    old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.2k ± 6%   15.4k ± 3%   ~ (p=0.579 n=10+10)
ycsb/B/nodes=3          28.7k ± 4%   28.9k ± 1%   ~ (p=0.780 n=10+9)
ycsb/C/nodes=3          41.3k ± 4%   41.4k ± 2%   ~ (p=0.529 n=10+10)
ycsb/D/nodes=3          33.9k ± 2%   33.5k ± 2%   -1.37% (p=0.003 n=9+9)
ycsb/E/nodes=3          2.10k ± 2%   2.10k ± 1%   ~ (p=0.813 n=10+7)
ycsb/F/nodes=3          7.06k ± 5%   7.05k ± 4%   ~ (p=0.853 n=10+10)
ycsb/A/nodes=3/cpu=32   30.0k ± 9%   31.4k ± 3%   ~ (p=0.095 n=10+9)
ycsb/B/nodes=3/cpu=32   97.6k ± 2%   98.1k ± 2%   ~ (p=0.237 n=8+10)
ycsb/C/nodes=3/cpu=32   129k ± 2%    130k ± 1%    ~ (p=0.243 n=10+9)
ycsb/D/nodes=3/cpu=32   105k ± 2%    106k ± 1%    +1.06% (p=0.034 n=10+8)
ycsb/E/nodes=3/cpu=32   3.33k ± 1%   3.33k ± 1%   ~ (p=0.951 n=10+9)
ycsb/F/nodes=3/cpu=32   15.5k ±10%   16.4k ± 5%   +5.35% (p=0.029 n=10+10)

name                    old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3          6.32 ± 6%    6.23 ± 3%    ~ (p=0.635 n=10+10)
ycsb/B/nodes=3          5.02 ± 4%    4.97 ± 3%    ~ (p=0.542 n=10+10)
ycsb/C/nodes=3          3.49 ± 3%    3.50 ± 0%    ~ (p=0.443 n=10+9)
ycsb/D/nodes=3          2.80 ± 0%    2.87 ± 2%    +2.50% (p=0.008 n=8+10)
ycsb/E/nodes=3          45.8 ± 1%    45.8 ± 1%    ~ (p=0.718 n=10+7)
ycsb/F/nodes=3          13.5 ± 3%    13.6 ± 4%    ~ (p=0.509 n=9+10)
ycsb/A/nodes=3/cpu=32   4.82 ±10%    4.62 ± 6%    ~ (p=0.092 n=10+10)
ycsb/B/nodes=3/cpu=32   2.00 ± 0%    1.96 ± 3%    ~ (p=0.065 n=9+10)
ycsb/C/nodes=3/cpu=32   1.50 ± 0%    1.50 ± 0%    ~ (all equal)
ycsb/D/nodes=3/cpu=32   1.40 ± 0%    1.36 ± 4%    ~ (p=0.065 n=9+10)
ycsb/E/nodes=3/cpu=32   43.3 ± 1%    43.3 ± 0%    ~ (p=0.848 n=10+9)
ycsb/F/nodes=3/cpu=32   9.30 ±11%    8.79 ± 5%    -5.48% (p=0.022 n=10+10)

name                    old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3          52.0 ±17%    48.4 ±18%    ~ (p=0.379 n=10+10)
ycsb/B/nodes=3          32.2 ±11%    32.2 ± 9%    ~ (p=0.675 n=10+10)
ycsb/C/nodes=3          13.8 ± 6%    13.8 ± 3%    ~ (p=1.000 n=10+10)
ycsb/D/nodes=3          12.6 ± 0%    12.9 ± 2%    +2.38% (p=0.023 n=8+10)
ycsb/E/nodes=3          200 ± 8%     197 ± 6%     ~ (p=0.375 n=10+8)
ycsb/F/nodes=3          124 ±19%     125 ±21%     ~ (p=0.856 n=9+10)
ycsb/A/nodes=3/cpu=32   68.1 ±17%    63.9 ± 5%    ~ (p=0.211 n=10+8)
ycsb/B/nodes=3/cpu=32   15.7 ± 0%    15.5 ± 2%    ~ (p=0.071 n=6+10)
ycsb/C/nodes=3/cpu=32   10.5 ± 0%    10.2 ± 3%    -2.86% (p=0.003 n=8+10)
ycsb/D/nodes=3/cpu=32   8.70 ± 3%    8.40 ± 0%    -3.45% (p=0.003 n=10+8)
ycsb/E/nodes=3/cpu=32   151 ± 0%     151 ± 0%     ~ (all equal)
ycsb/F/nodes=3/cpu=32   148 ±15%     145 ±10%     ~ (p=0.503 n=9+10)
```
</details>

----

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).

Co-authored-by: Nathan VanBenschoten <[email protected]>
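To make the idea behind the `local_walltime` value headers and the release note above concrete, here is a minimal, self-contained sketch (toy `int64` timestamps and illustrative names only, not the actual CockroachDB API) of how an uncertainty check that consults a value's local timestamp, rather than its version timestamp, interacts with the reader's observed timestamp:

```go
package main

import "fmt"

// Toy wall-clock timestamps; the real implementation uses hlc.Timestamp and
// an MVCC value header, but plain int64s are enough to show the comparison.
type value struct {
	versionTS int64 // commit (version) timestamp of the value
	localTS   int64 // leaseholder's clock reading when the value was written/resolved
}

// needsUncertaintyRestart reports whether a value above the reader's read
// timestamp must still be treated as possibly causally prior, i.e. force an
// uncertainty restart instead of being ignored. localLimit is the reader's
// observed timestamp from the leaseholder node; globalLimit is the read
// timestamp plus the maximum clock offset.
func needsUncertaintyRestart(v value, readTS, localLimit, globalLimit int64, useLocalTS bool) bool {
	if v.versionTS <= readTS {
		return false // visible to the read anyway
	}
	if v.versionTS > globalLimit {
		return false // definitely concurrent, not causally prior
	}
	ts := v.versionTS
	if useLocalTS {
		ts = v.localTS // compare the leaseholder-clock reading instead
	}
	return ts <= localLimit
}

func main() {
	// Numbers from the hazard discussed in this issue: the intent is written
	// at 100, the transaction refreshes and commits at 105 (a timestamp taken
	// from another node's clock), and the leaseholder's clock is still ~100
	// when the intent is resolved.
	v := value{versionTS: 105, localTS: 100}

	// A later transaction starts on that leaseholder with read timestamp 103
	// and an observed timestamp of 103 from the same node.
	readTS, localLimit, globalLimit := int64(103), int64(103), int64(103+500)

	// Version-timestamp comparison: 105 > 103, so the committed value is
	// (wrongly) treated as concurrent and the read returns stale data.
	fmt.Println(needsUncertaintyRestart(v, readTS, localLimit, globalLimit, false)) // false

	// Local-timestamp comparison: 100 <= 103, so the value is uncertain and
	// the reader restarts above 105, observing the committed write.
	fmt.Println(needsUncertaintyRestart(v, readTS, localLimit, globalLimit, true)) // true
}
```

The sketch only illustrates which timestamps are being compared; the production logic also has to handle logical clock components, lease transfers, and intent resolution, all of which are elided here.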
Together with @nvanbenschoten and @ajwerner we're speculating on the following hazard:
This is a stale read.
Are we missing anything? It seems that the "observed timestamps" mechanism (and the special case of no uncertainty on the gateway) does not play well with the fact that we started changing intents' timestamps (through the refresh mechanism). I guess even before refreshing, we used to change intents' timestamps in SNAPSHOT txns, back in the day.
@tbg @spencerkimball @bdarnell @ajwerner @nvanbenschoten @petermattis
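If it helps to see the suspected failure mode end to end, below is a rough sketch (toy clock and illustrative numbers, not real CockroachDB code) of the sequence in question, under the assumption that resolving the intent at its refreshed timestamp does not forward the leaseholder's clock:

```go
package main

import "fmt"

// A toy HLC-like node clock: it only moves forward when the node observes a
// higher timestamp.
type nodeClock struct{ wall int64 }

func (c *nodeClock) now() int64 { return c.wall }
func (c *nodeClock) forward(ts int64) {
	if ts > c.wall {
		c.wall = ts
	}
}

func main() {
	// Node A is the leaseholder for key k; its clock reads 100.
	a := &nodeClock{wall: 100}

	// txn1 writes an intent on k at 100, refreshes to 105 (a timestamp taken
	// from some other node's clock), and commits. The intent on node A is
	// resolved to a committed value at 105, but the worry here is that nothing
	// forwards node A's clock to 105 in the process.
	committedTS := int64(105)
	// a.forward(committedTS) // the step that appears to be missing

	// txn2 now starts with node A as its gateway, after txn1 has committed in
	// real time. Its read timestamp and its observed timestamp for node A both
	// come from A's clock, so both are below 105.
	readTS := a.now()
	observedA := a.now()

	// Observed-timestamp reasoning says: any value on node A that txn2 did not
	// causally follow must carry a timestamp above observedA. The committed
	// value at 105 therefore falls outside txn2's uncertainty window and is
	// skipped, even though it was committed before txn2 began.
	if committedTS > observedA && committedTS > readTS {
		fmt.Printf("txn2@%d skips committed value@%d on node A: stale read\n", readTS, committedTS)
	}
}
```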
Epic: CRDB-1514
Jira issue: CRDB-4498