Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: jepsen/multi-register/strobe-skews failed [likely #36431] #49360

Closed
cockroach-teamcity opened this issue May 21, 2020 · 60 comments · Fixed by #80706
Closed

roachtest: jepsen/multi-register/strobe-skews failed [likely #36431] #49360

cockroach-teamcity opened this issue May 21, 2020 · 60 comments · Fixed by #80706
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-kv KV Team X-nostale Marks an issue/pr that should be ignored by the stale bot
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 21, 2020

This is a long-standing test failure. We have reasonable confidence (read the thread below) that it is exposing the bug in #36431.

Jira issue: CRDB-4234

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 21, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone May 21, 2020
@knz
Copy link
Contributor

knz commented Jun 4, 2020

I looked at the jepsen error payload in there. Direct link:

https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1957823:id/jepsen/multi-register/strobe-skews/run_1/failure-logs.tbz

It looks like a legitimate failure - the jepsen test found an inconsistency.

@petermattis @bdarnell what's the operating procedure in this case?

@knz knz added S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jun 4, 2020
@bdarnell
Copy link
Contributor

bdarnell commented Jun 4, 2020

No special procedure - just like any other roachtest failure, assign it to someone, try to get a reproduction, etc.

@andreimatei
Copy link
Contributor

I thought that we were able to reproduce #36431 with this multi-register test (reproducing that issue is why the test was created in the first place). But now looking around, I'm not sure we ever got a repro (maybe @nvanbenschoten knows). Until, perhaps, now?

@nvanbenschoten
Copy link
Member

No, I don't think we ever did get a definitive reproduction of #36431. We thought we did at one point, but it turned out to be #40600.

We see in results.edn that timeline 204 detected a linearizability violation. We see in independent/204/results.edn that this timeline failed with the following invalid sequence of operations:

:final-paths
([{:op
   {:process 24,
    :type :ok,
    :f :txn,
    :value
    [[:write 2 4]
     [:write 4 8]
     [:write 3 9]
     [:write 0 5]
     [:write 1 3]],
    :index 12332,
    :time 29902845586},
   :model {2 4, 4 8, 3 9, 0 5, 1 3}}
  {:op
   {:process 25,
    :type :ok,
    :f :txn,
    :value [[:read 4 1]],
    :index 12346,
    :time 29927838442},
   :model {:msg "8≠1"}}]
 [{:op
   {:process 24,
    :type :ok,
    :f :txn,
    :value
    [[:write 2 4]
     [:write 4 8]
     [:write 3 9]
     [:write 0 5]
     [:write 1 3]],
    :index 12332,
    :time 29902845586},
   :model {2 4, 4 8, 3 9, 0 5, 1 3}}
  {:op
   {:process 24,
    :type :invoke,
    :f :txn,
    :value [[:write 2 7] [:write 4 6]],
    :index 12339,
    :time 29915064692},
   :model {2 7, 4 6, 3 9, 0 5, 1 3}}
  {:op
   {:process 25,
    :type :ok,
    :f :txn,
    :value [[:read 4 1]],
    :index 12346,
    :time 29927838442},
   :model {:msg "6≠1"}}]),

What we see here is that we have a read txn in process 25 that is observing a value of 1 in register 4. We have a write that happens before this read in process 24 and writes a value of 8 to register 4. We also have a write that happens concurrently to this read in process 24 and writes a value of 6 to register 4. So it would be valid for the read in process 25 to see 8 or 6 in register 4, but not 1. So this is a stale read.

We can see this more easily in independent/204/timeline.html:
Screen Shot 2020-06-08 at 12 48 18 PM
Circled in red, we see the two writes, either of which would be valid. Circled in green, we see the read. Circled in purple, we see the write that the read observes.

While we can't say definitively that this is #37394, there's also nothing that proves that it's not that issue. We're running the store-skews, which we expect will help trigger this bug. We also see that both ignored writes were second in their batch, meaning that they would have undergone async intent resolution (a necessary condition of #37394).

@tbg tbg self-assigned this Jun 16, 2020
@tbg
Copy link
Member

tbg commented Jun 16, 2020

I will take a look at reproducing this. Assuming this is met with any success, something we could do is provide a "hacky" fix to #37394 (i.e. not something active in production, but something that will avoid stale reads as seen above as a result of #37394 only in this test). That way, we can at least ensure that we don't have another bug here.

@tbg
Copy link
Member

tbg commented Jun 16, 2020

50 runs, 49 passes, 1 infra failure.

@tbg
Copy link
Member

tbg commented Jun 17, 2020

$ find ./artifacts/ -name invoke.log -execdir grep invalid {} ';'
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

This is ~400 runs in, so we have a reproduction rate of ~2%.

@tbg
Copy link
Member

tbg commented Jun 17, 2020

find artifacts/jepsen/multi-register/ -name results.edn -exec grep -l 'valid? false' {} ';' | grep independent
artifacts/jepsen/multi-register/strobe-skews/run_224/store/latest/independent/364/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_356/store/latest/independent/2581/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_287/store/latest/independent/334/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_74/store/latest/independent/65/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_33/store/latest/independent/841/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_124/store/latest/independent/248/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_235/store/latest/independent/460/results.edn
artifacts/jepsen/multi-register/strobe-skews/run_377/store/latest/independent/927/results.edn

Uploaded full failure logs and the above timeline directories to drive

@tbg
Copy link
Member

tbg commented Jun 17, 2020

Spot checked the 65 timeline and it looks like a stale read again:

image

the ops on the top left write 5 and then 8, but the strictly later read (circled) returns 5.

@tbg
Copy link
Member

tbg commented Jun 17, 2020

I'd like to kick off a similar run with enough of a fix to avoid #36431. @nvanbenschoten what did we think was best? Do we simply disable marking the local node as certain, i.e. this line:

cockroach/pkg/kv/txn.go

Lines 131 to 134 in a8ae1bf

// Ensure the gateway node ID is marked as free from clock offset.
if gatewayNodeID != 0 && typ == RootTxn {
proto.UpdateObservedTimestamp(gatewayNodeID, now)
}

@nvanbenschoten
Copy link
Member

I don't think that's sufficient to avoid #36431. I think we need to disable all use of observed timestamps to limit uncertainty intervals. I was thinking you could make limitTxnMaxTimestamp a no-op.

@tbg
Copy link
Member

tbg commented Jun 18, 2020

@tbg
Copy link
Member

tbg commented Jun 19, 2020

I got this in 3 out of ~400 runs tonight:

ERROR [2020-06-18 22:42:58,915] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
org.postgresql.util.PSQLException: ERROR: request timestamp 1592520170.916466340,0 less than PushTo timestamp 1592520171.366191399,1
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2458) ~[postgresql-9.4.1211.jar:9.4.1211]

On the plus side, the other ~397 runs were successful. So we are seeing compelling evidence that the stale read is only possible with limitMaxTxnTimestamp in place, i.e. it checks out that these repros are all #36431.

tbg added a commit to tbg/cockroach that referenced this issue Jun 19, 2020
I just saw this while trying to repro cockroachdb#49360.

Release note: None
@tbg
Copy link
Member

tbg commented Jun 19, 2020

Finished the 500 runs and same result. No more repros without observed timestamps.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 5, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.

NOTE: this commit does not yet handle mixed-version compatibility and is not
intended to merge for the v22.1 release.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key called the "local timestamp". The MVCC key's existing version
timestamp dictates the key's visibility to readers and is tied to the writer's
commit timestamp. The local clock timestamp records the value of the local HLC
clock on the leaseholder when the key was originally written. It is used to make
claims about the relative real time ordering of the key's writer and readers
when comparing a reader's uncertainty interval (and observed timestamps) to the
key. Ignoring edge cases, readers with an observed timestamp from the key's
leaseholder that is greater than the local clock timestamp stored in the key
cannot make claims about real time ordering and must consider it possible that
the key's write occurred before the read began. However, readers with an
observed timestamp from the key's leaseholder that is less than the clock
timestamp can claim that the reader captured that observed timestamp before the
key was written and therefore can consider the key's write to have been
concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the
MVCC key encoding, the commit optimizes the clock timestamp away in the common
case where it leads the version timestamp. Only in the rare cases where the
local timestamp trails the version timestamp (e.g. future-time writes, async
intent resolution with a new commit timestamp) does the local timestamp need to
be explicitly represented in the key encoding. This is possible because it is
safe for the local clock timestamp to be rounded down, as this will simply lead
to additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- observe TPC-C performance
-- top-line performance
-- uncertainty retry rate
-- commit-wait rate (should be zero)
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
@cockroach-teamcity

This comment was marked as off-topic.

@cockroach-teamcity

This comment was marked as off-topic.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 4, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.

NOTE: this commit does not yet handle mixed-version compatibility and is not
intended to merge for the v22.1 release.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key called the "local timestamp". The MVCC key's existing version
timestamp dictates the key's visibility to readers and is tied to the writer's
commit timestamp. The local clock timestamp records the value of the local HLC
clock on the leaseholder when the key was originally written. It is used to make
claims about the relative real time ordering of the key's writer and readers
when comparing a reader's uncertainty interval (and observed timestamps) to the
key. Ignoring edge cases, readers with an observed timestamp from the key's
leaseholder that is greater than the local clock timestamp stored in the key
cannot make claims about real time ordering and must consider it possible that
the key's write occurred before the read began. However, readers with an
observed timestamp from the key's leaseholder that is less than the clock
timestamp can claim that the reader captured that observed timestamp before the
key was written and therefore can consider the key's write to have been
concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the
MVCC key encoding, the commit optimizes the clock timestamp away in the common
case where it leads the version timestamp. Only in the rare cases where the
local timestamp trails the version timestamp (e.g. future-time writes, async
intent resolution with a new commit timestamp) does the local timestamp need to
be explicitly represented in the key encoding. This is possible because it is
safe for the local clock timestamp to be rounded down, as this will simply lead
to additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- observe TPC-C performance
-- top-line performance
-- uncertainty retry rate
-- commit-wait rate (should be zero)
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit that referenced this issue Apr 14, 2022
Fixes #36431.
Fixes #49360.
Replaces #72121.

NOTE: this commit does not yet handle mixed-version compatibility and is not
intended to merge for the v22.1 release.

This commit fixes the potential for a stale read as detailed in #36431 using the
"remember when intents were written" approach described in
#36431 (comment) and
later expanded on in
#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key called the "local timestamp". The MVCC key's existing version
timestamp dictates the key's visibility to readers and is tied to the writer's
commit timestamp. The local clock timestamp records the value of the local HLC
clock on the leaseholder when the key was originally written. It is used to make
claims about the relative real time ordering of the key's writer and readers
when comparing a reader's uncertainty interval (and observed timestamps) to the
key. Ignoring edge cases, readers with an observed timestamp from the key's
leaseholder that is greater than the local clock timestamp stored in the key
cannot make claims about real time ordering and must consider it possible that
the key's write occurred before the read began. However, readers with an
observed timestamp from the key's leaseholder that is less than the clock
timestamp can claim that the reader captured that observed timestamp before the
key was written and therefore can consider the key's write to have been
concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the
MVCC key encoding, the commit optimizes the clock timestamp away in the common
case where it leads the version timestamp. Only in the rare cases where the
local timestamp trails the version timestamp (e.g. future-time writes, async
intent resolution with a new commit timestamp) does the local timestamp need to
be explicitly represented in the key encoding. This is possible because it is
safe for the local clock timestamp to be rounded down, as this will simply lead
to additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### Future improvements

As noted in #72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- observe TPC-C performance
-- top-line performance
-- uncertainty retry rate
-- commit-wait rate (should be zero)
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 15, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key called the "local timestamp". The MVCC key's existing version
timestamp dictates the key's visibility to readers and is tied to the writer's
commit timestamp. The local clock timestamp records the value of the local HLC
clock on the leaseholder when the key was originally written. It is used to make
claims about the relative real time ordering of the key's writer and readers
when comparing a reader's uncertainty interval (and observed timestamps) to the
key. Ignoring edge cases, readers with an observed timestamp from the key's
leaseholder that is greater than the local clock timestamp stored in the key
cannot make claims about real time ordering and must consider it possible that
the key's write occurred before the read began. However, readers with an
observed timestamp from the key's leaseholder that is less than the clock
timestamp can claim that the reader captured that observed timestamp before the
key was written and therefore can consider the key's write to have been
concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the
MVCC key encoding, the commit optimizes the clock timestamp away in the common
case where it leads the version timestamp. Only in the rare cases where the
local timestamp trails the version timestamp (e.g. future-time writes, async
intent resolution with a new commit timestamp) does the local timestamp need to
be explicitly represented in the key encoding. This is possible because it is
safe for the local clock timestamp to be rounded down, as this will simply lead
to additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- observe TPC-C performance
-- top-line performance
-- uncertainty retry rate
-- commit-wait rate (should be zero)
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 28, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 4, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
craig bot pushed a commit that referenced this issue May 9, 2022
80706: kv/storage: introduce local timestamps for MVCC versions in MVCCValue r=sumeerbhola a=nvanbenschoten

Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.

**NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.**

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we [introduced](cockroachdb/jepsen#19) the `multi-register` test to the Jepsen test suite, and have since observed the test fail when combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

### Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[^1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction[^2] of transactions commit at a different timestamps than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written (its provisional value if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and therefore can consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.

For more, see the updates made in this commit to `pkg/kv/kvserver/observedts/doc.go`.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the key encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.

### MVCCValue

```go
type MVCCValue struct {
    MVCCValueHeader
    Value roachpb.Value
}
```

To store the local timestamp, the commit introduces a new `MVCCValue` type to parallel the `MVCCKey` type. `MVCCValue` wraps a `roachpb.Value` and extends it with MVCC-level metadata which is stored in an `enginepb.MVCCValueHeader` protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the `roachpb.Value` encoding.

To further avoid any performance hit, `MVCCValue` has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the mvcc value's encoding is identical to that of `roachpb.Value`. This provided backwards compatibility and ensures that the `MVCCValue` optimizes away in the common case. If the value's header is not empty, it is prepended to the roachpb.Value encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which is either the `roachpb.Value` tag (which has many possible values) or a sentinel tag not used by the `roachpb.Value` encoding which indicates the extended encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple" encoding are fast by avoiding heap allocations, memory copies, or function calls by exploiting mid-stack inlining. See microbenchmarks below.

### Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps` is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establish a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two `ClockTimestamps`, the version's local timestamp and the reader's observed timestamp.

### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called `TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

<details>
<summary><b>MVCCValue Encoding and Decoding Microbenchmarks</b></summary>
  
```
name                                                                   time/op
EncodeMVCCValue/header=empty/value=tombstone-10                        3.11ns ± 0%
EncodeMVCCValue/header=empty/value=short-10                            3.11ns ± 0%
EncodeMVCCValue/header=empty/value=long-10                             3.10ns ± 0%
EncodeMVCCValue/header=local_walltime/value=tombstone-10               38.9ns ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                   42.1ns ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     533ns ± 3%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10       40.5ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10           42.9ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             541ns ± 4%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10           7.81ns ± 1%
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10            0.93ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=false-10               8.39ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=true-10                1.55ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=false-10                8.38ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=true-10                 1.55ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10       32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10        22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10  32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10   22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10      32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10       22.7ns ± 0%

name                                                                   alloc/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00B
EncodeMVCCValue/header=empty/value=short-10                             0.00B
EncodeMVCCValue/header=empty/value=long-10                              0.00B
EncodeMVCCValue/header=local_walltime/value=tombstone-10                24.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    32.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                    4.86kB ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        24.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            32.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10            4.86kB ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00B
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00B
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00B
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00B

name                                                                   allocs/op
EncodeMVCCValue/header=empty/value=tombstone-10                          0.00
EncodeMVCCValue/header=empty/value=short-10                              0.00
EncodeMVCCValue/header=empty/value=long-10                               0.00
EncodeMVCCValue/header=local_walltime/value=tombstone-10                 1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                     1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                      1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10         1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10             1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10              1.00 ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10             0.00
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10              0.00
DecodeMVCCValue/header=empty/value=short/inline=false-10                 0.00
DecodeMVCCValue/header=empty/value=short/inline=true-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=false-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=true-10                   0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10         0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10          0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10    0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10     0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10        0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10         0.00
```
</details>

<details>
<summary><b>pkg/sql/tests End-To-End Microbenchmarks</b></summary>
  
```
name                                old time/op    new time/op    delta
KV/Delete/Native/rows=1-30             106µs ± 2%     104µs ± 2%  -1.38%  (p=0.012 n=8+10)
KV/Insert/SQL/rows=100-30             1.26ms ± 2%    1.24ms ± 2%  -1.25%  (p=0.029 n=10+10)
KV/Scan/SQL/rows=1-30                  281µs ± 2%     277µs ± 2%  -1.17%  (p=0.009 n=10+10)
KV/Insert/Native/rows=1-30             107µs ± 2%     107µs ± 2%    ~     (p=0.353 n=10+10)
KV/Insert/Native/rows=10-30            156µs ± 2%     155µs ± 4%    ~     (p=0.481 n=10+10)
KV/Insert/Native/rows=100-30           570µs ± 1%     576µs ± 2%    ~     (p=0.075 n=10+10)
KV/Insert/Native/rows=1000-30         4.18ms ± 2%    4.23ms ± 2%    ~     (p=0.143 n=10+10)
KV/Insert/Native/rows=10000-30        43.9ms ± 2%    44.1ms ± 2%    ~     (p=0.393 n=10+10)
KV/Insert/SQL/rows=1-30                412µs ± 2%     410µs ± 2%    ~     (p=0.280 n=10+10)
KV/Insert/SQL/rows=10-30               511µs ± 2%     508µs ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=1000-30            8.18ms ± 1%    8.11ms ± 2%    ~     (p=0.165 n=10+10)
KV/Insert/SQL/rows=10000-30           85.2ms ± 2%    84.7ms ± 2%    ~     (p=0.481 n=10+10)
KV/Update/Native/rows=1-30             163µs ± 2%     162µs ± 2%    ~     (p=0.436 n=10+10)
KV/Update/Native/rows=10-30            365µs ± 1%     365µs ± 1%    ~     (p=0.739 n=10+10)
KV/Update/Native/rows=10000-30         143ms ± 1%     144ms ± 2%    ~     (p=0.075 n=10+10)
KV/Update/SQL/rows=1-30                525µs ± 2%     522µs ± 3%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=10-30               815µs ± 1%     810µs ± 1%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=100-30             2.47ms ± 1%    2.48ms ± 1%    ~     (p=0.356 n=9+10)
KV/Update/SQL/rows=1000-30            16.6ms ± 1%    16.7ms ± 1%    ~     (p=0.065 n=9+10)
KV/Update/SQL/rows=10000-30            193ms ± 2%     195ms ± 3%    ~     (p=0.315 n=10+10)
KV/Delete/Native/rows=10-30            173µs ± 2%     171µs ± 3%    ~     (p=0.247 n=10+10)
KV/Delete/Native/rows=100-30           677µs ± 1%     674µs ± 0%    ~     (p=0.475 n=10+7)
KV/Delete/Native/rows=1000-30         5.20ms ± 2%    5.19ms ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=10000-30        53.7ms ± 1%    53.9ms ± 1%    ~     (p=0.684 n=10+10)
KV/Delete/SQL/rows=1-30                377µs ± 3%     375µs ± 1%    ~     (p=0.305 n=10+9)
KV/Delete/SQL/rows=10-30               503µs ± 1%     500µs ± 2%    ~     (p=0.123 n=10+10)
KV/Delete/SQL/rows=100-30             1.52ms ± 4%    1.51ms ± 2%    ~     (p=0.529 n=10+10)
KV/Delete/SQL/rows=1000-30            2.53ms ± 3%    2.53ms ± 3%    ~     (p=1.000 n=10+10)
KV/Delete/SQL/rows=10000-30           21.9ms ± 1%    21.8ms ± 1%    ~     (p=0.315 n=9+10)
KV/Scan/Native/rows=1-30              29.6µs ± 2%    29.8µs ± 1%    ~     (p=0.143 n=10+10)
KV/Scan/Native/rows=100-30            54.6µs ± 1%    55.0µs ± 2%    ~     (p=0.052 n=10+10)
KV/Scan/SQL/rows=10-30                 292µs ± 1%     290µs ± 1%    ~     (p=0.190 n=10+10)
KV/Scan/SQL/rows=100-30                364µs ± 2%     363µs ± 1%    ~     (p=0.684 n=10+10)
KV/Scan/SQL/rows=10000-30             5.34ms ± 2%    5.32ms ± 1%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30    2.59ms ± 1%    2.59ms ± 2%    ~     (p=0.842 n=9+10)
KVAndStorageUsingSQL/versions=4-30    2.57ms ± 3%    2.56ms ± 2%    ~     (p=0.684 n=10+10)
KVAndStorageUsingSQL/versions=8-30    2.60ms ± 3%    2.59ms ± 2%    ~     (p=0.315 n=10+10)
KV/Update/Native/rows=100-30          2.02ms ± 1%    2.04ms ± 1%  +0.95%  (p=0.015 n=10+10)
KV/Update/Native/rows=1000-30         16.2ms ± 2%    16.5ms ± 2%  +1.30%  (p=0.001 n=10+9)
KV/Scan/Native/rows=10-30             32.6µs ± 2%    33.0µs ± 3%  +1.39%  (p=0.019 n=10+10)
KV/Scan/Native/rows=1000-30            266µs ± 1%     270µs ± 1%  +1.51%  (p=0.000 n=9+10)
KV/Scan/SQL/rows=1000-30               982µs ± 1%     997µs ± 1%  +1.60%  (p=0.000 n=10+10)
KV/Scan/Native/rows=10000-30          2.18ms ± 1%    2.23ms ± 1%  +2.55%  (p=0.000 n=10+10)

name                                old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30            15.6kB ± 0%    15.6kB ± 0%    ~     (p=0.631 n=10+10)
KV/Insert/Native/rows=10000-30        34.7MB ± 2%    35.0MB ± 2%    ~     (p=0.089 n=10+10)
KV/Insert/SQL/rows=1-30               41.3kB ± 1%    41.4kB ± 1%    ~     (p=0.353 n=10+10)
KV/Insert/SQL/rows=10-30              90.8kB ± 1%    90.8kB ± 0%    ~     (p=0.945 n=10+8)
KV/Insert/SQL/rows=1000-30            5.69MB ± 1%    5.71MB ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=10000-30           69.6MB ± 1%    69.6MB ± 1%    ~     (p=0.853 n=10+10)
KV/Update/Native/rows=1-30            21.4kB ± 1%    21.4kB ± 0%    ~     (p=0.915 n=9+9)
KV/Update/Native/rows=10000-30        67.8MB ± 1%    68.1MB ± 2%    ~     (p=0.315 n=10+10)
KV/Update/SQL/rows=1-30               47.2kB ± 1%    47.3kB ± 1%    ~     (p=0.280 n=10+10)
KV/Update/SQL/rows=10-30               116kB ± 1%     117kB ± 1%    ~     (p=0.353 n=10+10)
KV/Update/SQL/rows=1000-30            5.20MB ± 2%    5.18MB ± 1%    ~     (p=0.278 n=10+9)
KV/Delete/Native/rows=10000-30        30.5MB ± 2%    30.5MB ± 1%    ~     (p=0.631 n=10+10)
KV/Delete/SQL/rows=1-30               42.9kB ± 0%    43.0kB ± 1%    ~     (p=0.393 n=10+10)
KV/Delete/SQL/rows=10-30              66.9kB ± 3%    66.5kB ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/SQL/rows=1000-30            1.19MB ± 0%    1.19MB ± 0%    ~     (p=0.447 n=9+10)
KV/Delete/SQL/rows=10000-30           14.3MB ± 1%    14.3MB ± 0%    ~     (p=0.353 n=10+10)
KV/Scan/Native/rows=1-30              7.17kB ± 0%    7.18kB ± 0%    ~     (p=0.061 n=10+9)
KV/Scan/Native/rows=10-30             8.59kB ± 0%    8.59kB ± 0%    ~     (p=0.926 n=10+10)
KV/Scan/Native/rows=100-30            21.2kB ± 0%    21.2kB ± 0%    ~     (p=0.781 n=10+10)
KV/Scan/Native/rows=1000-30            172kB ± 0%     172kB ± 0%    ~     (p=0.264 n=10+8)
KV/Scan/Native/rows=10000-30          1.51MB ± 0%    1.51MB ± 0%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=1-30                 19.3kB ± 0%    19.3kB ± 1%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=10-30                20.6kB ± 1%    20.7kB ± 0%    ~     (p=0.897 n=10+8)
KV/Scan/SQL/rows=100-30               32.8kB ± 1%    32.9kB ± 0%    ~     (p=0.400 n=10+9)
KV/Scan/SQL/rows=1000-30               185kB ± 0%     185kB ± 0%    ~     (p=0.247 n=10+10)
KV/Scan/SQL/rows=10000-30             1.10MB ± 0%    1.10MB ± 0%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30     793kB ± 3%     793kB ± 2%    ~     (p=0.796 n=10+10)
KVAndStorageUsingSQL/versions=4-30     788kB ± 3%     783kB ± 3%    ~     (p=0.353 n=10+10)
KVAndStorageUsingSQL/versions=8-30     787kB ± 3%     785kB ± 2%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=1-30            15.1kB ± 0%    15.2kB ± 0%  +0.26%  (p=0.029 n=10+10)
KV/Update/Native/rows=10-30           66.8kB ± 0%    67.0kB ± 0%  +0.38%  (p=0.029 n=10+10)
KV/Insert/SQL/rows=100-30              550kB ± 1%     553kB ± 1%  +0.44%  (p=0.043 n=10+10)
KV/Update/SQL/rows=100-30              578kB ± 1%     580kB ± 1%  +0.46%  (p=0.019 n=10+10)
KV/Update/Native/rows=100-30           503kB ± 0%     506kB ± 0%  +0.59%  (p=0.000 n=10+10)
KV/Update/SQL/rows=10000-30            153MB ± 1%     154MB ± 1%  +0.62%  (p=0.006 n=9+10)
KV/Delete/SQL/rows=100-30              306kB ± 0%     308kB ± 0%  +0.67%  (p=0.000 n=10+8)
KV/Delete/Native/rows=10-30           36.1kB ± 1%    36.4kB ± 1%  +0.82%  (p=0.001 n=10+9)
KV/Update/Native/rows=1000-30         4.75MB ± 1%    4.79MB ± 1%  +0.83%  (p=0.002 n=10+10)
KV/Insert/Native/rows=10-30           39.7kB ± 1%    40.1kB ± 1%  +0.83%  (p=0.023 n=10+10)
KV/Insert/Native/rows=1000-30         2.56MB ± 1%    2.58MB ± 1%  +1.00%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-30           279kB ± 1%     282kB ± 1%  +1.02%  (p=0.007 n=10+10)
KV/Delete/Native/rows=100-30           243kB ± 1%     246kB ± 1%  +1.07%  (p=0.000 n=10+10)
KV/Delete/Native/rows=1000-30         2.23MB ± 1%    2.26MB ± 1%  +1.33%  (p=0.000 n=10+9)

name                                old allocs/op  new allocs/op  delta
KV/Update/SQL/rows=1000-30             58.6k ± 0%     58.4k ± 0%  -0.28%  (p=0.000 n=9+9)
KV/Scan/SQL/rows=10-30                   277 ± 0%       276 ± 0%  -0.25%  (p=0.001 n=8+10)
KV/Update/SQL/rows=10000-30             740k ± 0%      739k ± 0%  -0.12%  (p=0.040 n=8+7)
KV/Insert/Native/rows=1-30               129 ± 0%       129 ± 0%    ~     (all equal)
KV/Insert/Native/rows=10-30              272 ± 0%       272 ± 0%    ~     (all equal)
KV/Insert/Native/rows=1000-30          14.3k ± 0%     14.3k ± 0%    ~     (p=0.221 n=10+10)
KV/Insert/Native/rows=10000-30          141k ± 0%      141k ± 0%    ~     (p=0.356 n=10+9)
KV/Insert/SQL/rows=1-30                  349 ± 0%       348 ± 0%    ~     (p=0.776 n=9+10)
KV/Insert/SQL/rows=10-30                 625 ± 0%       625 ± 0%    ~     (p=0.068 n=9+8)
KV/Insert/SQL/rows=100-30              3.12k ± 0%     3.12k ± 0%    ~     (p=0.351 n=10+9)
KV/Insert/SQL/rows=1000-30             29.3k ± 0%     29.3k ± 0%    ~     (p=0.644 n=9+10)
KV/Insert/SQL/rows=10000-30             294k ± 0%      293k ± 0%    ~     (p=0.134 n=9+7)
KV/Update/Native/rows=1-30               181 ± 0%       181 ± 0%    ~     (all equal)
KV/Update/Native/rows=10-30              458 ± 0%       458 ± 0%    ~     (all equal)
KV/Update/Native/rows=1000-30          26.9k ± 0%     27.0k ± 0%    ~     (p=0.232 n=9+10)
KV/Update/Native/rows=10000-30          273k ± 0%      274k ± 0%    ~     (p=0.315 n=9+10)
KV/Update/SQL/rows=1-30                  503 ± 0%       503 ± 0%    ~     (p=1.000 n=10+10)
KV/Update/SQL/rows=10-30                 904 ± 1%       905 ± 0%    ~     (p=0.223 n=10+10)
KV/Update/SQL/rows=100-30              6.79k ± 0%     6.79k ± 0%    ~     (p=0.669 n=10+10)
KV/Delete/Native/rows=1-30               128 ± 0%       128 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10-30              248 ± 0%       248 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10000-30          112k ± 0%      112k ± 0%    ~     (p=0.825 n=10+9)
KV/Delete/SQL/rows=1-30                  331 ± 0%       331 ± 0%    ~     (all equal)
KV/Delete/SQL/rows=10-30                 512 ± 0%       512 ± 0%    ~     (p=0.406 n=8+10)
KV/Delete/SQL/rows=100-30              2.01k ± 0%     2.01k ± 0%    ~     (p=0.947 n=10+8)
KV/Delete/SQL/rows=1000-30             7.51k ± 0%     7.51k ± 0%    ~     (p=0.204 n=9+10)
KV/Delete/SQL/rows=10000-30            71.2k ± 0%     71.2k ± 0%    ~     (p=0.986 n=9+10)
KV/Scan/Native/rows=1-30                54.0 ± 0%      54.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=10-30               58.0 ± 0%      58.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=100-30              62.3 ± 1%      62.3 ± 1%    ~     (p=1.000 n=10+10)
KV/Scan/Native/rows=1000-30             72.3 ± 1%      72.0 ± 0%    ~     (p=0.137 n=10+8)
KV/Scan/Native/rows=10000-30             115 ± 1%       115 ± 1%    ~     (p=0.193 n=9+10)
KV/Scan/SQL/rows=1-30                    242 ± 0%       242 ± 0%    ~     (p=0.828 n=9+10)
KV/Scan/SQL/rows=100-30                  648 ± 0%       648 ± 0%    ~     (all equal)
KV/Scan/SQL/rows=1000-30               5.04k ± 0%     5.04k ± 0%    ~     (p=1.000 n=10+10)
KV/Scan/SQL/rows=10000-30              50.3k ± 0%     50.3k ± 0%    ~     (p=0.208 n=10+10)
KVAndStorageUsingSQL/versions=2-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.223 n=8+10)
KVAndStorageUsingSQL/versions=4-30     7.61k ± 0%     7.60k ± 0%    ~     (p=0.643 n=10+10)
KVAndStorageUsingSQL/versions=8-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.888 n=10+9)
KV/Update/Native/rows=100-30           2.76k ± 0%     2.76k ± 0%  +0.03%  (p=0.044 n=10+10)
KV/Delete/Native/rows=1000-30          11.3k ± 0%     11.3k ± 0%  +0.03%  (p=0.006 n=10+9)
KV/Insert/Native/rows=100-30           1.57k ± 0%     1.57k ± 0%  +0.06%  (p=0.000 n=8+7)
KV/Delete/Native/rows=100-30           1.27k ± 0%     1.28k ± 0%  +0.08%  (p=0.000 n=8+8)
```
</details>

<details>
<summary><b>YCSB benchmark suite</b></summary>
  
```
name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.2k ± 6%   15.4k ± 3%    ~     (p=0.579 n=10+10)
ycsb/B/nodes=3          28.7k ± 4%   28.9k ± 1%    ~     (p=0.780 n=10+9)
ycsb/C/nodes=3          41.3k ± 4%   41.4k ± 2%    ~     (p=0.529 n=10+10)
ycsb/D/nodes=3          33.9k ± 2%   33.5k ± 2%  -1.37%  (p=0.003 n=9+9)
ycsb/E/nodes=3          2.10k ± 2%   2.10k ± 1%    ~     (p=0.813 n=10+7)
ycsb/F/nodes=3          7.06k ± 5%   7.05k ± 4%    ~     (p=0.853 n=10+10)
ycsb/A/nodes=3/cpu=32   30.0k ± 9%   31.4k ± 3%    ~     (p=0.095 n=10+9)
ycsb/B/nodes=3/cpu=32   97.6k ± 2%   98.1k ± 2%    ~     (p=0.237 n=8+10)
ycsb/C/nodes=3/cpu=32    129k ± 2%    130k ± 1%    ~     (p=0.243 n=10+9)
ycsb/D/nodes=3/cpu=32    105k ± 2%    106k ± 1%  +1.06%  (p=0.034 n=10+8)
ycsb/E/nodes=3/cpu=32   3.33k ± 1%   3.33k ± 1%    ~     (p=0.951 n=10+9)
ycsb/F/nodes=3/cpu=32   15.5k ±10%   16.4k ± 5%  +5.35%  (p=0.029 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.32 ± 6%    6.23 ± 3%    ~     (p=0.635 n=10+10)
ycsb/B/nodes=3           5.02 ± 4%    4.97 ± 3%    ~     (p=0.542 n=10+10)
ycsb/C/nodes=3           3.49 ± 3%    3.50 ± 0%    ~     (p=0.443 n=10+9)
ycsb/D/nodes=3           2.80 ± 0%    2.87 ± 2%  +2.50%  (p=0.008 n=8+10)
ycsb/E/nodes=3           45.8 ± 1%    45.8 ± 1%    ~     (p=0.718 n=10+7)
ycsb/F/nodes=3           13.5 ± 3%    13.6 ± 4%    ~     (p=0.509 n=9+10)
ycsb/A/nodes=3/cpu=32    4.82 ±10%    4.62 ± 6%    ~     (p=0.092 n=10+10)
ycsb/B/nodes=3/cpu=32    2.00 ± 0%    1.96 ± 3%    ~     (p=0.065 n=9+10)
ycsb/C/nodes=3/cpu=32    1.50 ± 0%    1.50 ± 0%    ~     (all equal)
ycsb/D/nodes=3/cpu=32    1.40 ± 0%    1.36 ± 4%    ~     (p=0.065 n=9+10)
ycsb/E/nodes=3/cpu=32    43.3 ± 1%    43.3 ± 0%    ~     (p=0.848 n=10+9)
ycsb/F/nodes=3/cpu=32    9.30 ±11%    8.79 ± 5%  -5.48%  (p=0.022 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           52.0 ±17%    48.4 ±18%    ~     (p=0.379 n=10+10)
ycsb/B/nodes=3           32.2 ±11%    32.2 ± 9%    ~     (p=0.675 n=10+10)
ycsb/C/nodes=3           13.8 ± 6%    13.8 ± 3%    ~     (p=1.000 n=10+10)
ycsb/D/nodes=3           12.6 ± 0%    12.9 ± 2%  +2.38%  (p=0.023 n=8+10)
ycsb/E/nodes=3            200 ± 8%     197 ± 6%    ~     (p=0.375 n=10+8)
ycsb/F/nodes=3            124 ±19%     125 ±21%    ~     (p=0.856 n=9+10)
ycsb/A/nodes=3/cpu=32    68.1 ±17%    63.9 ± 5%    ~     (p=0.211 n=10+8)
ycsb/B/nodes=3/cpu=32    15.7 ± 0%    15.5 ± 2%    ~     (p=0.071 n=6+10)
ycsb/C/nodes=3/cpu=32    10.5 ± 0%    10.2 ± 3%  -2.86%  (p=0.003 n=8+10)
ycsb/D/nodes=3/cpu=32    8.70 ± 3%    8.40 ± 0%  -3.45%  (p=0.003 n=10+8)
ycsb/E/nodes=3/cpu=32     151 ± 0%     151 ± 0%    ~     (all equal)
ycsb/F/nodes=3/cpu=32     148 ±15%     145 ±10%    ~     (p=0.503 n=9+10)
```
</details>

----

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 24c56df May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-kv KV Team X-nostale Marks an issue/pr that should be ignored by the stale bot
Projects
None yet
8 participants