
roachpb/storage: introduce localUncertaintyLimit, teach MVCC about synthetic timestamps #57077

Merged

Conversation


@nvanbenschoten nvanbenschoten commented Nov 24, 2020

Informs #52745.
Informs #36431.

This PR introduces a new localUncertaintyLimit parameter to MVCC. The PR also renames the existing Txn.MaxTimestamp to Txn.GlobalUncertaintyLimit.

localUncertaintyLimit is introduced alongside GlobalUncertaintyLimit for two reasons:

  1. it allows us to stop modifying global_uncertainty_limit (formerly max_timestamp) when applying an observed_timestamp to a transaction during evaluation on a replica. Instead, we can keep the localUncertaintyLimit directly on the stack, which better represents its lifetime.
  2. it allows MVCC logic to distinguish between the pre- and post-observed timestamp uncertainty interval, depending on whether a possibly-uncertain value has a standard or synthetic timestamp.

The field is defined as follows:

LocalUncertaintyLimit is the transaction's GlobalUncertaintyLimit, reduced
by any observed timestamp that the transaction has acquired from the node
that holds the lease for the range that the transaction is evaluating on.

The local uncertainty limit can reduce the uncertainty interval applied
to most values on a range. This allows values that would otherwise be
considered uncertain under the original global uncertainty limit to be
considered "certainly concurrent" with, and thus not causally related
to, the transaction, due to observed timestamps.

However, the local uncertainty limit does not apply to all values on a
range. Specifically, values with "synthetic timestamps" must use the
transaction's original global uncertainty limit for the purposes of
uncertainty, because observed timestamps do not apply to values with
synthetic timestamps.
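
To make the two limits concrete, here is a minimal, self-contained Go sketch of the rules described above. The `Timestamp` struct and the function names are simplified illustrations of the behavior, not the actual `hlc` or MVCC API:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp. The Synthetic
// flag marks timestamps that make no claim about real (clock) time.
type Timestamp struct {
	WallTime  int64
	Synthetic bool
}

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// localUncertaintyLimit is the global uncertainty limit, reduced by an
// observed timestamp from the leaseholder's node, if one exists. It can
// only lower the limit, never raise it.
func localUncertaintyLimit(global, observed Timestamp, hasObserved bool) Timestamp {
	if hasObserved && observed.Less(global) {
		return observed
	}
	return global
}

// isUncertain reports whether a value at valueTs is uncertain to a read
// at readTs. Values with synthetic timestamps are checked against the
// global limit, because observed timestamps say nothing about them.
func isUncertain(readTs, valueTs, globalLimit, localLimit Timestamp) bool {
	if !readTs.Less(valueTs) {
		return false // at or below the read timestamp: visible, not uncertain
	}
	if valueTs.Synthetic {
		return !globalLimit.Less(valueTs)
	}
	return !localLimit.Less(valueTs)
}

func main() {
	global := Timestamp{WallTime: 150}
	local := localUncertaintyLimit(global, Timestamp{WallTime: 120}, true)
	read := Timestamp{WallTime: 100}
	// Non-synthetic value above the local limit: certainly concurrent.
	fmt.Println(isUncertain(read, Timestamp{WallTime: 130}, global, local)) // false
	// The same value with a synthetic timestamp must use the global limit: uncertain.
	fmt.Println(isUncertain(read, Timestamp{WallTime: 130, Synthetic: true}, global, local)) // true
}
```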

@nvanbenschoten nvanbenschoten requested a review from tbg November 24, 2020 19:17
@nvanbenschoten
Member Author

The more I think about this, the more I'm open to bikeshedding the observed_max_timestamp name. Or to keeping it. But feel free to poke holes in the name if you find it ugly.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/observedts/uncertainty.go, line 15 at r5 (raw file):

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// IsUncertain deterines whether a value with the provided timestamp is

determines


pkg/roachpb/data.proto, line 361 at r3 (raw file):

  // holds the lease for the range that the transaction is evaluating on. As
  // such, the field is only set on the server in the context of an evaluating
  // request on a specific range.

Will this field ever be set in a stored Transaction proto, or used in an RPC?
Or is this just a convenient place to avoid plumbing through an extra parameter across function calls?


pkg/storage/pebble_mvcc_scanner.go, line 656 at r2 (raw file):

		return p.advanceKeyAtNewKey(origKey)
	}
	if p.curKey.Timestamp.LessEq(ts) {

is it worth keeping this check as an assertion, in case there is data corruption?

Collaborator

@sumeerbhola sumeerbhola left a comment


I've only read parts of this PR. Flushing a couple more comments.

Reviewed 1 of 7 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/pebble_mvcc_scanner.go, line 397 at r4 (raw file):

		// Note that if we own the intent (i.e. we're reading transactionally)
		// we want to read the intent regardless of our read timestamp and fall
		// into case 8 below.

Since !otherIntentVisible, we have !p.checkUncertainty || !p.isUncertainValue(metaTS). I am trying to understand this if-block. It would be executed when !p.isUncertainValue(metaTS). Ah, I guess there could be committed values that are < MaxTimestamp. Could you expand a bit on the code comment to explain this.


pkg/storage/pebble_mvcc_scanner.go, line 656 at r4 (raw file):

				return p.addAndAdvance(p.curRawKey, p.curValue)
			}
			// Iterate through uncertainty interval.

IIUC, this is needed because not everything in the interval (p.ts, seekTS] is an uncertainty error: since we are doing seeks based on the worst-case uncertainty, we need to look at the synthetic timestamp bit on what we find to determine uncertainty. Could you add a longer comment here?

The loop below could also use a comment.

@tbg tbg requested a review from sumeerbhola December 2, 2020 13:41
Member

@tbg tbg left a comment


This looks good. Truth be told I didn't scrutinize the test cases, but gave a close review of the rest. Thanks for the clean commit history, that helped. The MVCC changes can definitely benefit from extensive commentary; the mvcc scanner is super dense.

Reviewed 31 of 31 files at r1, 1 of 1 files at r2, 8 of 8 files at r3, 7 of 7 files at r4, 6 of 6 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/observedts/uncertainty.go, line 26 at r4 (raw file):

//
// NOTE: if observedMaxTs is not empty, it must be <= maxTs.
func IsUncertain(maxTs, observedMaxTs, valueTs hlc.Timestamp) bool {

Name bikeshedding thoughts: While we're considering a rename here, we may also want to ditch the "MaxTimestamp" nomenclature altogether. It doesn't tell the uninitiated anything! This is the upper boundary of the uncertainty interval. This is too much of a mouthful for a field name, but how about UncertaintyTS?

We could then name the adjusted value OrganicUncertaintyTS ("organic" being the opposite of "synthetic"). The idea being, it's better to use this one, but you can only use it when encountering an organic, i.e. non-synthetic, timestamp.
If that doesn't resonate, AdjustedUncertaintyTS would be another option, or ReducedUncertaintyTS. Or extra descriptive with ReducedOrganicUncertaintyTS.


pkg/roachpb/data.go, line 1476 at r3 (raw file):

	// timestamp from the node.
	//
	// TODO(nvanbenschoten): how is this supposed to work for follower reads?

Follower reads don't hit that today, but would hit it with non-blocking txns, right?


pkg/roachpb/data.proto, line 342 at r3 (raw file):

  //
  // Reads which encounter values with timestamps between read_timestamp and
  // max_timestamp ("within the uncertainty interval") trigger a retry error

while you're massaging this, comment on whether the endpoints are inclusive.


pkg/roachpb/data.proto, line 357 at r3 (raw file):

  // been pushed; in this case, max_timestamp should be ignored.
  util.hlc.Timestamp max_timestamp = 7 [(gogoproto.nullable) = false];
  // The transaction's observed max timestamp is its max timestamp, reduce by

reduced


pkg/roachpb/data.proto, line 361 at r3 (raw file):

Previously, sumeerbhola wrote…

Will this field ever be set in a stored Transaction proto, or used in an RPC?
Or is this just a convenient place to avoid plumbing through an extra parameter across function calls?

It's just a convenient place for plumbing. I'd wager that Nathan dislikes this just as much as I do but doesn't want to get in the business of plumbing an auxiliary struct all the way down. I wonder though if now is a good time to do it. The change to avoid mutating MaxTimestamp in place is good, but introducing a bogus field smells. I'm on the fence. If we wrap the batch in a (non-proto) struct that can hold additional fields, we'll catch some allocations. Sure, we can amortize those easily over a sync.Pool, but still.
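
For illustration, a hypothetical sketch of the non-proto wrapper floated above; every name here is invented, and this is not what the PR implements:

```go
package main

import "fmt"

// batchRequest stands in for the proto-generated request type.
type batchRequest struct{ txnName string }

// evalContext is the hypothetical auxiliary struct: it wraps the batch
// with evaluation-scoped state, so nothing bogus lives in the proto.
type evalContext struct {
	ba                    *batchRequest
	localUncertaintyLimit int64 // exists only for this evaluation
}

func evaluate(ctx evalContext) {
	fmt.Printf("evaluating %s with local uncertainty limit %d\n",
		ctx.ba.txnName, ctx.localUncertaintyLimit)
}

func main() {
	evaluate(evalContext{ba: &batchRequest{txnName: "txn1"}, localUncertaintyLimit: 120})
}
```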


pkg/roachpb/data.proto, line 366 at r3 (raw file):

  // most values on a range. This can lead to values that would otherwise be
  // considered uncertain by the original max_timestamp to be considered
  // "certainly concurrent" with the transaction due to observed timestamps.

, and thus not causally related.


pkg/storage/pebble_mvcc_scanner.go, line 279 at r4 (raw file):

// Returns whether a value with the specified timestamp is within the
// transaction's uncertainty interval and is considered uncertainty.

considered uncertain


pkg/storage/pebble_mvcc_scanner.go, line 397 at r4 (raw file):

		// Note that if we own the intent (i.e. we're reading transactionally)
		// we want to read the intent regardless of our read timestamp and fall
		// into case 8 below.

also the case 8 reference is confusing because this is case 8.


pkg/storage/pebble_mvcc_scanner.go, line 397 at r4 (raw file):

Previously, sumeerbhola wrote…

Since !otherIntentVisible, we have !p.checkUncertainty || !p.isUncertainValue(metaTS). I am trying to understand this if-block. It would be executed when !p.isUncertainValue(metaTS). Ah, I guess there could be committed values that are < MaxTimestamp. Could you expand a bit on the code comment to explain this.


pkg/storage/pebble_mvcc_scanner.go, line 656 at r4 (raw file):

Previously, sumeerbhola wrote…

IIUC, this is needed because not everything in the interval (p.ts, seekTS] is an uncertainty error: since we are doing seeks based on the worst-case uncertainty, we need to look at the synthetic timestamp bit on what we find to determine uncertainty. Could you add a longer comment here?

The loop below could also use a comment.

➕ I'm pretty sure any time spent on these comments will pay off handsomely for future generations
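
For future readers, a self-contained sketch of the behavior being discussed, with invented names rather than the real pebbleMVCCScanner internals: seeks cover the worst-case window (readTs, globalLimit], but each version found must still be classified individually, because a non-synthetic version above the local limit is certain rather than uncertain.

```go
package main

import "fmt"

// version is an illustrative MVCC version: a wall time plus the
// synthetic flag carried by its timestamp.
type version struct {
	ts        int64
	synthetic bool
}

// firstUncertain walks versions found in the worst-case window
// (readTs, globalLimit] that a seek covers, and returns the first one
// that is genuinely uncertain, if any.
func firstUncertain(versions []version, readTs, localLimit, globalLimit int64) (version, bool) {
	for _, v := range versions {
		if v.ts <= readTs || v.ts > globalLimit {
			continue // outside the worst-case uncertainty window
		}
		limit := localLimit
		if v.synthetic {
			limit = globalLimit // observed timestamps don't apply
		}
		if v.ts <= limit {
			return v, true // a real uncertainty error
		}
	}
	return version{}, false
}

func main() {
	// Read at 100, local limit 120, global limit 150.
	versions := []version{{ts: 130}, {ts: 140, synthetic: true}}
	v, ok := firstUncertain(versions, 100, 120, 150)
	fmt.Println(v, ok) // {140 true} true: only the synthetic version is uncertain
}
```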

@nvanbenschoten nvanbenschoten added the A-multiregion Related to multi-region label Dec 2, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 29, 2020
This commit splits off a new hlc.ClockTimestamp type from the existing
hlc.Timestamp type through a type alias. While the two types share the same
memory and proto representation, they have different purposes and properties.
Timestamp serves the role of representing an arbitrary timestamp, one that makes
no claim about real time. ClockTimestamp serves the role of representing a real
timestamp pulled from one of the HLC clocks in the system. Because of this, it
has the added capability of being able to update a peer's HLC clock. As such, a
clock timestamp is a promise that some node in the system has a clock with a
reading equal to or above its value.

The commit also pushes towards a world where the ClockTimestamp specialization
is maintained through a combination of static and dynamic typing. While the
primary mechanisms that use ClockTimestamps will use static typing, Timestamp
will also carry a bit indicating whether it can be downcast to a ClockTimestamp.
This bit will replace the current synthetic flag. So instead of an interaction
like the one introduced in cockroachdb#57077 checking whether the value has a "synthetic
timestamp", it will instead check whether the value has a "clock timestamp".

This serves as an alternative to an approach like the one in cockroachdb#58213, where we
split the types in the other direction, keeping Timestamp to represent a clock
timestamp and introduce a new enginepb.TxnTimestamp to represent an arbitrary
MVCC timestamp. That change resulted in a significantly larger diff, as it
misjudged the extremely small percent of all Timestamp usages which care about
the capability of updating remote HLC clocks.

The current approach to the synthetic flag on Timestamps also had issues which
are avoided by the inversion of the flag's meaning as from_clock (see later
commit). Specifically, while inverting the flag optimizes the encoded size of
non-clock (currently synthetic) timestamps at the expense of the encoded size of
clock timestamps by 2 bytes, it comes with major benefits that outweigh this
cost. By making clock timestamps opt-in instead of opt-out, we more closely
match the capability model we're trying to establish, where a clock timestamp
can do everything a normal timestamp can, but can also be used to update an HLC
clock. The opt-in nature mitigates the risk of bugs that forget to set this flag
correctly. Instead of risking a capability escalation where a non-clock
timestamp is incorrectly interpreted as a clock timestamp and used to update an
HLC clock, we risk a much less harmful capability de-escalation where a clock
timestamp loses its ability to update an HLC clock. We can then much more
carefully audit the cases where the flag needs to be unset, such as in the
Timestamp.Add and Timestamp.Forward methods.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 30, 2020
This commit splits off a new hlc.ClockTimestamp type from the existing
hlc.Timestamp type through a type alias. While the two types share the
same memory and proto representation, they have different purposes and
properties. Timestamp serves the role of representing an arbitrary
timestamp, one that makes no claim about real time. ClockTimestamp
serves the role of representing a real timestamp pulled from one of the
HLC clocks in the system. Because of this, it has the added capability
to update a peer's HLC clock. As such, a clock timestamp is a promise
that some node in the system has a clock with a reading equal to or
above its value.

The commit also pushes towards a world where the ClockTimestamp
specialization is maintained through a combination of static and dynamic
typing. While the primary mechanisms that use ClockTimestamps will use
static typing, Timestamp will also carry a bit indicating whether it can
be downcast to a ClockTimestamp. This bit will replace the current flag
structure. So instead of an interaction like the one introduced in cockroachdb#57077
checking whether the value has a "synthetic" flag set, it will instead
check whether the value has a "clock timestamp".

This serves as an alternative to an approach like the one in cockroachdb#58213,
where we split the types in the other direction, keeping Timestamp to
represent a clock timestamp and introduce a new enginepb.TxnTimestamp to
represent an arbitrary MVCC timestamp. That change resulted in a
significantly larger diff, as it misjudged the extremely small percent
of all Timestamp usages which care about the capability of updating
remote HLC clocks. It also forced all users (tests, etc.) of timestamps
to address this question front-and-center, which had benefits but was
also a burden on all uses of timestamps.

The original intention of this change was to follow it up by inverting
the synthetic flag on Timestamps, replacing an "is_synthetic" bit with a
"from_clock" bit. While inverting the flag optimized the encoded size of
non-clock (currently synthetic) timestamps at the expense of the encoded
size of clock timestamps by 2 bytes, it came with major benefits. By
making clock timestamps opt-in instead of opt-out, we more closely match
the capability model we're trying to establish here with static typing,
where a clock timestamp can do everything a normal timestamp can, but
can also be used to update an HLC clock. The opt-in nature mitigated the
risk of bugs that forget to set this flag correctly. Instead of risking
a capability escalation where a non-clock timestamp is incorrectly
interpreted as a clock timestamp and used to update an HLC clock, we
risked a much less harmful capability de-escalation where a clock
timestamp loses its ability to update an HLC clock. We could then much
more carefully audit the cases where the flag needs to be unset, such as
in the Timestamp.Add and Timestamp.Forward methods.

Unfortunately, this "from_clock" approach (attempted in 4e34e20) came with
serious complications as well, which became clear only after making the
change. Chiefly, the approach made the mixed-version migration of this
field significantly more difficult. With v20.2 nodes unaware of the
flag, v21.1 nodes would need to find a way to either auto-assign it for
all timestamps coming from v21.1 nodes, or to work around its absence.
But this latter idea touches on a second complication – the absence of
the field resulted (intentionally) in an MVCC encoding that v20.2 nodes
would be unable to decode. So v21.1 nodes would need to be very careful
to always set the "from_clock" flag on all timestamps when in a mixed
version cluster. But this all made the migration towards not setting the
flag once in a v21.1-only cluster even more difficult. In the end,
opting into "is_synthetic" is a lot easier to work with than opting into
"from_clock", so that's how the rest of this PR will operate.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 30, 2020
Adapted from cockroachdb#57077.

Also switch over TestIntentInterleavingIter and TestIntentDemuxWriter.

No need to re-write the parsing logic again. We'll also want this to use
flags in the next commit.

This causes a large diff because all nanos are now considered seconds.
This doesn't actually change any behavior in the tests themselves.
craig bot pushed a commit that referenced this pull request Jan 5, 2021
58349: hlc: strongly type ClockTimestamp as specialization of Timestamp r=nvanbenschoten a=nvanbenschoten

Closes #57684 - no need to panic, type system now protects us.

This PR splits off a new `hlc.ClockTimestamp` type from the existing `hlc.Timestamp` type through a type alias. While the two types share the same memory and proto representation, they have different purposes and properties. `Timestamp` serves the role of representing an arbitrary timestamp, one that makes no claim about real-time. `ClockTimestamp` serves the role of representing a real timestamp pulled from one of the HLC clocks in the system. Because of this, it has the added capability to update a peer's HLC clock. As such, a clock timestamp is a promise that some node in the system has a clock with a reading equal to or above its value.

The PR also moves to a world where the `ClockTimestamp` specialization is maintained through a combination of static and dynamic typing. While the primary mechanisms that use `ClockTimestamps` will use static typing, Timestamp will also carry a bit indicating whether it can be downcast to a `ClockTimestamp`. This bit will replace the current flag structure. So instead of an interaction like the one introduced in #57077 checking whether the value has an arbitrary "synthetic" flag set, it will instead check whether the value has a "clock timestamp" (same mechanism, but slightly more consistent meaning).

This serves as an alternative to an approach like the one in #58213, where we split the types in the other direction, keeping `Timestamp` to represent a clock timestamp and introduce a new `enginepb.TxnTimestamp` to represent an arbitrary MVCC timestamp. That change resulted in a significantly larger diff, as it misjudged the extremely small percent of all `Timestamp` usages which care about the capability of updating remote HLC clocks. It also forced all users (tests, etc.) of timestamps to address this question front-and-center, which had benefits but was also a burden on all uses of timestamps.

The original intention of this change was to follow it up by inverting the synthetic flag on `Timestamp`, replacing an "is_synthetic" bit with a "from_clock" bit. While inverting the flag optimized the encoded size of non-clock (currently synthetic) timestamps at the expense of the encoded size of clock timestamps by 2 bytes, it came with major benefits. By making clock timestamps opt-in instead of opt-out, we more closely match the capability model we're trying to establish here with static typing, where a clock timestamp can do everything a normal timestamp can, but can also be used to update an HLC clock. The opt-in nature mitigated the risk of bugs that forget to set this flag correctly. Instead of risking a capability escalation where a non-clock timestamp is incorrectly interpreted as a clock timestamp and used to update an HLC clock, we risked a much less harmful capability de-escalation where a clock timestamp loses its ability to update an HLC clock. We could then much more carefully audit the cases where the flag needs to be unset, such as in the `Timestamp.Add` and `Timestamp.Forward` methods.

Unfortunately, this "from_clock" approach (attempted in 4e34e20) came with serious complications as well, which became clear only after making the change. Chiefly, the approach made the mixed-version migration of this field significantly more difficult. With v20.2 nodes unaware of the flag, v21.1 nodes would need to find a way to either auto-assign it for all timestamps coming from v21.1 nodes, or to work around its absence. But this latter idea touches on a second complication – the absence of the field resulted (intentionally) in an MVCC encoding that v20.2 nodes would be unable to decode. So v21.1 nodes would need to be very careful to always set the "from_clock" flag on all timestamps when in a mixed version cluster. But this all made the migration towards not setting the flag once in a v21.1-only cluster even more difficult. In the end, opting into "is_synthetic" is a lot easier to work with than opting into "from_clock", so that's how the rest of this PR operates.

58456: build: fix Pebble nightly benchmarks r=jbowens a=jbowens



Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 1, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drop the replica mutex,
then propose a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely align the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worst-case
latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transaction's uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
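
For readers skimming this commit message, a stubbed Go sketch of the reordered request path it describes (all names are illustrative, not the actual kvserver code):

```go
package main

import "fmt"

// latchGuard stands in for the latches a request holds while it runs.
type latchGuard struct{}

func acquireLatches() latchGuard { return latchGuard{} }
func (latchGuard) release()      {}

// Stubs standing in for the real lease machinery.
func leaseValid() bool              { return true }
func redirectOrAcquireLease() error { return nil }

// executeRequest sketches the new ordering: latch first, then check the
// lease, and treat an invalid lease like other "concurrency retry
// errors" by dropping latches, resolving the lease, and retrying.
func executeRequest() (string, error) {
	for {
		lg := acquireLatches()
		if !leaseValid() {
			lg.release()
			if err := redirectOrAcquireLease(); err != nil {
				return "", err
			}
			continue
		}
		resp := "evaluated under a valid lease"
		lg.release()
		return resp, nil
	}
}

func main() {
	resp, err := executeRequest()
	fmt.Println(resp, err)
}
```
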
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/observedMaxTimestamp branch from 9c200b3 to a8c5ffe Compare February 1, 2021 16:02
Collaborator

@sumeerbhola sumeerbhola left a comment


I looked through some parts of this, including the change to pebble_mvcc_scanner.go (I'm unable to get reviewable to behave reasonably with many commits). I'll do a more careful pass tomorrow.

Reviewed 1 of 22 files at r6, 1 of 1 files at r9, 1 of 8 files at r10, 1 of 8 files at r11, 2 of 61 files at r15.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 1203 at r15 (raw file):

// behalf of a transaction), will result in a WriteIntentError. Because
// of this, the function is only called from places where intents have
// already been considered.

Was this "places where intents have already been considered" always the case, or is this new here?

Collaborator

@sumeerbhola sumeerbhola left a comment


Finished a full pass -- only some small comments in addition to my earlier question in mvcc.go.

Reviewed 3 of 8 files at r11, 10 of 18 files at r12, 9 of 18 files at r14, 28 of 61 files at r15.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/observedts/doc.go, line 152 at r15 (raw file):

// of the lease that the request is executing under before using it to limit the
// request's uncertainty interval.
//

Should there be an addition to this comment to account for #36431?


pkg/kv/kvserver/observedts/doc.go, line 173 at r15 (raw file):

// being served from a different replica in the range, the observed timestamp
// still places a bound on the values in the range that may have been written
// before the transaction began.

It would also need to know when the leaseholder acquired the lease for the range, yes?


pkg/roachpb/data.go, line 1486 at r15 (raw file):

	// This is tracked in #57685. We can now use the LocalUncertaintyLimit in
	// place of clockTS, which handles follower reads correctly.
	clockTS, ok := txn.GetObservedTimestamp(origin)

If the uncertainty was caused by the LocalUncertaintyLimit being adjusted by the lease acquisition time (which I think is not reflected in GetObservedTimestamp, since it is specific to a replica and not the node), won't this time be too low to remove the uncertainty? If so, will it keep retrying at this low timestamp indefinitely?


pkg/roachpb/data.proto, line 361 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Done with the help of #59086. How do you two like this approach?

looks good

craig bot pushed a commit that referenced this pull request Feb 6, 2021
59086: kv: move range lease checks and transfers below latching r=nvanbenschoten a=nvanbenschoten

Needed for #57688.

This PR reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general.

### Background

Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder.

This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the minLeaseProposedTS to stop using the current lease, drop the replica mutex, then propose a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock _before_ checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend it.

### Motivation

The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to look for this information is the timestamp cache, which records the upper bound read time of each key span in a range, even if this upper bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem, because if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!).

### Solution

The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we don't only block new requests, but we also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during TransferLease request evaluation, but a future commit will.

### Other benefits

Besides this primary change, a number of other benefits fall out of this restructuring.

1. we avoid relying on custom synchronization around leases, instead relying on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper-bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere.
3. we more closely align the lease acquisition handling with the handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of the checks in `checkExecutionCanProceed`. So we grab the replica read lock one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers (and range merges) to avoid retry errors / transaction aborts on the new leaseholder. This commit will be followed up by one that ships a very basic summary of a leaseholder's timestamp cache during lease transfers. However, this would now be trivial to extend with higher resolution information, given some size limit. Perhaps we prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential for an arbitrary delay due to latching and waiting on locks between when the lease is checked and when a request evaluates, so we no longer need checks like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to address plumbing comments on #57077.
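
The retry structure mentioned in item 3 can be sketched as follows, again with simplified, hypothetical names; the real request path is considerably more involved:

```go
package sketch

import "errors"

type request struct{}
type response struct{}
type latchGuard struct{}

func (latchGuard) release() {}

var (
	errInvalidLease    = errors.New("invalid lease")
	errMergeInProgress = errors.New("merge in progress")
)

func acquireLatches(request) latchGuard      { return latchGuard{} }
func checkExecutionCanProceed(request) error { return nil }
func waitForCondition(error)                 {}
func evaluate(request) (response, error)     { return response{}, nil }

// isConcurrencyRetryError mirrors the classification described above.
func isConcurrencyRetryError(err error) bool {
	return errors.Is(err, errInvalidLease) || errors.Is(err, errMergeInProgress)
}

// execute sketches the request path: grab latches, check range state, and
// on a retryable condition drop latches, wait for the condition to resolve
// (e.g. acquire the lease), then retry.
func execute(req request) (response, error) {
	for {
		lat := acquireLatches(req)
		err := checkExecutionCanProceed(req)
		if err == nil {
			resp, evalErr := evaluate(req)
			lat.release()
			return resp, evalErr
		}
		lat.release()
		if !isConcurrencyRetryError(err) {
			return response{}, err
		}
		waitForCondition(err)
	}
}
```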

### Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve attention.

The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like #22837 and #32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worst-case latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are now applied to a request to reduce its MaxOffset only after latching and locking, instead of before. This sounds concerning, but it isn't, for two reasons. First, as of #57136, a transaction's uncertainty interval is no longer considered by the lock table, because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, using the observed timestamp-limited MaxOffset for latching is no longer correct in a world with synthetic timestamps (see #57077), so we would have had to make this change anyway. Put together, this behavioral change isn't meaningful.
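
A rough ordering sketch of this second change, with hypothetical names standing in for the real types:

```go
package sketch

type timestamp struct{ wallNanos int64 }

type txnState struct {
	readTS                 timestamp
	globalUncertaintyLimit timestamp // formerly MaxTimestamp
}

type latchGuard struct{}

func (latchGuard) release() {}

func acquireLatches(lower, upper timestamp) latchGuard    { return latchGuard{} }
func computeLocalUncertaintyLimit(txn txnState) timestamp { return txn.globalUncertaintyLimit }
func evaluateRead(txn txnState, localLimit timestamp)     {}

// serveRead sketches the new ordering: latches are declared over the
// transaction's full (global) uncertainty interval, and the tighter local
// limit is computed on the stack only after latching and locking, without
// mutating the transaction proto.
func serveRead(txn txnState) {
	lat := acquireLatches(txn.readTS, txn.globalUncertaintyLimit)
	defer lat.release()

	localLimit := computeLocalUncertaintyLimit(txn) // below latches
	evaluateRead(txn, localLimit)
}
```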

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/observedMaxTimestamp branch 2 times, most recently from 5535ba0 to a57189e Compare February 9, 2021 02:21
@nvanbenschoten nvanbenschoten (Member, Author) left a comment

Thanks for the review!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/observedts/doc.go, line 152 at r15 (raw file):

Previously, sumeerbhola wrote…

Should there be an addition to this comment to account for #36431

Yes, absolutely. I was planning on adding that once we fix #36431 by setting the synthetic flag when intents are moved during intent resolution.


pkg/kv/kvserver/observedts/doc.go, line 173 at r15 (raw file):

Previously, sumeerbhola wrote…

It would also need to know when the leaseholder acquired the lease for the range, yes?

Could you expand on this? The follower knows the lease start time, if that's what you mean.


pkg/roachpb/data.go, line 1486 at r15 (raw file):

Previously, sumeerbhola wrote…

If the uncertainty was caused by the LocalUncertaintyLimit being adjusted by the lease acquisition time (which I think is not reflected in GetObservedTimestamp, since it is specific to a replica and not the node), won't this timestamp be too low to remove the uncertainty? If so, will it keep retrying at this low timestamp indefinitely?

I think this case is handled by the forward to err.ExistingTimestamp. In theory, that's all we need to bump our timestamp to in order to ensure that we observe the uncertain value after a restart and avoid an infinite retry loop. I guess we decided in 9eb55b7#diff-230152fb9c39bac311935d3cda2ec1d43791a94de9316a1366b314b21337f51eR735 that we might as well also forward the restart time up to the observed timestamp to avoid any other uncertainty restarts on this node.

I'm tracking this in #57685, so I'll clean this up and provide a better comment when I address that sometime in the next two weeks.


pkg/storage/mvcc.go, line 1203 at r15 (raw file):

Previously, sumeerbhola wrote…

Was this "places where intents have already been considered" always the case, or is this new here?

This was already the case, but we didn't call that out anywhere.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/observedMaxTimestamp branch from a57189e to 27d3358 Compare February 9, 2021 03:02
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 9, 2021
…ntervalRetryTimestamp

Fixes cockroachdb#57685.
All commits except last from cockroachdb#57077.

This commit updates readWithinUncertaintyIntervalRetryTimestamp to use the
LocalUncertaintyLimit field from ReadWithinUncertaintyIntervalError in place of
ObservedTimestamps directly. This
addresses a bug where ReadWithinUncertaintyIntervalErrors thrown by follower
replicas during follower reads would use meaningless observed timestamps.

This could use a test or two. I mainly wanted to get this out to garner opinions
about the question included in the final commit.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/observedMaxTimestamp branch from 27d3358 to 06618a2 Compare February 10, 2021 07:12
@sumeerbhola sumeerbhola (Collaborator) left a comment

Looks good to me, but I'm not well versed in some of the KV code.

:lgtm:

Reviewed 2 of 31 files at r1, 5 of 22 files at r6, 9 of 20 files at r7, 9 of 33 files at r8, 11 of 97 files at r16, 3 of 8 files at r18, 7 of 18 files at r19, 12 of 18 files at r21, 59 of 61 files at r22.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 36 at r15 (raw file):

	//
	// Because of this, it declares a non-MVCC write over every addressable key
	// in the range, even through the only key the TransferLease actually writes

s/through/though/


pkg/kv/kvserver/batcheval/cmd_push_txn.go, line 127 at r15 (raw file):

	}
	now := cArgs.EvalCtx.Clock().Now()
	// TODO(nvanbenschoten): remove this limitation. But when doing so,

what is "this limitation"?


pkg/kv/kvserver/observedts/doc.go, line 173 at r15 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you expand on this? The follower knows the lease start time, if that's what you mean.

Yes, that's what I meant -- otherwise we could set the local uncertainty limit too low for follower reads. Does the follower maintain the history of lease start times, or will it just use the current lease's start time? I am assuming the latter is sufficient, and it seems that is what limit.go is doing. Some crude reasoning to convince myself (it would be nice to have a refined sketch of such reasoning as a code comment in limit.go):
T_txn: transaction time
T_lease_start: leaseholder's lease start time
T_closed_time: closed time from that leaseholder
T_txn <= T_closed_time to do follower reads
If T_txn >= T_lease_start, I think the correctness is straightforward.
However, it is possible that T_lease_start > T_txn, and that the txn had an observed timestamp from the earlier leaseholder. But we are safe because T_lease_start is greater than that observed timestamp from the earlier leaseholder.

Is this correct?
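
For what it's worth, a toy sketch of one plausible reading of this reasoning (hypothetical names; not the actual limit.go code):

```go
package sketch

type timestamp struct{ wallNanos int64 }

func maxTS(a, b timestamp) timestamp {
	if a.wallNanos >= b.wallNanos {
		return a
	}
	return b
}

func minTS(a, b timestamp) timestamp {
	if a.wallNanos <= b.wallNanos {
		return a
	}
	return b
}

// localUncertaintyLimit follows the reasoning above: an observed timestamp
// taken from the leaseholder's node can only rule out values written under
// the current lease, so it is forwarded by the lease start time, and the
// result can never exceed the transaction's global uncertainty limit.
func localUncertaintyLimit(globalLimit, observed, leaseStart timestamp) timestamp {
	return minTS(globalLimit, maxTS(observed, leaseStart))
}
```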


pkg/roachpb/data.go, line 1486 at r15 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think this case is handled by the forward to err.ExistingTimestamp. In theory, that's all we need to bump our timestamp to in order to ensure that we observe the uncertain value after a restart and avoid an infinite retry loop. I guess we decided in 9eb55b7#diff-230152fb9c39bac311935d3cda2ec1d43791a94de9316a1366b314b21337f51eR735 that we might as well also forward the restart time up to the observed timestamp to avoid any other uncertainty restarts on this node.

I'm tracking this in #57685, so I'll clean this up and provide a better comment when I address that sometime in the next two weeks.

(just replaying what you said to make sure I understood it right)
We expect the output of txn.GetObservedTimestamp(origin) to be typically > err.ExistingTimestamp, except when the local uncertainty was bumped up due to lease acquisition, where the latter could be higher. Either way we are increasing the txn timestamp beyond the write seen within the uncertainty limit, which ensures liveness.

A code comment here would be helpful.

My comment here was not about follower reads, but I'm assuming you mean that follower reads need improving here because the origin is the follower node and we don't want the follower node's observed timestamp. Correct?
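
A toy sketch of the retry-timestamp computation under discussion (hypothetical names; `existing` stands for err.ExistingTimestamp and `observed` for the origin node's observed timestamp):

```go
package sketch

type timestamp struct{ wallNanos int64 }

// retryTimestamp sketches the computation discussed above: bump just past
// the uncertain value (which guarantees liveness), then forward to the
// origin node's observed timestamp if it is higher, to avoid further
// uncertainty restarts on that node.
func retryTimestamp(existing, observed timestamp) timestamp {
	ts := timestamp{wallNanos: existing.wallNanos + 1} // existing.Next()
	if observed.wallNanos > ts.wallNanos {
		ts = observed
	}
	return ts
}
```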

Once we have seeked to a version, it's not possible for the current key
to be the same and for the timestamp of the current value to be above
the seek version. So this commit removes the confusing condition.

While doing so, rename a few variables.
This commit introduces a new LocalUncertaintyLimit field on
MVCCScanOptions and MVCCGetOptions. The field is currently unused, but
will be tied into MVCC logic. To support this, we introduce a new test
that will be updated once the field begins to be used.
…tamps

This commit updates the `pebbleMVCCScanner` to understand the new
`LocalUncertaintyLimit` field and its interaction with synthetic
timestamps. These interactions make the determination of when a value is
uncertain more complex, as they create situations where some values in a
transaction's full uncertainty interval are uncertain while others
aren't. This requires forward iteration in a few new places.

To centralize this logic, we pull it into a new `observedts.IsUncertain`
helper, which will also be used by the lock table.

To test this out, the commit updates the corresponding mvcc_histories
testdata file.
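
The heart of that helper can be sketched as follows (simplified stand-in types; the sketch assumes the value's timestamp is already known to exceed the transaction's read timestamp):

```go
package sketch

// timestamp is a simplified stand-in for hlc.Timestamp, including the
// synthetic flag used by future-time writes.
type timestamp struct {
	wallNanos int64
	synthetic bool
}

func (t timestamp) lessEq(o timestamp) bool { return t.wallNanos <= o.wallNanos }

// isUncertain sketches the centralized uncertainty check: values with
// synthetic timestamps cannot be ruled out by observed timestamps, so they
// are compared against the global uncertainty limit; all other values can
// use the tighter local limit.
func isUncertain(localLimit, globalLimit, valueTS timestamp) bool {
	if valueTS.synthetic {
		return valueTS.lessEq(globalLimit)
	}
	return valueTS.lessEq(localLimit)
}
```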
…alError

This commit adds the new LocalUncertaintyLimit value to the
ReadWithinUncertaintyIntervalError proto.
Will be fixed in next commit. This is important because we can often
refresh away a WriteTooOld error on the server, but we don't do as good
of a job refreshing away ReadWithinUncertaintyInterval errors on the
server. It's also a more consistent error to return in this situation.
This commit replaces `LimitTxnMaxTimestamp` with `ComputeLocalUncertaintyLimit`.
Instead of updating the transaction proto's MaxTimestamp field, we now keep a
localUncertaintyLimit value on the stack during evaluation and use this to limit
uncertainty restarts. This makes things easier to understand, as we no longer
modify a request's transaction in-place.

In doing so, this also fixes a bug where an `mvccPutInternal` could return a
`ReadWithinUncertaintyInterval` error in cases where it should have returned a
`WriteTooOld` error.
This commit renames Txn.MaxTimestamp to Txn.GlobalUncertaintyLimit, to
parallel the existing localUncertaintyLimit. The commit updates various
commentary to reflect this change.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/observedMaxTimestamp branch from 06618a2 to 64aadfe Compare February 11, 2021 04:19
@nvanbenschoten nvanbenschoten (Member, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 36 at r15 (raw file):

Previously, sumeerbhola wrote…

s/through/though/

Done.


pkg/kv/kvserver/batcheval/cmd_push_txn.go, line 127 at r15 (raw file):

Previously, sumeerbhola wrote…

what is "this limitation"?

The limitation was that transactions could not be pushed into the future with synthetic timestamps. This will be fixed by #59693.


pkg/kv/kvserver/observedts/doc.go, line 173 at r15 (raw file):

Previously, sumeerbhola wrote…

Yes, that's what I meant -- otherwise we could set the local uncertainty limit too low for follower reads. Does the follower maintain the history of lease start times, or will it just use the current lease's start time? I am assuming the latter is sufficient, and it seems that is what limit.go is doing. Some crude reasoning to convince myself (it would be nice to have a refined sketch of such reasoning as a code comment in limit.go):
T_txn: transaction time
T_lease_start: leaseholder's lease start time
T_closed_time: closed time from that leaseholder
T_txn <= T_closed_time to do follower reads
If T_txn >= T_lease_start, I think the correctness is straightforward.
However, it is possible that T_lease_start > T_txn, and that the txn had an observed timestamp from the earlier leaseholder. But we are safe because T_lease_start is greater than that observed timestamp from the earlier leaseholder.

Is this correct?

Just the current leaseholder. This code is actually agnostic to whether the read is off a leaseholder or a follower. I agree that we could discuss this more in the comment in observedts/doc.go's Follower Reads section. I'll do that when I go back to update that comment.


pkg/roachpb/data.go, line 1486 at r15 (raw file):

We expect the output of txn.GetObservedTimestamp(origin) to be typically > err.ExistingTimestamp, except when the local uncertainty was bumped up due to lease acquisition, where the latter could be higher. Either way we are increasing the txn timestamp beyond the write seen within the uncertainty limit, which ensures liveness.

Yes, this all sounds correct.

A code comment here would be helpful.

Agreed. We can discuss what more to add in #60021. The comment in there says a lot of this, but it may be missing aspects.

My comment here was not about follower reads, but I'm assuming you mean that follower reads need improving here because the origin is the follower node and we don't want the follower node's observed timestamp. Correct?

That was my original thinking, but I now think I was wrong. We will have an observed timestamp from the follower node, but it won't be relevant to the value because it can't be applied to establish a local uncertainty limit. So #60021 might not actually be necessary, but it seems like a good improvement regardless.

…terNewLease

There is no guarantee that the receiving end of a lease transfer will
have a higher clock than the sending end after the lease transfer
completes. There is only a guarantee that the receiving end will have a
higher clock than the sending end had before the transfer began.

This is likely fallout from the recent switch to the use of
HybridManualClock in this test, but showed up in CI on this PR because
the test rename triggered CI to run it under extra stress.
@nvanbenschoten nvanbenschoten (Member, Author) commented:

bors r+

@craig craig bot (Contributor) commented Feb 11, 2021

Build succeeded:

@craig craig bot merged commit ff2a625 into cockroachdb:master Feb 11, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/observedMaxTimestamp branch February 11, 2021 07:05
Labels: A-multiregion (Related to multi-region)