Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: ship timestamp cache summary during lease transfers and range merges #60521

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Feb 12, 2021

Fixes #57688.
Fixes #59679.
Fixes #60520.

This commit introduces new logic to ship summaries of a leaseholders timestamp cache through lease transfers and range merges. For lease transfers, the read summary is sent from the outgoing leaseholder to the incoming leaseholder. For range merges, the read summary is sent from the right-hand side leaseholder (through the SubsumeResponse), to the left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role of the lease start time and merge freeze time used to play for lease transfers and range merges, respectively - the summaries instruct the post-operation leaseholder on how to update its timestamp cache to ensure that no future writes are allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:

  1. they can transfer a higher-resolution snapshot of the reads on the range through a lease transfer, to make the lease transfers less disruptive to writes because the timestamp cache won't be bumped as high. This avoids transaction aborts and retries after lease transfers and merges.
  2. they can transfer information about reads with synthetic timestamps, which are not otherwise captured by the new lease's start time. Because of this, they are needed for correctness on global_read ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very low-resolution read summaries. However, it sets up the infrastructure that will allow us to realize the benefit in the future by capturing and shipping higher-resolution read summaries. The commit does realize the second benefit, as it fixes correctness issues around future time reads.


The commit also fixes a related bug that was revealed during the development of this patch. As explained in #60520, it was possible for a range merge to be applied to the leaseholder of the LHS of the merge through a Raft snapshot. In such cases, we were not properly updating the leaseholder's timestamp cache to reflect the reads served on the RHS range. This could allow the post-merged range to invalidate reads served by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure. Merge triggers now write to the left-hand side's prior read summary with a read summary gathered from the right-hand side during subsumption. Later, upon ingesting a Raft snapshot, we check if we subsumed any replicas and if we are the leaseholder. If both of those conditions are true, we forward the replica's timestamp cache to the read summary on the range. Since this read summary must have been updated by the merge trigger, it will include all reads served on the pre-merge RHS range.

The existence of this bug was verified by a new variant of TestStoreRangeMergeTimestampCache, which only passes with the rest of this commit.


Release note (bug fix): Fixes a very rare, possible impossible in practice, bug where a range merge that applied through a Raft snapshot on the left-hand side range's leaseholder could allow that leaseholder to serve writes that invalidated reads from before the merge on the right-hand side.

Release justification: bug fix

@nvanbenschoten nvanbenschoten requested review from bdarnell, tbg and a team February 12, 2021 07:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch 2 times, most recently from f23254a to 669b44d Compare February 12, 2021 07:30
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all makes sense to me 👍

Reviewed 8 of 10 files at r2, 5 of 8 files at r3, 50 of 50 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/clusterversion/cockroach_versions.go, line 239 at r4 (raw file):

	PostTruncatedAndRangeAppliedStateMigration
	// PriorReadSummaries introduces support for the use of read summary objects
	// ship information about reads on a range through lease changes and range

missing word (that?)


pkg/kv/kvserver/client_lease_test.go, line 444 at r4 (raw file):

}

// TestStoreLeaseTransferTimestampCacheTxnRecord the error returned by attempts

missing word (checks?)


pkg/kv/kvserver/client_merge_test.go, line 388 at r4 (raw file):

// - disjointLeaseholders: configures whether or not the leaseholder of the
//     LHS range is disjoint from the leaseholder of the RHS range. If false,
//     the leaseholders are collacted before the merge is initiated.

collocated


pkg/kv/kvserver/client_merge_test.go, line 416 at r4 (raw file):

) {
	// mergeCommitFilter is used to issue a sequence of operations on the LHS of
	// a range merge immediately before it

.


pkg/kv/kvserver/client_merge_test.go, line 420 at r4 (raw file):

	// blockHBAndGCs is used to black hole Heartbeat and GC requests for the
	// duration of the merge on the throughSnapshot path. Neither request type
	// is needed and both can create issues during the split leader-leaseholder

Can you document what these issues are?


pkg/kv/kvserver/client_merge_test.go, line 556 at r4 (raw file):

	}

	// Pause the cluster's clock. This accomplishes two things:

Random comment, this comment merge_queue.go seems wrong:

// The current implementation of merges requires rewriting the right-hand data
// onto the left-hand range, even when the ranges are collocated. This is
// expensive, so limit to one merge at a time.


pkg/kv/kvserver/client_merge_test.go, line 580 at r4 (raw file):

		//
		// But there's a wrinkle here that makes things more difficult: the
		// leaseholder needs to play a role in coordinating the range merge and

Could you refresh my memory on how this even works today? I thought we make sure the LHS and RHS lease were collocated, but don't think this true. I understand that once we have the SubsumeRequest applied on all RHS followers, the RHS lease won't change (unless the merge aborts). But what about the LHS? Nothing in (*Replica).AdminMerge or merge_queue.go seems to care. Sure, the merge queue checks for the lease, but it seems easily possible for the merge queue to check the lease, then the lease to move somewhere else, and then we run AdminMerge on a replica that's not the leaseholder, and I think it all should still work, right? I don't think (?) that how this all works is documented anywhere - now would be a good time for a braindump.


pkg/kv/kvserver/client_merge_test.go, line 585 at r4 (raw file):

		// leader-leaseholder state and lock down all communication between the
		// two _except_ for forwarded proposal from the leaseholder to the
		// leader. This allowed the leaseholder to make proposals, even though

allows


pkg/kv/kvserver/client_merge_test.go, line 588 at r4 (raw file):

		// it won't be able to hear their result. Because this is such a fragile
		// state, we enter it as late as possible - after the merge begins and
		// only upon receiving the merge's EndTxn request.

Oof, it's amazing that you got this done, but is there an alternative approach that's less susceptible to a bus factor of 1 and random breakage? We initiate the merge without any kind of partition, but then when the leaseholder wants to apply the merge trigger, we remove its replica instead. Then we lean back and wait for it to reappear, which necessarily must happen via a snapshot. That seems a lot more straightforward, the possible wrinkle being that "remove it instead of applying a command" might need some work, but I think Andrew's "replica can't exist in-mem across replicaID changes" work maybe helped?


pkg/kv/kvserver/replica_raftstorage.go, line 1001 at r4 (raw file):

	// we missed the application of a merge) and we are the new leaseholder, we
	// make sure to update the timestamp cache using the prior read summary to
	// account for any reads that were served on the right-hand side range(s).

💯


pkg/kv/kvserver/replica_tscache.go, line 553 at r4 (raw file):

		// will be closed next under the current lease. This will be easier to
		// think through with the new closed timestamp system.
		sum.Merge(rspb.FromTimestamp(maxClosed))

Agree, this seems impossible to get right with closed timestamps as we have them today. Doesn't this mean that during lease application, we always need to take the pessimistic update to the tsCache where we forward to newLease.Start?


pkg/kv/kvserver/store_merge.go, line 93 at r4 (raw file):

		// timestamps in the timestamp cache. For a full discussion, see the
		// comment on TestStoreRangeMergeTimestampCacheCausality.
		s.Clock().Update(freezeStart)

And we don't have to update with anything from rightReadSum because it can't contain anything ahead of freezeStart. That seems like a nice assertion to throw in?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 163 at r4 (raw file):

					EndKey: keys.MakeRangeIDReplicatedPrefix(mt.RightDesc.RangeID).PrefixEnd(),
				})
				// Merges the prior read summary from the RHS over to the LHS,

Merge incorporate the prior read summary from the RHS into the LHS',

Another thing to explain: we don't latch the RHS read summary because this is already provided to us with the merge trigger (it comes out of subsume request). And why is this different from the abort span, which is copied over inline in the merge trigger itself?

Something else that should be clarified is which part of the summary is optimization. For example, could we replace the summary by freezeStart throughout and things would be correct (similar to how in lease transfers we could fall back to the new lease start time)? Both seem like good assertions to add btw since we expect there to always be a relation between the summary and the respective fallback.


pkg/kv/kvserver/batcheval/cmd_lease.go, line 132 at r4 (raw file):
This is a totally standard migration, but still good to spell it out?

We elide this step in mixed-version clusters as old nodes would ignore the PriorReadSummary field (they don't know about it). It's possible that in this particular case we could get away with it (as the in-mem field only ever updates in-mem state) but it's easy to get things wrong (in which case they could easily take a catastrophic turn) and the benefit is low.


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 92 at r4 (raw file):

	// reads at future times above the proposed lease start time.
	//
	// We can remove this in the future when we increase the resolution of read

Ah, ok. This answers my comment on GetCurrentReadSummary. Please drop a hint in that method so that the next reader's alarm bells don't go off :-)


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 176 at r4 (raw file):

  // 2. it can transfer information about reads with synthetic timestamps, which
  //    are not otherwise captured by the new lease's start time.
  kv.kvserver.readsummary.ReadSummary prior_read_summary = 22;

Mention that if this field is set, this proposal also persists the same summary under its replicated key.
The pattern here is almost silly btw. The replicated key is in the proposal that just got applied, so we could just put a bool here and read it from disk. I know it doesn't matter either way, but the way it is now you always worry a tiny but about ways in which you could have this field set and the thing on disk be different. OTOH with a bool you'd worry about cases where the bool is (not) set and the key has (not) changed, so still the same. I guess assertStateLocked takes care of it.


pkg/kv/kvserver/readsummary/rspb/summary.go, line 31 at r4 (raw file):

// Clone performs a deep-copy of the receiver.
func (c ReadSummary) Clone() *ReadSummary {

Doesn't this field have the usual proto pointer thingies? I guess that is fine, but is the copy really "deep"?


pkg/kv/kvserver/readsummary/rspb/summary.proto, line 24 at r4 (raw file):

// long as the resulting timestamp for each key does not regress. However, the
// timestamp of a key is allowed to advance as precision drops. This parallels a
// similar ratchetting policy in the timestamp cache (tscache.Cache).

If it's "crocheting" as my spellcheck claims it is, it's probably "ratcheting" too


pkg/kv/kvserver/readsummary/rspb/summary.proto, line 26 at r4 (raw file):

// similar ratchetting policy in the timestamp cache (tscache.Cache).
//
// For example, a high-resolution version of the summary may look like:

🏅 loving the illustration


pkg/roachpb/data.proto, line 168 at r4 (raw file):

  //    high.
  // 2. it can transfer information about reads with synthetic timestamps, which
  //    are not otherwise captured by the FreezeStart clock timestamp.

It's nice that this has the same comment, but it might be more ergonomical to be terse here and refer the reader to SubsumeResponse. That way, if the latter gets better, the reader won't assume they've learned all there is to know here.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:LGTM: once you fix the freeze time <-> summary interaction

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)


pkg/keys/printer_test.go, line 77 at r4 (raw file):

		{keys.RaftTruncatedStateKey(roachpb.RangeID(1000001)), "/Local/RangeID/1000001/u/RaftTruncatedState", revertSupportUnknown},
		{keys.RangeLeaseKey(roachpb.RangeID(1000001)), "/Local/RangeID/1000001/r/RangeLease", revertSupportUnknown},
		{keys.RangePriorReadSummaryKey(roachpb.RangeID(1000001)), "/Local/RangeID/1000001/r/RangePriorReadSummary", revertSupportUnknown},

is revertSupportUnknown indeed what you want?


pkg/kv/kvserver/replica_tscache.go, line 553 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Agree, this seems impossible to get right with closed timestamps as we have them today. Doesn't this mean that during lease application, we always need to take the pessimistic update to the tsCache where we forward to newLease.Start?

yeah, which I think we should do by having this method return a null before a cluster version. That cluster version would be my PR, but perhaps you want to introduce a dummy cluster version in this patch.


pkg/kv/kvserver/batcheval/cmd_subsume.go, line 155 at r4 (raw file):

	reply.FreezeStart = cArgs.EvalCtx.Clock().NowAsClockTimestamp()
	sum := cArgs.EvalCtx.GetCurrentReadSummary()
	reply.ReadSummary = &sum

forward to the freeze time like we discussed
When you do, comment that leaving the summary nil is not an option because we need it for the snapshot case.


pkg/kv/kvserver/batcheval/eval_context.go, line 106 at r4 (raw file):

	// return a meaningful summary if the caller has serialized with all other
	// requests on the range.
	GetCurrentReadSummary() rspb.ReadSummary

is there a way to assert the latches? Can this take a latchGuard?


pkg/kv/kvserver/readsummary/persist.go, line 24 at r4 (raw file):

)

// Load loads the range's prior read summary.

Say that it returns nil if there's no summary.


pkg/kv/kvserver/readsummary/rspb/summary.proto, line 22 at r4 (raw file):

// summary in that it may not represent these reads with perfect precision.
// Instead, it is allowed to lose resolution in exchange for reduced space, as
// long as the resulting timestamp for each key does not regress. However, the

I don't understand the "however"; the second sentence seems to just clarify the previous one


pkg/kv/kvserver/readsummary/rspb/summary.proto, line 26 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

🏅 loving the illustration

progressive JPEG


pkg/roachpb/api.proto, line 1659 at r4 (raw file):

  // being subsumed). It is suitable for use as the timestamp cache's low water
  // mark for the keys previously owned by the subsumed range.
  util.hlc.Timestamp freeze_start = 5 [(gogoproto.nullable) = false,

put the TODO saying that this is going away here too pls

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch 2 times, most recently from 7e27ce8 to 08287cb Compare February 16, 2021 07:17
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch from 08287cb to 0e2cd4e Compare February 23, 2021 20:28
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @benesch, and @tbg)


pkg/clusterversion/cockroach_versions.go, line 239 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

missing word (that?)

Done.


pkg/keys/printer_test.go, line 77 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is revertSupportUnknown indeed what you want?

Yes, like all of these range-id local keys.


pkg/kv/kvserver/client_lease_test.go, line 444 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

missing word (checks?)

Done.


pkg/kv/kvserver/client_merge_test.go, line 388 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

collocated

Done.


pkg/kv/kvserver/client_merge_test.go, line 416 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

.

Done.


pkg/kv/kvserver/client_merge_test.go, line 420 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you document what these issues are?

Done.


pkg/kv/kvserver/client_merge_test.go, line 556 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Random comment, this comment merge_queue.go seems wrong:

// The current implementation of merges requires rewriting the right-hand data
// onto the left-hand range, even when the ranges are collocated. This is
// expensive, so limit to one merge at a time.

Done.


pkg/kv/kvserver/client_merge_test.go, line 580 at r4 (raw file):

I thought we make sure the LHS and RHS lease were collocated, but don't think this true.

No, this is not true. The leases can be on different nodes.

I understand that once we have the SubsumeRequest applied on all RHS followers, the RHS lease won't change (unless the merge aborts).

I don't think this is quite true either. The SubsumeRequest lands on a RHS leaseholder, but doesn't get propagated to followers (it doesn't go through Raft). If the RHS leaseholder fails, a new lease can be acquired, in which case the new leaseholder will notice the deletion intent on the local descriptor and remain "subsumed" - see maybeWatchForMergeLocked. This is actually related to the deadlock we found in #60905.

But what about the LHS? Nothing in (*Replica).AdminMerge or merge_queue.go seems to care. Sure, the merge queue checks for the lease, but it seems easily possible for the merge queue to check the lease, then the lease to move somewhere else, and then we run AdminMerge on a replica that's not the leaseholder, and I think it all should still work, right?

AdminMerge can run on a non-leaseholder, but the actual EndTxn with the MergeTrigger will always go through the leaseholder.

I don't think (?) that how this all works is documented anywhere - now would be a good time for a braindump.

It is, actually! @benesch's parting gift was to leave us with docs/tech-notes/range-merges.md, which I've found useful on multiple occasions.


pkg/kv/kvserver/client_merge_test.go, line 585 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

allows

Done.


pkg/kv/kvserver/client_merge_test.go, line 588 at r4 (raw file):

we remove its replica instead

What do you mean by this? I don't think we'll be able to run any ChangeReplicas operations because doing so requires touching the range descriptor, which is already covered by the merge's intent.


pkg/kv/kvserver/replica_tscache.go, line 553 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

yeah, which I think we should do by having this method return a null before a cluster version. That cluster version would be my PR, but perhaps you want to introduce a dummy cluster version in this patch.

Done.


pkg/kv/kvserver/store_merge.go, line 93 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

And we don't have to update with anything from rightReadSum because it can't contain anything ahead of freezeStart. That seems like a nice assertion to throw in?

It can contain reads ahead of freezeStart now that we can perform reads in the future of present-time. But those reads must all be at synthetic timestamps.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 163 at r4 (raw file):

Merge incorporate the prior read summary from the RHS into the LHS',

Done.

Another thing to explain: we don't latch the RHS read summary because this is already provided to us with the merge trigger (it comes out of subsume request). And why is this different from the abort span, which is copied over inline in the merge trigger itself?

Done. Wow, I didn't realize that's how we copy over the abort span.

Something else that should be clarified is which part of the summary is optimization. For example, could we replace the summary by freezeStart throughout and things would be correct (similar to how in lease transfers we could fall back to the new lease start time)? Both seem like good assertions to add btw since we expect there to always be a relation between the summary and the respective fallback.

This isn't quite true once we start talking about future time operations.


pkg/kv/kvserver/batcheval/cmd_lease.go, line 132 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is a totally standard migration, but still good to spell it out?

We elide this step in mixed-version clusters as old nodes would ignore the PriorReadSummary field (they don't know about it). It's possible that in this particular case we could get away with it (as the in-mem field only ever updates in-mem state) but it's easy to get things wrong (in which case they could easily take a catastrophic turn) and the benefit is low.

Done.


pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 92 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ah, ok. This answers my comment on GetCurrentReadSummary. Please drop a hint in that method so that the next reader's alarm bells don't go off :-)

Done.


pkg/kv/kvserver/batcheval/cmd_subsume.go, line 155 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

forward to the freeze time like we discussed
When you do, comment that leaving the summary nil is not an option because we need it for the snapshot case.

Done.


pkg/kv/kvserver/batcheval/eval_context.go, line 106 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is there a way to assert the latches? Can this take a latchGuard?

Yeah, that's what we do in SpanSetReplicaEvalContext.


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 176 at r4 (raw file):
Done. The idea here was to allow a compressed version of the summary to be persisted and a high-resolution version of the summary to be passed through the ReplicatedEvalResult. That will allow for more efficient lease transfers (with respect to the timestamp cache) in the common case where a leaseholder applies the lease transfer / merge through the log without needing to risk storing large summaries indefinitely on a range in case the leaseholder applies the lease transfer / merge through a snapshot.

I added this to the comment.

When a ReadSummary is set in a ReplicatedEvalResult, there is always also a
write to the RangePriorReadSummaryKey in the RaftCommand.WriteBatch. The
persisted summary may be identical to the summary in this field, but it
does not have to be. Notably, we intended for the summary included in the
ReplicatedEvalResult to eventually be a much higher-resolution version of
the ReadSummmary than the version persisted. This scheme of persisting a
compressed ReadSummary indefinitely and including a higher-resolution
ReadSummary on the RaftCommand allows us to optimize for the common case
where the lease transfer is applied on the new leaseholder through Raft log
application while ensuring correctness in the case where the lease transfer
is applied on the new leaseholder through a Raft snapshot.


pkg/kv/kvserver/readsummary/persist.go, line 24 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Say that it returns nil if there's no summary.

Done.


pkg/kv/kvserver/readsummary/rspb/summary.go, line 31 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Doesn't this field have the usual proto pointer thingies? I guess that is fine, but is the copy really "deep"?

There aren't any pointers in ReadSummary at the moment, so it is deep, but you're right that this could easily rot. Added a comment.


pkg/kv/kvserver/readsummary/rspb/summary.proto, line 22 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't understand the "however"; the second sentence seems to just clarify the previous one

Done.


pkg/kv/kvserver/readsummary/rspb/summary.proto, line 24 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

If it's "crocheting" as my spellcheck claims it is, it's probably "ratcheting" too

Done.


pkg/roachpb/api.proto, line 1659 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

put the TODO saying that this is going away here too pls

Done.


pkg/roachpb/data.proto, line 168 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's nice that this has the same comment, but it might be more ergonomical to be terse here and refer the reader to SubsumeResponse. That way, if the latter gets better, the reader won't assume they've learned all there is to know here.

I added a bit more to this comment that is specific to MergeTriggers as they pass through Raft, so I'll keep this split for now, though I agree with everything you said in principle.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch from 0e2cd4e to 9ce9703 Compare February 23, 2021 21:33
@nvanbenschoten
Copy link
Member Author

bors r+

@nvanbenschoten
Copy link
Member Author

bors r-

@craig
Copy link
Contributor

craig bot commented Feb 23, 2021

Canceled.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch from 9ce9703 to 7ee1283 Compare February 24, 2021 03:12
@nvanbenschoten
Copy link
Member Author

bors r+

@nvanbenschoten
Copy link
Member Author

bors r-

We keep conflicting on updates to pkg/clusterversion/cockroach_versions.go.

@craig
Copy link
Contributor

craig bot commented Feb 24, 2021

Canceled.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch from 7ee1283 to a5d3e34 Compare February 24, 2021 05:05
@nvanbenschoten
Copy link
Member Author

bors r+

@cockroachdb cockroachdb deleted a comment from craig bot Feb 24, 2021
Fixes cockroachdb#57688.
Fixes cockroachdb#59679.
Fixes cockroachdb#60520.

This commit introduces new logic to ship summaries of a leaseholders
timestamp cache through lease transfers and range merges. For lease
transfers, the read summary is sent from the outgoing leaseholder to the
incoming leaseholder. For range merges, the read summary is sent from
the right-hand side leaseholder (through the SubsumeResponse), to the
left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role of the lease start time and merge
freeze time used to play for lease transfers and range merges,
respectively - the summaries instruct the post-operation leaseholder on
how to update its timestamp cache to ensure that no future writes are
allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:
1. they can transfer a higher-resolution snapshot of the reads on the
    range through a lease transfer, to make the lease transfers less
    disruptive to writes because the timestamp cache won't be bumped as
    high. This avoids transaction aborts and retries after lease
    transfers and merges.
2. they can transfer information about reads with synthetic timestamps,
    which are not otherwise captured by the new lease's start time.
    Because of this, they are needed for correctness on `global_read`
    ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very
low-resolution read summaries. However, it sets up the infrastructure
that will allow us to realize the benefit in the future by capturing and
shipping higher-resolution read summaries. The commit does realize the
second benefit, as it fixes correctness issues around future time reads.

----

The commit also fixes a related bug that was revealed during the
development of this patch. As explained in cockroachdb#60520, it was possible for a
range merge to be applied to the leaseholder of the LHS of the merge
through a Raft snapshot. In such cases, we were not properly updating
the leaseholder's timestamp cache to reflect the reads served on the RHS
range. This could allow the post-merged range to invalidate reads served
by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure.
Merge triggers now write to the left-hand side's prior read summary with
a read summary gathered from the right-hand side during subsumption.
Later, upon ingesting a Raft snapshot, we check if we subsumed any
replicas and if we are the leaseholder. If both of those conditions are
true, we forward the replica's timestamp cache to the read summary on
the range. Since this read summary must have been updated by the merge
trigger, it will include all reads served on the pre-merge RHS range.

----

Release note (bug fix): Fixes a very rare, possible impossible in
practice, bug where a range merge that applied through a Raft snapshot
on the left-hand side range's leaseholder could allow that leaseholder
to serve writes that invalidated reads from before the merge on the
right-hand side.

Release justification: bug fix
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/shipTimestampCache2 branch from a5d3e34 to a7472e3 Compare February 24, 2021 05:46
@nvanbenschoten
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 24, 2021

Build succeeded:

@craig craig bot merged commit 7f4c550 into cockroachdb:master Feb 24, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/shipTimestampCache2 branch February 25, 2021 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants