-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: ship timestamp cache summary during lease transfers and range merges #60521
kv: ship timestamp cache summary during lease transfers and range merges #60521
Conversation
f23254a
to
669b44d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all makes sense to me 👍
Reviewed 8 of 10 files at r2, 5 of 8 files at r3, 50 of 50 files at r4.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/clusterversion/cockroach_versions.go, line 239 at r4 (raw file):
PostTruncatedAndRangeAppliedStateMigration // PriorReadSummaries introduces support for the use of read summary objects // ship information about reads on a range through lease changes and range
missing word (that?)
pkg/kv/kvserver/client_lease_test.go, line 444 at r4 (raw file):
} // TestStoreLeaseTransferTimestampCacheTxnRecord the error returned by attempts
missing word (checks?)
pkg/kv/kvserver/client_merge_test.go, line 388 at r4 (raw file):
// - disjointLeaseholders: configures whether or not the leaseholder of the // LHS range is disjoint from the leaseholder of the RHS range. If false, // the leaseholders are collacted before the merge is initiated.
collocated
pkg/kv/kvserver/client_merge_test.go, line 416 at r4 (raw file):
) { // mergeCommitFilter is used to issue a sequence of operations on the LHS of // a range merge immediately before it
.
pkg/kv/kvserver/client_merge_test.go, line 420 at r4 (raw file):
// blockHBAndGCs is used to black hole Heartbeat and GC requests for the // duration of the merge on the throughSnapshot path. Neither request type // is needed and both can create issues during the split leader-leaseholder
Can you document what these issues are?
pkg/kv/kvserver/client_merge_test.go, line 556 at r4 (raw file):
} // Pause the cluster's clock. This accomplishes two things:
Random comment, this comment merge_queue.go seems wrong:
cockroach/pkg/kv/kvserver/merge_queue.go
Lines 39 to 41 in fa32402
// The current implementation of merges requires rewriting the right-hand data | |
// onto the left-hand range, even when the ranges are collocated. This is | |
// expensive, so limit to one merge at a time. |
pkg/kv/kvserver/client_merge_test.go, line 580 at r4 (raw file):
// // But there's a wrinkle here that makes things more difficult: the // leaseholder needs to play a role in coordinating the range merge and
Could you refresh my memory on how this even works today? I thought we make sure the LHS and RHS lease were collocated, but don't think this true. I understand that once we have the SubsumeRequest
applied on all RHS followers, the RHS lease won't change (unless the merge aborts). But what about the LHS? Nothing in (*Replica).AdminMerge
or merge_queue.go
seems to care. Sure, the merge queue checks for the lease, but it seems easily possible for the merge queue to check the lease, then the lease to move somewhere else, and then we run AdminMerge
on a replica that's not the leaseholder, and I think it all should still work, right? I don't think (?) that how this all works is documented anywhere - now would be a good time for a braindump.
pkg/kv/kvserver/client_merge_test.go, line 585 at r4 (raw file):
// leader-leaseholder state and lock down all communication between the // two _except_ for forwarded proposal from the leaseholder to the // leader. This allowed the leaseholder to make proposals, even though
allows
pkg/kv/kvserver/client_merge_test.go, line 588 at r4 (raw file):
// it won't be able to hear their result. Because this is such a fragile // state, we enter it as late as possible - after the merge begins and // only upon receiving the merge's EndTxn request.
Oof, it's amazing that you got this done, but is there an alternative approach that's less susceptible to a bus factor of 1 and random breakage? We initiate the merge without any kind of partition, but then when the leaseholder wants to apply the merge trigger, we remove its replica instead. Then we lean back and wait for it to reappear, which necessarily must happen via a snapshot. That seems a lot more straightforward, the possible wrinkle being that "remove it instead of applying a command" might need some work, but I think Andrew's "replica can't exist in-mem across replicaID changes" work maybe helped?
pkg/kv/kvserver/replica_raftstorage.go, line 1001 at r4 (raw file):
// we missed the application of a merge) and we are the new leaseholder, we // make sure to update the timestamp cache using the prior read summary to // account for any reads that were served on the right-hand side range(s).
💯
pkg/kv/kvserver/replica_tscache.go, line 553 at r4 (raw file):
// will be closed next under the current lease. This will be easier to // think through with the new closed timestamp system. sum.Merge(rspb.FromTimestamp(maxClosed))
Agree, this seems impossible to get right with closed timestamps as we have them today. Doesn't this mean that during lease application, we always need to take the pessimistic update to the tsCache where we forward to newLease.Start?
pkg/kv/kvserver/store_merge.go, line 93 at r4 (raw file):
// timestamps in the timestamp cache. For a full discussion, see the // comment on TestStoreRangeMergeTimestampCacheCausality. s.Clock().Update(freezeStart)
And we don't have to update with anything from rightReadSum
because it can't contain anything ahead of freezeStart
. That seems like a nice assertion to throw in?
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 163 at r4 (raw file):
EndKey: keys.MakeRangeIDReplicatedPrefix(mt.RightDesc.RangeID).PrefixEnd(), }) // Merges the prior read summary from the RHS over to the LHS,
Merge incorporate the prior read summary from the RHS into the LHS',
Another thing to explain: we don't latch the RHS read summary because this is already provided to us with the merge trigger (it comes out of subsume request). And why is this different from the abort span, which is copied over inline in the merge trigger itself?
Something else that should be clarified is which part of the summary is optimization. For example, could we replace the summary by freezeStart throughout and things would be correct (similar to how in lease transfers we could fall back to the new lease start time)? Both seem like good assertions to add btw since we expect there to always be a relation between the summary and the respective fallback.
pkg/kv/kvserver/batcheval/cmd_lease.go, line 132 at r4 (raw file):
This is a totally standard migration, but still good to spell it out?
We elide this step in mixed-version clusters as old nodes would ignore the PriorReadSummary field (they don't know about it). It's possible that in this particular case we could get away with it (as the in-mem field only ever updates in-mem state) but it's easy to get things wrong (in which case they could easily take a catastrophic turn) and the benefit is low.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 92 at r4 (raw file):
// reads at future times above the proposed lease start time. // // We can remove this in the future when we increase the resolution of read
Ah, ok. This answers my comment on GetCurrentReadSummary
. Please drop a hint in that method so that the next reader's alarm bells don't go off :-)
pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 176 at r4 (raw file):
// 2. it can transfer information about reads with synthetic timestamps, which // are not otherwise captured by the new lease's start time. kv.kvserver.readsummary.ReadSummary prior_read_summary = 22;
Mention that if this field is set, this proposal also persists the same summary under its replicated key.
The pattern here is almost silly btw. The replicated key is in the proposal that just got applied, so we could just put a bool here and read it from disk. I know it doesn't matter either way, but the way it is now you always worry a tiny but about ways in which you could have this field set and the thing on disk be different. OTOH with a bool you'd worry about cases where the bool is (not) set and the key has (not) changed, so still the same. I guess assertStateLocked
takes care of it.
pkg/kv/kvserver/readsummary/rspb/summary.go, line 31 at r4 (raw file):
// Clone performs a deep-copy of the receiver. func (c ReadSummary) Clone() *ReadSummary {
Doesn't this field have the usual proto pointer thingies? I guess that is fine, but is the copy really "deep"?
pkg/kv/kvserver/readsummary/rspb/summary.proto, line 24 at r4 (raw file):
// long as the resulting timestamp for each key does not regress. However, the // timestamp of a key is allowed to advance as precision drops. This parallels a // similar ratchetting policy in the timestamp cache (tscache.Cache).
If it's "crocheting" as my spellcheck claims it is, it's probably "ratcheting" too
pkg/kv/kvserver/readsummary/rspb/summary.proto, line 26 at r4 (raw file):
// similar ratchetting policy in the timestamp cache (tscache.Cache). // // For example, a high-resolution version of the summary may look like:
🏅 loving the illustration
pkg/roachpb/data.proto, line 168 at r4 (raw file):
// high. // 2. it can transfer information about reads with synthetic timestamps, which // are not otherwise captured by the FreezeStart clock timestamp.
It's nice that this has the same comment, but it might be more ergonomical to be terse here and refer the reader to SubsumeResponse. That way, if the latter gets better, the reader won't assume they've learned all there is to know here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once you fix the freeze time <-> summary interaction
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)
pkg/keys/printer_test.go, line 77 at r4 (raw file):
{keys.RaftTruncatedStateKey(roachpb.RangeID(1000001)), "/Local/RangeID/1000001/u/RaftTruncatedState", revertSupportUnknown}, {keys.RangeLeaseKey(roachpb.RangeID(1000001)), "/Local/RangeID/1000001/r/RangeLease", revertSupportUnknown}, {keys.RangePriorReadSummaryKey(roachpb.RangeID(1000001)), "/Local/RangeID/1000001/r/RangePriorReadSummary", revertSupportUnknown},
is revertSupportUnknown
indeed what you want?
pkg/kv/kvserver/replica_tscache.go, line 553 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Agree, this seems impossible to get right with closed timestamps as we have them today. Doesn't this mean that during lease application, we always need to take the pessimistic update to the tsCache where we forward to newLease.Start?
yeah, which I think we should do by having this method return a null before a cluster version. That cluster version would be my PR, but perhaps you want to introduce a dummy cluster version in this patch.
pkg/kv/kvserver/batcheval/cmd_subsume.go, line 155 at r4 (raw file):
reply.FreezeStart = cArgs.EvalCtx.Clock().NowAsClockTimestamp() sum := cArgs.EvalCtx.GetCurrentReadSummary() reply.ReadSummary = &sum
forward to the freeze time like we discussed
When you do, comment that leaving the summary nil
is not an option because we need it for the snapshot case.
pkg/kv/kvserver/batcheval/eval_context.go, line 106 at r4 (raw file):
// return a meaningful summary if the caller has serialized with all other // requests on the range. GetCurrentReadSummary() rspb.ReadSummary
is there a way to assert the latches? Can this take a latchGuard
?
pkg/kv/kvserver/readsummary/persist.go, line 24 at r4 (raw file):
) // Load loads the range's prior read summary.
Say that it returns nil if there's no summary.
pkg/kv/kvserver/readsummary/rspb/summary.proto, line 22 at r4 (raw file):
// summary in that it may not represent these reads with perfect precision. // Instead, it is allowed to lose resolution in exchange for reduced space, as // long as the resulting timestamp for each key does not regress. However, the
I don't understand the "however"; the second sentence seems to just clarify the previous one
pkg/kv/kvserver/readsummary/rspb/summary.proto, line 26 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
🏅 loving the illustration
progressive JPEG
pkg/roachpb/api.proto, line 1659 at r4 (raw file):
// being subsumed). It is suitable for use as the timestamp cache's low water // mark for the keys previously owned by the subsumed range. util.hlc.Timestamp freeze_start = 5 [(gogoproto.nullable) = false,
put the TODO saying that this is going away here too pls
7e27ce8
to
08287cb
Compare
08287cb
to
0e2cd4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @benesch, and @tbg)
pkg/clusterversion/cockroach_versions.go, line 239 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
missing word (that?)
Done.
pkg/keys/printer_test.go, line 77 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
is
revertSupportUnknown
indeed what you want?
Yes, like all of these range-id local keys.
pkg/kv/kvserver/client_lease_test.go, line 444 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
missing word (checks?)
Done.
pkg/kv/kvserver/client_merge_test.go, line 388 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
collocated
Done.
pkg/kv/kvserver/client_merge_test.go, line 416 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
.
Done.
pkg/kv/kvserver/client_merge_test.go, line 420 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you document what these issues are?
Done.
pkg/kv/kvserver/client_merge_test.go, line 556 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Random comment, this comment merge_queue.go seems wrong:
cockroach/pkg/kv/kvserver/merge_queue.go
Lines 39 to 41 in fa32402
// The current implementation of merges requires rewriting the right-hand data // onto the left-hand range, even when the ranges are collocated. This is // expensive, so limit to one merge at a time.
Done.
pkg/kv/kvserver/client_merge_test.go, line 580 at r4 (raw file):
I thought we make sure the LHS and RHS lease were collocated, but don't think this true.
No, this is not true. The leases can be on different nodes.
I understand that once we have the SubsumeRequest applied on all RHS followers, the RHS lease won't change (unless the merge aborts).
I don't think this is quite true either. The SubsumeRequest lands on a RHS leaseholder, but doesn't get propagated to followers (it doesn't go through Raft). If the RHS leaseholder fails, a new lease can be acquired, in which case the new leaseholder will notice the deletion intent on the local descriptor and remain "subsumed" - see maybeWatchForMergeLocked
. This is actually related to the deadlock we found in #60905.
But what about the LHS? Nothing in (*Replica).AdminMerge or merge_queue.go seems to care. Sure, the merge queue checks for the lease, but it seems easily possible for the merge queue to check the lease, then the lease to move somewhere else, and then we run AdminMerge on a replica that's not the leaseholder, and I think it all should still work, right?
AdminMerge
can run on a non-leaseholder, but the actual EndTxn
with the MergeTrigger
will always go through the leaseholder.
I don't think (?) that how this all works is documented anywhere - now would be a good time for a braindump.
It is, actually! @benesch's parting gift was to leave us with docs/tech-notes/range-merges.md
, which I've found useful on multiple occasions.
pkg/kv/kvserver/client_merge_test.go, line 585 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
allows
Done.
pkg/kv/kvserver/client_merge_test.go, line 588 at r4 (raw file):
we remove its replica instead
What do you mean by this? I don't think we'll be able to run any ChangeReplicas operations because doing so requires touching the range descriptor, which is already covered by the merge's intent.
pkg/kv/kvserver/replica_tscache.go, line 553 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
yeah, which I think we should do by having this method return a null before a cluster version. That cluster version would be my PR, but perhaps you want to introduce a dummy cluster version in this patch.
Done.
pkg/kv/kvserver/store_merge.go, line 93 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
And we don't have to update with anything from
rightReadSum
because it can't contain anything ahead offreezeStart
. That seems like a nice assertion to throw in?
It can contain reads ahead of freezeStart
now that we can perform reads in the future of present-time. But those reads must all be at synthetic timestamps.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 163 at r4 (raw file):
Merge incorporate the prior read summary from the RHS into the LHS',
Done.
Another thing to explain: we don't latch the RHS read summary because this is already provided to us with the merge trigger (it comes out of subsume request). And why is this different from the abort span, which is copied over inline in the merge trigger itself?
Done. Wow, I didn't realize that's how we copy over the abort span.
Something else that should be clarified is which part of the summary is optimization. For example, could we replace the summary by freezeStart throughout and things would be correct (similar to how in lease transfers we could fall back to the new lease start time)? Both seem like good assertions to add btw since we expect there to always be a relation between the summary and the respective fallback.
This isn't quite true once we start talking about future time operations.
pkg/kv/kvserver/batcheval/cmd_lease.go, line 132 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a totally standard migration, but still good to spell it out?
We elide this step in mixed-version clusters as old nodes would ignore the PriorReadSummary field (they don't know about it). It's possible that in this particular case we could get away with it (as the in-mem field only ever updates in-mem state) but it's easy to get things wrong (in which case they could easily take a catastrophic turn) and the benefit is low.
Done.
pkg/kv/kvserver/batcheval/cmd_lease_transfer.go, line 92 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ah, ok. This answers my comment on
GetCurrentReadSummary
. Please drop a hint in that method so that the next reader's alarm bells don't go off :-)
Done.
pkg/kv/kvserver/batcheval/cmd_subsume.go, line 155 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
forward to the freeze time like we discussed
When you do, comment that leaving the summarynil
is not an option because we need it for the snapshot case.
Done.
pkg/kv/kvserver/batcheval/eval_context.go, line 106 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
is there a way to assert the latches? Can this take a
latchGuard
?
Yeah, that's what we do in SpanSetReplicaEvalContext
.
pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 176 at r4 (raw file):
Done. The idea here was to allow a compressed version of the summary to be persisted and a high-resolution version of the summary to be passed through the ReplicatedEvalResult
. That will allow for more efficient lease transfers (with respect to the timestamp cache) in the common case where a leaseholder applies the lease transfer / merge through the log without needing to risk storing large summaries indefinitely on a range in case the leaseholder applies the lease transfer / merge through a snapshot.
I added this to the comment.
When a ReadSummary is set in a ReplicatedEvalResult, there is always also a
write to the RangePriorReadSummaryKey in the RaftCommand.WriteBatch. The
persisted summary may be identical to the summary in this field, but it
does not have to be. Notably, we intended for the summary included in the
ReplicatedEvalResult to eventually be a much higher-resolution version of
the ReadSummmary than the version persisted. This scheme of persisting a
compressed ReadSummary indefinitely and including a higher-resolution
ReadSummary on the RaftCommand allows us to optimize for the common case
where the lease transfer is applied on the new leaseholder through Raft log
application while ensuring correctness in the case where the lease transfer
is applied on the new leaseholder through a Raft snapshot.
pkg/kv/kvserver/readsummary/persist.go, line 24 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Say that it returns nil if there's no summary.
Done.
pkg/kv/kvserver/readsummary/rspb/summary.go, line 31 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Doesn't this field have the usual proto pointer thingies? I guess that is fine, but is the copy really "deep"?
There aren't any pointers in ReadSummary
at the moment, so it is deep, but you're right that this could easily rot. Added a comment.
pkg/kv/kvserver/readsummary/rspb/summary.proto, line 22 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't understand the "however"; the second sentence seems to just clarify the previous one
Done.
pkg/kv/kvserver/readsummary/rspb/summary.proto, line 24 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If it's "crocheting" as my spellcheck claims it is, it's probably "ratcheting" too
Done.
pkg/roachpb/api.proto, line 1659 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
put the TODO saying that this is going away here too pls
Done.
pkg/roachpb/data.proto, line 168 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's nice that this has the same comment, but it might be more ergonomical to be terse here and refer the reader to SubsumeResponse. That way, if the latter gets better, the reader won't assume they've learned all there is to know here.
I added a bit more to this comment that is specific to MergeTriggers as they pass through Raft, so I'll keep this split for now, though I agree with everything you said in principle.
0e2cd4e
to
9ce9703
Compare
bors r+ |
bors r- |
Canceled. |
9ce9703
to
7ee1283
Compare
bors r+ |
bors r- We keep conflicting on updates to |
Canceled. |
7ee1283
to
a5d3e34
Compare
bors r+ |
Fixes cockroachdb#57688. Fixes cockroachdb#59679. Fixes cockroachdb#60520. This commit introduces new logic to ship summaries of a leaseholders timestamp cache through lease transfers and range merges. For lease transfers, the read summary is sent from the outgoing leaseholder to the incoming leaseholder. For range merges, the read summary is sent from the right-hand side leaseholder (through the SubsumeResponse), to the left-hand side leaseholder (through the MergeTrigger). The read summaries perform the role of the lease start time and merge freeze time used to play for lease transfers and range merges, respectively - the summaries instruct the post-operation leaseholder on how to update its timestamp cache to ensure that no future writes are allowed to invalidate prior reads. Read summaries have two distinct advantages over the old approach: 1. they can transfer a higher-resolution snapshot of the reads on the range through a lease transfer, to make the lease transfers less disruptive to writes because the timestamp cache won't be bumped as high. This avoids transaction aborts and retries after lease transfers and merges. 2. they can transfer information about reads with synthetic timestamps, which are not otherwise captured by the new lease's start time. Because of this, they are needed for correctness on `global_read` ranges, which can serve reads in the future. This commit does not realize the first benefit, because it uses very low-resolution read summaries. However, it sets up the infrastructure that will allow us to realize the benefit in the future by capturing and shipping higher-resolution read summaries. The commit does realize the second benefit, as it fixes correctness issues around future time reads. ---- The commit also fixes a related bug that was revealed during the development of this patch. As explained in cockroachdb#60520, it was possible for a range merge to be applied to the leaseholder of the LHS of the merge through a Raft snapshot. In such cases, we were not properly updating the leaseholder's timestamp cache to reflect the reads served on the RHS range. This could allow the post-merged range to invalidate reads served by the pre-merge RHS range. This commit fixes this bug using the new read summary infrastructure. Merge triggers now write to the left-hand side's prior read summary with a read summary gathered from the right-hand side during subsumption. Later, upon ingesting a Raft snapshot, we check if we subsumed any replicas and if we are the leaseholder. If both of those conditions are true, we forward the replica's timestamp cache to the read summary on the range. Since this read summary must have been updated by the merge trigger, it will include all reads served on the pre-merge RHS range. ---- Release note (bug fix): Fixes a very rare, possible impossible in practice, bug where a range merge that applied through a Raft snapshot on the left-hand side range's leaseholder could allow that leaseholder to serve writes that invalidated reads from before the merge on the right-hand side. Release justification: bug fix
a5d3e34
to
a7472e3
Compare
bors r+ |
Build succeeded: |
Fixes #57688.
Fixes #59679.
Fixes #60520.
This commit introduces new logic to ship summaries of a leaseholders timestamp cache through lease transfers and range merges. For lease transfers, the read summary is sent from the outgoing leaseholder to the incoming leaseholder. For range merges, the read summary is sent from the right-hand side leaseholder (through the SubsumeResponse), to the left-hand side leaseholder (through the MergeTrigger).
The read summaries perform the role of the lease start time and merge freeze time used to play for lease transfers and range merges, respectively - the summaries instruct the post-operation leaseholder on how to update its timestamp cache to ensure that no future writes are allowed to invalidate prior reads.
Read summaries have two distinct advantages over the old approach:
global_read
ranges, which can serve reads in the future.This commit does not realize the first benefit, because it uses very low-resolution read summaries. However, it sets up the infrastructure that will allow us to realize the benefit in the future by capturing and shipping higher-resolution read summaries. The commit does realize the second benefit, as it fixes correctness issues around future time reads.
The commit also fixes a related bug that was revealed during the development of this patch. As explained in #60520, it was possible for a range merge to be applied to the leaseholder of the LHS of the merge through a Raft snapshot. In such cases, we were not properly updating the leaseholder's timestamp cache to reflect the reads served on the RHS range. This could allow the post-merged range to invalidate reads served by the pre-merge RHS range.
This commit fixes this bug using the new read summary infrastructure. Merge triggers now write to the left-hand side's prior read summary with a read summary gathered from the right-hand side during subsumption. Later, upon ingesting a Raft snapshot, we check if we subsumed any replicas and if we are the leaseholder. If both of those conditions are true, we forward the replica's timestamp cache to the read summary on the range. Since this read summary must have been updated by the merge trigger, it will include all reads served on the pre-merge RHS range.
The existence of this bug was verified by a new variant of
TestStoreRangeMergeTimestampCache
, which only passes with the rest of this commit.Release note (bug fix): Fixes a very rare, possible impossible in practice, bug where a range merge that applied through a Raft snapshot on the left-hand side range's leaseholder could allow that leaseholder to serve writes that invalidated reads from before the merge on the right-hand side.
Release justification: bug fix