RFC: follower reads #26362
Conversation
LGTM. Reviewed 1 of 1 files at r1.

docs/RFCS/20180603_follower_reads.md, line 61 at r1 (raw file):

docs/RFCS/20180603_follower_reads.md, line 94 at r1 (raw file):
s/If the replica is not the leaseholder/If the CT conditions are not met/

docs/RFCS/20180603_follower_reads.md, line 164 at r1 (raw file):
I'd phrase this differently, to make it clear that this is a consequence of an optimization instead of a fundamental property of the system: As an optimization, updated closed timestamps are propagated to quiesced replicas lazily, as reads are attempted. This includes validation of the CT heartbeat's liveness epoch. As a result, if the liveness epoch is incremented before any reads have been served, we must discard the closed timestamp since we cannot tell whether the heartbeat was valid when it was received.
docs/RFCS/20180603_follower_reads.md, line 14 at r1 (raw file):

docs/RFCS/20180603_follower_reads.md, line 52 at r1 (raw file):
Would be nice to quickly elaborate on what these guarantees/properties are (e.g. monotonically increasing, not bound to a term, etc.).

docs/RFCS/20180603_follower_reads.md, line 83 at r1 (raw file):
Should we take estimated clock offset into this equation as well? It's possible that the SQL gateway's clock is ahead of the replica by some delta and so even though it expects that the request will be beneath the range's closed timestamp, it isn't yet. An incorrect guess like this isn't a correctness issue, it just forces us to redirect to the leaseholder, so we don't need to worry about the clock max offset. Still, taking the estimated clock offset between the gateway and the replica's node into account should help us avoid these redirects. It's also possible that the raft heartbeat interval is such an overestimate that this never matters in practice.

We could also look at this another way. Contacting the nearest replica will almost always be significantly cheaper than contacting the leaseholder. Taken to its extreme, that makes one of these redirects due to an incorrect guess essentially free. Does this mean that we should be more aggressive than waiting out the entire raft heartbeat interval?

docs/RFCS/20180603_follower_reads.md, line 91 at r1 (raw file):
I'm realizing now that this means we can't use follower reads to improve range lookups. That's a shame.

docs/RFCS/20180603_follower_reads.md, line 374 at r1 (raw file):
started before the CT?

docs/RFCS/20180603_follower_reads.md, line 538 at r1 (raw file):
Decreasing this duration will be problematic because clients will expect it to be in effect immediately when determining whether to send to followers while Ranges will necessarily take longer to react to the change (at least
docs/RFCS/20180603_follower_reads.md, line 83 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think in practice clients that want follower reads will (should) be very conservative in their timestamp choices, perhaps using double the closed timestamp interval to ensure that they're almost certain to get a follower read, instead of trying to get as close as possible to the threshold.
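For concreteness, here is a minimal sketch of the gateway-side guess being discussed. All names (`canUseFollowerRead`, `closedTimestampInterval`) and the factor of two are invented for illustration; this is not what the RFC prescribes.

```go
package main

import "time"

// closedTimestampInterval stands in for the setting that controls how far the
// closed timestamp trails real time (name and value assumed for illustration).
const closedTimestampInterval = 5 * time.Second

// conservativeFactor mirrors the suggestion above: target roughly double the
// closed timestamp interval rather than the exact threshold.
const conservativeFactor = 2

// canUseFollowerRead guesses whether a read at readTS can be served by a
// follower. estClockOffset is the estimated offset between the gateway's and
// the follower's clocks; a wrong guess only costs a redirect to the leaseholder.
func canUseFollowerRead(now, readTS time.Time, estClockOffset time.Duration) bool {
	threshold := now.Add(-conservativeFactor*closedTimestampInterval - estClockOffset)
	return readTS.Before(threshold)
}

func main() {
	now := time.Now()
	_ = canUseFollowerRead(now, now.Add(-15*time.Second), 250*time.Millisecond)
}
```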
…mestamps

This change lays the groundwork for follower reads by adding the prototype implementation into our main codebase. Most of the logic is disabled by default. It is only exercised during specific unit tests or when running with a nonzero COCKROACH_CLOSED_TIMESTAMP_INTERVAL. The code contains several potential correctness anomalies and makes no attempt at handling lagging followers gracefully. It should not be used outside of testing and in fact should not be used at all (though we may want to write roachtests early).

The [follower reads RFC] and many TODOs in this code hint at the upcoming changes. Most prominently, known correctness gotchas will be addressed, but the role of quiescence and coalesced heartbeats will be untangled from the main proposal, which hopefully can clarify the code somewhat as well. In the meantime, the commit message below documents what is implemented here, even though it is subject to change:

Nodes send a closed timestamp with coalesced heartbeats. Receipt of a heartbeat from a node which is the leaseholder for the range means a closed timestamp can be trusted to apply to each follower replica which has committed at or over a min lease applied index, a new value supplied with coalesced heartbeats.

Nodes keep track of their "min proposal timestamp" (MinPropTS), which is an HLC timestamp. On every heartbeat, the MinPropTS is persisted locally to ensure monotonicity on node restart. At startup, a node reads the last persisted MinPropTS and, if necessary, forwards the HLC clock to the MinPropTS plus the max safe interval. Nodes check MinPropTS on command evaluation; a command's timestamp is forwarded if it is less than MinPropTS.

Things get more interesting when a range quiesces. Replicas of quiesced ranges no longer receive info on coalesced heartbeats. However, if a replica is quiesced, we can continue to rely on the most recent store-wide closed timestamp supplied with coalesced heartbeats, so long as the liveness epoch (reported with heartbeats) remains stable and no heartbeats are skipped. This can continue for as long as a range is quiesced, but requires that the leaseholder notifies all followers on the first heartbeat after a range is unquiesced.

Note there is no concern that on leaseholder change, the new leaseholder allows a write at an earlier timestamp than a previously reported closed timestamp. This is due to the low water timestamp in the timestamp cache being reset on leaseholder transfer to prevent rewriting history in general.

Release note: None

[follower reads RFC]: cockroachdb#26362
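As a rough illustration of the MinPropTS rule in this commit message ("a command's timestamp is forwarded if it is less than MinPropTS"), here is a sketch. The type and function names are invented and the real prototype differs; whether the boundary is inclusive is a detail this glosses over.

```go
package main

import "fmt"

// hlcTimestamp is a simplified stand-in for an HLC timestamp (wall time plus
// logical component), used only to illustrate the MinPropTS check.
type hlcTimestamp struct {
	WallNanos int64
	Logical   int32
}

func (t hlcTimestamp) Less(o hlcTimestamp) bool {
	return t.WallNanos < o.WallNanos ||
		(t.WallNanos == o.WallNanos && t.Logical < o.Logical)
}

// forwardToMinPropTS implements the rule sketched above: writes may never be
// proposed below the timestamp the store has already promised to followers.
func forwardToMinPropTS(cmdTS, minPropTS hlcTimestamp) hlcTimestamp {
	if cmdTS.Less(minPropTS) {
		return minPropTS
	}
	return cmdTS
}

func main() {
	minPropTS := hlcTimestamp{WallNanos: 1000, Logical: 0}
	fmt.Println(forwardToMinPropTS(hlcTimestamp{WallNanos: 900}, minPropTS))  // forwarded
	fmt.Println(forwardToMinPropTS(hlcTimestamp{WallNanos: 1200}, minPropTS)) // unchanged
}
```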
Maybe I'm missing something fundamental here, but I think we should make the following changes that drastically simplify the design (and impl):
This works as long as the node keeps its lease (and promises). If the node loses its lease, then either
Reminder to self: it's not enough to teach DistSender about the existence of followers that can serve historical reads, we also need to make changes in
The new proposal makes sense to me. I think the next step is to produce an updated version of the RFC with this proposal so we can think about it more directly (instead of thinking about changes from the original version to the new one).
As a corollary to this, nodes should never send closed timestamps that are higher than their confirmed liveness expiration. As long as leaseholders never send timestamps that might exceed their liveness expiration, the recipient-side check here is redundant.
Relying on the fact that there is a new MLAI pointing to a new lease seems subtle to me. I think it might be better to send an explicit flag indicating that the lease is being transferred away.
Incorporates the ideas in [this comment], amounting to a complete rewrite. Release note: None [this comment]: cockroachdb#26362 (comment)
@bdarnell rewrote the thing and pushed a big update, not sure if it even makes sense to look at it as a diff. It will need some more polish, but I hope the main ideas (and particularly correctness) are somewhat more intuitive now.
docs/RFCS/20180603_follower_reads.md, line 205 at r3 (raw file):
Isn't it at least two varints (rangeID and MLAI)? (Maybe a few more bytes if it's a proto

docs/RFCS/20180603_follower_reads.md, line 206 at r3 (raw file):
s/leaseholders/leased replicas per store/

docs/RFCS/20180603_follower_reads.md, line 217 at r3 (raw file):
Is there supposed to be something else here?

docs/RFCS/20180603_follower_reads.md, line 222 at r3 (raw file):
Missing backtick after "Store"

docs/RFCS/20180603_follower_reads.md, line 231 at r3 (raw file):
s/high/low/

docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):
I don't think this is guaranteed today. The timestamp of a lease transfer is not carried in the request header so I think it would be possible (without additional safeguards) to propose a lease transfer that takes effect before the MPT.

docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):
s/cases/ceases/
Don't we need to take the greater of the MPTs of the two ranges or something like that?

docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):
Sending the full heartbeat would have similar impact to disabling quiescence, which we've shown to have significant performance impact.
TFTR @bdarnell. This RFC is now entering the final comment period.

docs/RFCS/20180603_follower_reads.md, line 14 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 52 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 61 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 83 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 91 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 94 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 164 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I added a new section about checking the leases and wove that facet in there, with an example.

docs/RFCS/20180603_follower_reads.md, line 374 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Reworded and removed that particular wording (you can see intents before and after).

docs/RFCS/20180603_follower_reads.md, line 538 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Added comment.

docs/RFCS/20180603_follower_reads.md, line 205 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Fixed.

docs/RFCS/20180603_follower_reads.md, line 206 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I'm not sure that the term "lease replica" is clear enough. Went with "leaseholder replicas".

docs/RFCS/20180603_follower_reads.md, line 217 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
A discussion of not including ranges here. Moved elsewhere.

docs/RFCS/20180603_follower_reads.md, line 222 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 231 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This section was copied in from the original tech note, but I think the argument isn't necessary any more. For leases that were taken over, the start is after the expiration. For transfers, the old leaseholder won't propose anything above the proposed ts itself, and it will force the followers to see the lease transfer (via MLAI).

docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):
Done.
Don't we need to take the greater of the MPTs of the two ranges or something like that?
no, the merge command acts like a write that absorbs all of the data formerly on the RHS.

docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):
Did we ever pin the perf impact on the network i/o? If it's not network I/O but something in-process, then the situation for follower reads could be different (basically because we can incrementally update the state locally as described in the next paragraph).
docs/RFCS/20180603_follower_reads.md, line 53 at r4 (raw file):
"the these" choose one :)

docs/RFCS/20180603_follower_reads.md, line 68 at r4 (raw file):
s/information when/information only when/

docs/RFCS/20180603_follower_reads.md, line 85 at r4 (raw file):
below or equal to? Did this change?

docs/RFCS/20180603_follower_reads.md, line 114 at r4 (raw file):
The current code doesn't throw away existing MLAIs. Instead, it discards new ones until the old one is resolved (achieved? reached? what's the word for this?). There's a trade-off here that depends on whether the range expects continual load or expects the load to eventually stop. Overwriting existing MLAIs ensures that we won't miss the high-water mark of closed timestamps after load stops on a range and the follower fully catches up. However, overwriting existing MLAIs might also mean that we never move a follower's closed timestamp forward because the MLAI is continually overwritten before we get to it. Picture a finish line in a race that keeps moving forward such that it's never reachable. This is the same problem I was discussing in the fourth comment of #21056 (comment), and there are pretty simple structures we could use with tunable space upper bounds to optimize for both scenarios.

docs/RFCS/20180603_follower_reads.md, line 140 at r4 (raw file):
I still think we should move to a more reactive approach to closing out timestamps on followers and move this off the read path. A timestamp can be closed out in exactly two places:
- when the CT update is received
- when the LAI is bumped beneath Raft
Moving the updates to these two locations is easier to understand in my opinion (see the sketch after these comments). The read path can then grab a read-lock on the CT state to check whether follower reads are possible without needing to worry about updating it. Also, FWIW we'll need this for RangeFeeds.

docs/RFCS/20180603_follower_reads.md, line 212 at r4 (raw file):
s/an/a/

docs/RFCS/20180603_follower_reads.md, line 226 at r4 (raw file):
major nit: extra space after "be"

docs/RFCS/20180603_follower_reads.md, line 235 at r4 (raw file):
It's interesting to make this analogy. Doing so helped me when I was getting familiar with the initial proposal of follower reads. Do note though that actually doing this would be terribly inefficient with the current skiplist implementation of the timestamp cache because it doesn't collapse adjacent spans with equal timestamps.

docs/RFCS/20180603_follower_reads.md, line 254 at r4 (raw file):
nit: through 15 and the ones being evaluated will receive 16 and 17

docs/RFCS/20180603_follower_reads.md, line 288 at r4 (raw file):
below or equal to?

docs/RFCS/20180603_follower_reads.md, line 490 at r4 (raw file):
all intents that can commit. Aborted intents are fine to ignore.
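The sketch referenced in the line 140 comment above: a toy model of the "reactive" approach, where the closed timestamp is advanced in exactly two places (CT update received, LAI applied beneath Raft) and the read path only takes a read lock. All names and the reduction of timestamps to integers are invented for illustration.

```go
package main

import "sync"

// ctState is a toy model of per-replica closed timestamp state.
type ctState struct {
	mu          sync.RWMutex
	appliedLAI  int64 // highest lease applied index this replica has applied
	pendingTS   int64 // candidate closed timestamp (simplified to an int64)
	pendingMLAI int64 // LAI that must be reached before pendingTS is usable
	closedTS    int64 // timestamp at or below which follower reads are safe
}

// onCTUpdate is called when a closed timestamp update arrives from the
// leaseholder's store.
func (s *ctState) onCTUpdate(ts, mlai int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pendingTS, s.pendingMLAI = ts, mlai
	if s.appliedLAI >= mlai && ts > s.closedTS {
		s.closedTS = ts
	}
}

// onLAIApplied is called beneath Raft when the replica applies a command.
func (s *ctState) onLAIApplied(lai int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.appliedLAI = lai
	if lai >= s.pendingMLAI && s.pendingTS > s.closedTS {
		s.closedTS = s.pendingTS
	}
}

// canServeFollowerRead only reads the state; no updates happen on this path.
func (s *ctState) canServeFollowerRead(readTS int64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return readTS <= s.closedTS
}

func main() {
	var s ctState
	s.onCTUpdate(100, 5)
	s.onLAIApplied(5)
	_ = s.canServeFollowerRead(99) // true once LAI 5 is applied
}
```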
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale)
docs/RFCS/20180603_follower_reads.md, line 205 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Fixed.
Done.
docs/RFCS/20180603_follower_reads.md, line 206 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
I'm not sure that the term "lease replica" is clear enough. Went with "leaseholder replicas".
Done.
docs/RFCS/20180603_follower_reads.md, line 217 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
A discussion of not including ranges here. Moved elsewhere.
Done.
docs/RFCS/20180603_follower_reads.md, line 53 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"the these" choose one :)
Done.
docs/RFCS/20180603_follower_reads.md, line 68 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/information when/information only when/
Done.
docs/RFCS/20180603_follower_reads.md, line 85 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
below or equal to? Did this change?
Should be <=, but I had missed this one. Done.
docs/RFCS/20180603_follower_reads.md, line 114 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
overwriting existing *MLAI*s
The current code doesn't throw away existing MLAIs. Instead, it discards new ones until the old one is resolved (achieved? reached? what's the word for this?). There's a trade-off here that depends on whether the range expects continual load or expects the load to eventually stop. Overwriting existing MLAIs ensures that we won't miss the high-water mark of closed timestamps after load stops on a range and the follower fully catches up. However, overwriting existing MLAIs might also mean that we never move a follower's closed timestamp forward because the MLAI is continually overwritten before we get to it. Picture a finish line in a race that keeps moving forward such that it's never reachable.
This is the same problem I was discussing in the fourth comment of #21056 (comment), and there are pretty simple structures we could use with tunable space upper bounds to optimize for both scenarios.
To reflect our conversation, I added a "Further work" section below which touches on these points and also mentions RangeFeeds.
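As a toy illustration of the "simple structures with tunable space upper bounds" idea mentioned above (entirely hypothetical; field names invented): keep a small bounded buffer of (closed timestamp, MLAI) pairs per range, retaining the oldest unresolved pair so something eventually closes even under continuous load, while also keeping the newest pair so the high-water mark isn't lost once load stops.

```go
package main

type pair struct{ closedTS, mlai int64 }

type pendingCTs struct {
	cap   int
	pairs []pair // ordered oldest to newest
}

// add records a new candidate pair. When full, it overwrites the most recent
// entry rather than dropping the oldest, so the nearest target stays reachable.
func (p *pendingCTs) add(ts, mlai int64) {
	if len(p.pairs) < p.cap {
		p.pairs = append(p.pairs, pair{ts, mlai})
		return
	}
	p.pairs[len(p.pairs)-1] = pair{ts, mlai}
}

// resolve returns the highest closed timestamp whose MLAI the replica has
// reached, dropping entries as they are consumed.
func (p *pendingCTs) resolve(appliedLAI int64) (closed int64, ok bool) {
	for len(p.pairs) > 0 && p.pairs[0].mlai <= appliedLAI {
		closed, ok = p.pairs[0].closedTS, true
		p.pairs = p.pairs[1:]
	}
	return closed, ok
}

func main() {
	p := &pendingCTs{cap: 2}
	p.add(100, 5)
	p.add(110, 9)
	p.add(120, 12) // overwrites (110, 9) since the buffer is full
	_, _ = p.resolve(5)
}
```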
docs/RFCS/20180603_follower_reads.md, line 140 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I still think we should move to a more reactive approach to closing out timestamps on followers and move this off the read path. A timestamp can be closed out in exactly two places:
- when the CT update is received
- when the LAI is bumped beneath Raft
Moving the updates to these two locations is easier to understand in my opinion. The read path can then grab a read-lock on the CT state to check whether follower reads are possible without needing to worry about updating it.
Also, FWIW we'll need this for RangeFeeds.
This isn't exactly mentioned in the "Further work" section, but what is mentioned is that there is a spectrum of ways to address this that includes your suggestion.
docs/RFCS/20180603_follower_reads.md, line 212 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/an/a/
Done.
docs/RFCS/20180603_follower_reads.md, line 226 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
major nit: extra space after "be"
Done.
docs/RFCS/20180603_follower_reads.md, line 235 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It's interesting to make this analogy. Doing so helped me when I was getting familiar with the initial proposal of follower reads. Do note though that actually doing this would be terribly inefficient with the current skiplist implementation of the timestamp cache because it doesn't collapse adjacent spans with equal timestamps.
Done.
docs/RFCS/20180603_follower_reads.md, line 288 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
below or equal to?
Done.
docs/RFCS/20180603_follower_reads.md, line 490 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
all intents that can commit. Aborted intents are fine to ignore.
An intent by definition has PENDING status, and resolving commits or removes it. There's no such thing as an aborted intent (it's simply the absence of an intent). Maybe I'm misunderstanding your suggestion?
Implement the minimum proposal timestamp tracker outlined in the [follower reads RFC]. The implementation will likely need follow-up work to reduce lock contention. The intended usage will place a call to Track and to the returned closure into the write path. Touches cockroachdb#16593. Release note: None [follower reads RFC]: cockroachdb#26362
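For a sense of the intended call pattern around Track and the returned closure, here is a heavily simplified, single-bucket sketch. The names and semantics are illustrative only; the real tracker is more refined (e.g. it avoids stalling on long-running proposals).

```go
package main

import "sync"

type tracker struct {
	mu       sync.Mutex
	minProp  int64           // proposals must be at or above this timestamp
	inFlight int             // proposals tracked but not yet released
	mlai     map[int64]int64 // rangeID -> max lease applied index seen
}

func newTracker(initialTS int64) *tracker {
	return &tracker{minProp: initialTS, mlai: map[int64]int64{}}
}

// Track returns the current minimum proposal timestamp and a closure that the
// write path must call with the range's resulting lease applied index.
func (t *tracker) Track(rangeID int64) (int64, func(lai int64)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight++
	return t.minProp, func(lai int64) {
		t.mu.Lock()
		defer t.mu.Unlock()
		if lai > t.mlai[rangeID] {
			t.mlai[rangeID] = lai
		}
		t.inFlight--
	}
}

// Close attempts to close out the current minimum proposal timestamp, raising
// it to next. It only succeeds once no tracked proposals remain in flight.
func (t *tracker) Close(next int64) (closed int64, mlais map[int64]int64, ok bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight > 0 {
		return 0, nil, false
	}
	closed, t.minProp = t.minProp, next
	mlais, t.mlai = t.mlai, map[int64]int64{}
	return closed, mlais, true
}

func main() {
	tr := newTracker(100)
	ts, done := tr.Track(1) // write path: propose at or above ts
	_ = ts
	done(17) // proposal applied at LAI 17
	tr.Close(200)
}
```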
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale)
docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
This section was copied in from the original tech note, but I think the argument isn't necessary any more. For leases that were taken over, the start is after the expiration. For transfers, the old leaseholder won't propose anything above the proposed ts itself, and it will force the followers to see the lease transfer (via MLAI).
I think we still need some discussion and possible safeguards around transfers.
Lease transfers are always proposed at the current time, and timestamps are closed when they are in the past, so a situation in which a lease transfer is proposed at a timestamp prior to the current MPT is unlikely, but I think it is still theoretically possible because lease proposals go through a slightly different path and don't (IIRC) have their timestamp in the header so they wouldn't be affected by the usual rules that enforce the MPT.
A conflict in the other direction is more plausible: A lease transfer is proposed at time T1 and LAI 1. It applies, and then we close timestamp T2 with MLAI 1 (even guaranteeing this much may require some additional checks, as above: since the lease transfer timestamp is not in the header, we must make sure that proposing a transfer will prevent older timestamps from closing). But applying LAI1 is not sufficient to serve reads at T2: you need a closed timestamp from the new leaseholder.
docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
s/cases/ceases/
Done.
Don't we need to take the greater of the MPTs of the two ranges or something like that?
no, the merge command acts like a write that absorbs all of the data formerly on the RHS.
If the RHS has a greater closed timestamp than the LHS, I think we have a problem on merge similar to lease transfers. The RHS's promise not to serve writes below some time must be propagated to the merged range. I think if the GetSnapshotForMerge operation counts as a write this probably happens naturally, but I'm not sure that that's the case. Anything that's not a normal txn has a chance of messing this up.
docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Did we ever pin the perf impact on the network i/o? If it's not network I/O but something in-process, then the situation for follower reads could be different (basically because we can incrementally update the state locally as described in the next paragraph).
No, we never pinned it on the network I/O (and in fact it looks like the CPU cost is likely more significant).
So to be clear, is this paragraph arguing that it might be better to send full updates every time than to spend more CPU tracking which replicas are up to date (and potentially more expensive recovery from missed heartbeats)?
docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):
But this strategy can miss necessary updates as leases get transferred to otherwise inactive ranges. To guard against these rare cases, the second strategy serves as a fallback.
What is the mechanism for this fallback? +1 to having a fallback instead of "either everything's perfect or it all falls apart" (which we're too prone to), but this is a late addition that I think deserves more discussion.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale)
docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):
Lease transfers are always proposed at the current time, and timestamps are closed when they are in the past, so a situation in which a lease transfer is proposed at a timestamp prior to the current MPT is unlikely, but I think it is still theoretically possible because lease proposals go through a slightly different path and don't (IIRC) have their timestamp in the header so they wouldn't be affected by the usual rules that enforce the MPT.
I think this is mentioned in the "implied guarantees" section, though I made sure to mention that the guarantee is something we need to make sure holds:
- the origin store won't (ever) initiate a lease transfer that would allow
another node to write at or below the closed timestamp. In other words, in the
case of a lease transfer the next lease would start at a timestamp greater than
the closed timestamp. This is likely impossible in practice since the transfer
timestamps and proposed closed timestamps are taken from the same hybrid logical
clock, but an explicit safeguard will be added just in case.
A conflict in the other direction is more plausible: A lease transfer is proposed at time T1 and LAI 1. It applies, and then we close timestamp T2 with MLAI 1 (even guaranteeing this much may require some additional checks, as above: since the lease transfer timestamp is not in the header, we must make sure that proposing a transfer will prevent older timestamps from closing). But applying LAI1 is not sufficient to serve reads at T2: you need a closed timestamp from the new leaseholder.
I could see your point if we threw all updates from various stores into the same pot and then tried to serve correct reads from that, but the proof always involves the lease known to the replica trying to serve the follower read. Let's say replica1 transfers its lease to replica2 with a start timestamp of 100 at LAI 33, and the closed timestamp (on replica1) is 99 with an MLAI of 32.
While the transfer is in progress and even when the new leaseholder has started serving reads, it's safe for the old one to use (99@32) for follower reads (forever). Now replica1's store closes out a higher timestamp, say (101@33). Note how this has to contain the lease transfer's LAI. You can't use this until caught up to 33. But at 33, your lease is for replica2, and so you're going to have to look for its closed timestamp state instead to prove that you can serve.
Stores only send out information for epochs they know they are live in. But they don't try to send only heartbeats for replicas that they hold the lease for at the closed timestamp. That check is much easier to do on the receiving end.
If you think there is a way to throw all updates into a pot, please let me know. That would be interesting for the recovery use case: right now we'll have to persist a few past leases or we won't be able to look at a data directory including persisted CT state and figure out which reads are possible.
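To spell out the recipient-side reasoning in the example above (99@32 versus the lease transfer at 100@33), here is a hedged sketch of the check. All names are invented and timestamps are reduced to integers; it is not the RFC's actual data structures.

```go
package main

type ctUpdate struct {
	originStoreID int64
	epoch         int64
	closedTS      int64
	mlai          int64 // min lease applied index for this range
}

type replicaState struct {
	appliedLAI   int64
	leaseStoreID int64 // store holding the lease as of the applied state
	leaseEpoch   int64
}

// canServeFollowerRead returns true if the replica can serve a read at readTS
// using the given closed timestamp update.
func canServeFollowerRead(readTS int64, u ctUpdate, r replicaState) bool {
	if readTS > u.closedTS {
		return false
	}
	if r.appliedLAI < u.mlai {
		return false // not caught up to the state the CT promises
	}
	// The lease check happens here, once the MLAI has been reached: the lease
	// visible at this applied state must belong to the update's origin store
	// and epoch, otherwise the promise may have been superseded by a transfer.
	return r.leaseStoreID == u.originStoreID && r.leaseEpoch == u.epoch
}

func main() {
	u := ctUpdate{originStoreID: 1, epoch: 3, closedTS: 99, mlai: 32}
	r := replicaState{appliedLAI: 32, leaseStoreID: 1, leaseEpoch: 3}
	_ = canServeFollowerRead(95, u, r) // true in this example
}
```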
docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
If the RHS has a greater closed timestamp than the LHS, I think we have a problem on merge similar to lease transfers. The RHS's promise not to serve writes below some time must be propagated to the merged range. I think if the GetSnapshotForMerge operation counts as a write this probably happens naturally, but I'm not sure that that's the case. Anything that's not a normal txn has a chance of messing this up.
That's a really good point, I was stuck thinking about the old ways with LHS and RHS' leaseholders colocated. I updated the description. @nvanbenschoten, initially I'm planning to use the timestamp cache here as it's the simplest way of solving this without introducing additional structure. Do you think the cost of doing this relative to the merge operation as a whole is going to matter right off the bat?
docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
No, we never pinned it on the network I/O (and in fact it looks like the CPU cost is likely more significant).
So to be clear, is this paragraph arguing that it might be better to send full updates every time than to spend more CPU tracking which replicas are up to date (and potentially more expensive recovery from missed heartbeats)?
Each node itself also has to keep track of its own aggregated update (well, it doesn't strictly have to, but it makes a lot of sense that it would keep the information it gives to other nodes itself as well, and it's required for recovery). Once you have that, sending the full state every time becomes cheap iff the bandwidth is cheap enough.
I don't think tracking which replicas are up to date is very expensive. We send a full state initially and then we don't really track anything, we just blindly relay updates (and react to requests by peers).
Since we're expecting payloads of multiple megabytes with very very large clusters (or small max_size) and possibly a relatively frequent update cadence (CDC), I think it's not reasonable to send the full state all the time, especially since the implementation doesn't seem that different. But for many clusters it would be perfectly fine and it would make things a little easier, for example by obviating the second update mechanism and reducing the chances of introducing correctness bugs regarding inactive ranges.
Not sure what to do with this section: I personally would just remove it.
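For a rough back-of-envelope on the payload sizes mentioned above (numbers assumed purely for illustration, not measured):

```go
package main

import "fmt"

func main() {
	// Assume ~10 bytes per entry (two varints for rangeID and MLAI plus some
	// proto overhead) and 500,000 leaseholder replicas on a store: a full
	// update is then roughly 5 MB, which is why incremental updates matter
	// for very large clusters.
	const bytesPerEntry = 10
	const leaseholderReplicas = 500_000
	fmt.Printf("full update ≈ %d MB\n", bytesPerEntry*leaseholderReplicas/1_000_000)
}
```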
docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
What is the mechanism for this fallback? +1 to having a fallback instead of "either everything's perfect or it all falls apart" (which we're too prone to), but this is a late addition that I think deserves more discussion.
The communication between origin and recipient store is bidirectional, so the recipient store just asks for additional RangeIDs and receives MLAIs in the next update. I added a paragraph.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale)
docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Lease transfers are always proposed at the current time, and timestamps are closed when they are in the past, so a situation in which a lease transfer is proposed at a timestamp prior to the current MPT is unlikely, but I think it is still theoretically possible because lease proposals go through a slightly different path and don't (IIRC) have their timestamp in the header so they wouldn't be affected by the usual rules that enforce the MPT.
I think this is mentioned in the "implied guarantees" section, though I made sure to mention that the guarantee is something we need to make sure holds:
- the origin store won't (ever) initiate a lease transfer that would allow
another node to write at or below the closed timestamp. In other words, in the
case of a lease transfer the next lease would start at a timestamp greater than
the closed timestamp. This is likely impossible in practice since the transfer
timestamps and proposed closed timestamps are taken from the same hybrid logical
clock, but an explicit safeguard will be added just in case.
A conflict in the other direction is more plausible: A lease transfer is proposed at time T1 and LAI 1. It applies, and then we close timestamp T2 with MLAI 1 (even guaranteeing this much may require some additional checks, as above: since the lease transfer timestamp is not in the header, we must make sure that proposing a transfer will prevent older timestamps from closing). But applying LAI1 is not sufficient to serve reads at T2: you need a closed timestamp from the new leaseholder.
I could see your point if we threw all updates from various stores into the same pot and then tried to serve correct reads from that, but the proof always involves the lease known to the replica trying to serve the follower read. Let's say replica1 transfers its lease to replica2 with a start timestamp of 100 at LAI 33, and the closed timestamp (on replica1) is 99 with an MLAI of 32.
While the transfer is in progress and even when the new leaseholder has started serving reads, it's safe for the old one to use (99@32) for follower reads (forever). Now replica1's store closes out a higher timestamp, say (101@33). Note how this has to contain the lease transfer's LAI. You can't use this until caught up to 33. But at 33, your lease is for replica2, and so you're going to have to look for its closed timestamp state instead to prove that you can serve.
Stores only send out information for epochs they know they are live in. But they don't try to send only heartbeats for replicas that they hold the lease for at the closed timestamp. That check is much easier to do on the receiving end.
If you think there is a way to throw all updates into a pot, please let me know. That would be interesting for the recovery use case: right now we'll have to persist a few past leases or we won't be able to look at a data directory including persisted CT state and figure out which reads are possible.
Ack, SGTM. It might be worth mentioning that the lease validity check is performed when the MLAI is reached, not when the CT is received.
docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Each node itself also has to keep track of its own aggregated update (well, it doesn't strictly have to, but it makes a lot of sense that it would keep the information it gives to other nodes itself as well, and it's required for recovery). Once you have that, sending the full state every time becomes cheap iff the bandwidth is cheap enough.
I don't think tracking which replicas are up to date is very expensive. We send a full state initially and then we don't really track anything, we just blindly relay updates (and react to requests by peers).
Since we're expecting payloads of multiple megabytes with very very large clusters (or small max_size) and possibly a relatively frequent update cadence (CDC), I think it's not reasonable to send the full state all the time, especially since the implementation doesn't seem that different. But for many clusters it would be perfectly fine and it would make things a little easier, for example by obviating the second update mechanism and reducing the chances of introducing correctness bugs regarding inactive ranges.
Not sure what to do with this section: I personally would just remove it.
I'm fine with removing this section.
docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
The communication between origin and recipient store is bidirectional, so the recipient store just asks for additional RangeIDs and receives MLAIs in the next update. I added a paragraph.
OK, so this is a new RPC instead of piggybacking on raft heartbeats?
Follower reads are consistent reads at historical timestamps from follower replicas. They make the non-leader replicas in a range suitable sources for historical reads. Historical reads include both `AS OF SYSTEM TIME` queries as well as transactions with a read timestamp sufficiently in the past (for example long-running analytics queries).

The key enabling technology is the exchange of **closed timestamp updates** between stores. A closed timestamp update (*CT update*) is a store-wide timestamp together with (sparse) per-range information on the Raft progress. Follower replicas use local state built up from successive received CT updates to ascertain that they have all the state necessary to serve consistent reads at and below the leaseholder store's closed timestamp.

Follower reads are only possible for epoch-based leases, which includes all user ranges but excludes some system ranges (such as the addressing metadata ranges).

Release note: None
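For orientation, a closed timestamp update as summarized above might carry roughly the following fields. This is an illustrative sketch only, not the actual protobuf definition; all names are invented.

```go
package main

type ctUpdateMsg struct {
	OriginStoreID int64
	Epoch         int64           // liveness epoch the origin store is live in
	ClosedTS      int64           // store-wide closed timestamp (simplified)
	MLAIs         map[int64]int64 // rangeID -> min lease applied index
	Full          bool            // full snapshot vs. incremental delta
}

func main() {
	_ = ctUpdateMsg{
		OriginStoreID: 1,
		Epoch:         3,
		ClosedTS:      100,
		MLAIs:         map[int64]int64{17: 42}, // only ranges with new activity
		Full:          false,
	}
}
```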
bors r=bdarnell,nvanbenschoten
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale)
docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Ack, SGTM. It might be worth mentioning that the lease validity check is performed when the MLAI is reached, not when the CT is received.
Done.
docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I'm fine with removing this section.
Done.
docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
OK, so this is a new RPC instead of piggybacking on raft heartbeats?
Yep. Trying to introduce most of the code in separate packages and interface it cleanly with the storage package where it needs to touch. See #27595.
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf

NB: this is extracted from #21056; please don't add new commentary on the tech note there.

----

Follower reads are consistent reads at historical timestamps from follower replicas. They make the non-leader replicas in a range suitable sources for historical reads.

The key enabling technology is the propagation of **closed timestamp heartbeats** from the range leaseholder to followers. The closed timestamp heartbeat (CT heartbeat) is more than just a timestamp. It is a set of conditions that, if met, guarantee that a follower replica has all state necessary to satisfy reads at or before the CT.

Consistent historical reads are useful for analytics queries and in particular allow such queries to be carried out more efficiently and, with appropriate configuration, away from foreground traffic. But historical reads are also key to a proposal for [reference-like tables](#26301) aimed at cutting down on foreign key check latencies particularly in geo-distributed clusters; they help recover a reasonably recent consistent snapshot of a cluster after a loss of quorum; and they are one of the ingredients for [Change Data Capture](#25229).

Release note: None

27699: storage: fix stopper race in compactor r=petermattis a=tschottdorf

Starting workers without a surrounding task is unfortunately often not the right thing to do when the worker accesses other state that might become invalidated once the stopper begins to stop. In this particular case, the compactor might end up accessing the engine even though it had already been closed. I wasn't able to repro this failure in the first place, but pretty sure this:

Fixes #27232.

Release note: None

27704: issues: fix email fallback r=petermattis a=tschottdorf

This was not my email address.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded
NB: this is extracted from #21056; please don't add new commentary on the
tech note there.
Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.
The key enabling technology is the propagation of closed timestamp
heartbeats from the range leaseholder to followers. The closed timestamp
heartbeat (CT heartbeat) is more than just a timestamp. It is a set of
conditions that, if met, guarantee that a follower replica has all state
necessary to satisfy reads at or before the CT.
Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for reference-like tables aimed at cutting
down on foreign key check latencies particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for Change Data
Capture.
Release note: None