RFC: follower reads #26362

Merged: 1 commit merged into cockroachdb:master on Jul 18, 2018

Conversation

@tbg (Member) commented Jun 4, 2018

NB: this is extracted from #21056; please don't add new commentary on the
tech note there.


Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.

The key enabling technology is the propagation of **closed timestamp
heartbeats** from the range leaseholder to followers. The closed timestamp
heartbeat (CT heartbeat) is more than just a timestamp. It is a set of
conditions that, if met, guarantee that a follower replica has all the state
necessary to satisfy reads at or before the CT.
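
To make the shape of those conditions concrete, here is a minimal Go sketch of the follower-side check. The types and names (`CTUpdate`, `canServeFollowerRead`, integer timestamps) are hypothetical simplifications for illustration, not CockroachDB's actual code:

```go
package main

import "fmt"

// CTUpdate is a hypothetical, simplified closed timestamp update: a
// store-wide closed timestamp plus, per range, the minimum Lease Applied
// Index (MLAI) the follower must have reached before the timestamp applies.
type CTUpdate struct {
	Epoch    int64            // liveness epoch the origin store claims
	ClosedTS int64            // simplified timestamp (the RFC uses HLC timestamps)
	MLAI     map[int64]uint64 // RangeID -> min lease applied index
}

// canServeFollowerRead checks the CT conditions for one replica: the lease
// epoch must match, the read must be at or below the closed timestamp, and
// the replica must have applied at least the MLAI for its range.
func canServeFollowerRead(u CTUpdate, leaseEpoch, rangeID int64, appliedLAI uint64, readTS int64) bool {
	if leaseEpoch != u.Epoch {
		return false
	}
	if readTS > u.ClosedTS {
		return false
	}
	mlai, ok := u.MLAI[rangeID]
	if !ok {
		// No MLAI known for this range: conceptually the MLAI is infinity,
		// so the read must be redirected to the leaseholder.
		return false
	}
	return appliedLAI >= mlai
}

func main() {
	u := CTUpdate{Epoch: 3, ClosedTS: 100, MLAI: map[int64]uint64{7: 42}}
	fmt.Println(canServeFollowerRead(u, 3, 7, 45, 95)) // true
	fmt.Println(canServeFollowerRead(u, 3, 7, 40, 95)) // false: not caught up
}
```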

Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for [reference-like tables](#26301) aimed at cutting
down on foreign key check latencies particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for
[Change Data Capture](#25229).

Release note: None

@tbg tbg requested a review from a team as a code owner June 4, 2018 16:06

@cockroach-teamcity (Member)

This change is Reviewable

@bdarnell (Contributor) commented Jun 7, 2018

LGTM


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved, all commit checks successful.


docs/RFCS/20180603_follower_reads.md, line 61 at r1 (raw file):

safe timestamp to the followers.

Since a node can house on the order of 50.000 replicas, it's untenable to

s/./,/


docs/RFCS/20180603_follower_reads.md, line 94 at r1 (raw file):

do not check for clock uncertainty restarts.

If the replica is not the leaseholder, it returns a suitable error and

s/If the replica is not the leaseholder/If the CT conditions are not met/


docs/RFCS/20180603_follower_reads.md, line 164 at r1 (raw file):

prevents these scenarios.

Quiesced replicas which do not serve any reads before the liveness

I'd phrase this differently, to make it clear that this is a consequence of an optimization instead of a fundamental property of the system:

As an optimization, updated closed timestamps are propagated to quiesced replicas lazily, as reads are attempted. This includes validation of the CT heartbeat's liveness epoch. As a result, if the liveness epoch is incremented before any reads have been served, we must discard the closed timestamp since we cannot tell whether the heartbeat was valid when it was received.


Comments from Reviewable

@nvanbenschoten (Member)

:lgtm:


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


docs/RFCS/20180603_follower_reads.md, line 14 at r1 (raw file):

historical reads.

The key enabling technology is the propagation of a **closed timestamp

s/of a/of/


docs/RFCS/20180603_follower_reads.md, line 52 at r1 (raw file):

Raft-internal reproposals and reorderings. In a sense, the Raft Applied Index
is much more intuitive than the log position, and we need to use it here due
to the stronger guarantees it provides.

Would be nice to quickly elaborate on what these guarantees/properties are (e.g. monotonically increasing, not bound to a term, etc.).


docs/RFCS/20180603_follower_reads.md, line 83 at r1 (raw file):

straightforward. At the distributed sender, if a read is for a historical
timestamp earlier than the current time less a target duration (equal to the
target closed timestamp interval plus the raft heartbeat interval), it is sent

Should we take estimated clock offset into this equation as well? It's possible that the SQL gateway's clock is ahead of the replica by some delta and so even though it expects that the request will be beneath the range's closed timestamp, it isn't yet. An incorrect guess like this isn't a correctness issue, it just forces us to redirect to the leaseholder, so we don't need to worry about the clock max offset. Still, taking the estimated clock offset between the gateway and the replica's node into account should help us avoid these redirects. It's also possible that the raft heartbeat interval is such an overestimate that this never matters in practice.

We could also look at this another way. Contacting the nearest replica will almost always be significantly cheaper than contacting the leaseholder. Taken to its extreme, that makes one of these redirects due to an incorrect guess essentially free. Does this mean that we should be more aggressive than waiting out the entire raft heartbeat interval?
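
As a rough illustration of the routing heuristic under discussion, here is a sketch of the gateway-side decision. The constants and the name `canUseFollower` are assumptions for illustration, not real cluster settings or DistSender code:

```go
package main

import (
	"fmt"
	"time"
)

// Assumed configuration values for illustration only.
const (
	closedTimestampInterval = 5 * time.Second        // target closed timestamp trailing interval
	raftHeartbeatInterval   = 500 * time.Millisecond // propagation slack
	estimatedClockOffset    = 250 * time.Millisecond // optional extra slack discussed above
)

// canUseFollower guesses whether a historical read at readTS is likely to be
// below the range's closed timestamp and can therefore be sent to the
// nearest replica. A wrong guess is not a correctness problem; it just
// results in a redirect to the leaseholder.
func canUseFollower(now, readTS time.Time) bool {
	threshold := now.Add(-(closedTimestampInterval + raftHeartbeatInterval + estimatedClockOffset))
	return !readTS.After(threshold)
}

func main() {
	now := time.Now()
	fmt.Println(canUseFollower(now, now.Add(-10*time.Second))) // true: comfortably historical
	fmt.Println(canUseFollower(now, now.Add(-1*time.Second)))  // false: go to the leaseholder
}
```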


docs/RFCS/20180603_follower_reads.md, line 91 at r1 (raw file):

met. This avoids forwarding to the leaseholder and avoids updating the
timestamp cache. Note that closed timestamps and follower reads are
supported only on ranges with epoch-based leases, and that follower reads

I'm realizing now that this means we can't use follower reads to improve range lookups. That's a shame.


docs/RFCS/20180603_follower_reads.md, line 374 at r1 (raw file):

Similarly, this does not impede the usefulness of the CT mechanism for
recovery: the restored consistent state may contain intents that belong
to transactions that started after the CT; however, they will simply be

started before the CT?


docs/RFCS/20180603_follower_reads.md, line 538 at r1 (raw file):

For now, the min proposal timestamp roughly trails real time by five seconds.
This can be made configurable, for example via a cluster setting or, if more
granularity is required, via zone configs.

Decreasing this duration will be problematic because clients will expect it to be in effect immediately when determining whether to send to followers, while Ranges will necessarily take longer to react to the change (at least oldDur - newDur). Without any extra logic, this will lead to a spike in incorrect follower read attempts and subsequent redirects. We could avoid this by maintaining a history of the setting's value and examining it on the client. This isn't worth worrying about at this time.


Comments from Reviewable

@bdarnell (Contributor)

Review status: :shipit: complete! 1 of 0 LGTMs obtained


docs/RFCS/20180603_follower_reads.md, line 83 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we take estimated clock offset into this equation as well? It's possible that the SQL gateway's clock is ahead of the replica by some delta and so even though it expects that the request will be beneath the range's closed timestamp, it isn't yet. An incorrect guess like this isn't a correctness issue, it just forces us to redirect to the leaseholder, so we don't need to worry about the clock max offset. Still, taking the estimated clock offset between the gateway and the replica's node into account should help us avoid these redirects. It's also possible that the raft heartbeat interval is such an overestimate that this never matters in practice.

We could also look at this another way. Contacting the nearest replica will almost always be significantly cheaper than contacting the leaseholder. Taken to its extreme, that makes one of these redirects due to an incorrect guess essentially free. Does this mean that we should be more aggressive than waiting out the entire raft heartbeat interval?

I think in practice clients that want follower reads will (should) be very conservative in their timestamp choices, perhaps using double the closed timestamp interval to ensure that they're almost certain to get a follower read, instead of trying to get as close as possible to the threshold.
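
For example, a client following this advice might compute its read timestamp as in the sketch below (hypothetical helper and an assumed 5s interval) rather than aiming just past the threshold:

```go
package main

import (
	"fmt"
	"time"
)

// conservativeFollowerReadTS picks a read timestamp two closed timestamp
// intervals in the past, as suggested above, so the request almost certainly
// lands under the follower's closed timestamp on the first try.
// (closedTimestampInterval is an assumed parameter, not a real setting name.)
func conservativeFollowerReadTS(now time.Time, closedTimestampInterval time.Duration) time.Time {
	return now.Add(-2 * closedTimestampInterval)
}

func main() {
	fmt.Println(conservativeFollowerReadTS(time.Now(), 5*time.Second))
}
```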


Comments from Reviewable

tbg added a commit to spencerkimball/cockroach that referenced this pull request Jun 20, 2018
…mestamps

This change lays the groundwork for follower reads by adding the
prototype implementation into our main codebase.

Most of the logic is disabled by default. It is only exercised during
specific unit tests or when running with a nonzero
COCKROACH_CLOSED_TIMESTAMP_INTERVAL.

The code contains several potential correctness anomalies and makes
no attempt at handling lagging followers gracefully. It should not
be used outside of testing and in fact should not be used at all
(though we may want to write roachtests early).

The [follower reads RFC] and many TODOs in this code hint at the
upcoming changes. Most prominently, known correctness gotchas will be
addressed, but the role of quiescence and coalesced heartbeats will be
untangled from the main proposal, which hopefully can clarify the code
somewhat as well. In the meantime, the commit message below documents
what is implemented here, even though it is subject to change:

Nodes send a closed timestamp with coalesced heartbeats. Receipt of
a heartbeat from a node which is the leaseholder for the range means
a closed timestamp can be trusted to apply to each follower replica
which has committed at or over a min lease applied index, a new value
supplied with coalesced heartbeats.

Nodes keep track of their "min proposal timestamp" (MinPropTS), which
is an HLC timestamp. On every heartbeat, the MinPropTS is persisted
locally to ensure monotonicity on node restart. At startup, a node
reads the last persisted MinPropTS, and forwards the HLC clock to the
MPT timestamp + max safe interval if necessary. Nodes check MinPropTS
on command evaluation; a command's timestamp is forwarded if less than
MinPropTS.

Things get more interesting when a range quiesces. Replicas of quiesced
ranges no longer receive info on coalesced heartbeats.  However, if a
replica is quiesced, we can continue to rely on the most recent
store-wide closed timestamp supplied with coalesced heartbeats, so long
as the liveness epoch (reported with heartbeats) remains stable and no
heartbeats are skipped. This can continue for as long as a range is
quiesced, but requires that the leaseholder notifies all followers on
the first heartbeat after a range is unquiesced.

Note there is no concern that on leaseholder change, the new leaseholder
allows a write at an earlier timestamp than a previously reported closed
timestamp. This is due to the low water timestamp in the timestamp cache
being reset on leaseholder transfer to prevent rewriting history in
general.

Release note: None

[follower reads RFC]: cockroachdb#26362
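
A minimal sketch of the two MinPropTS rules described in the commit message above (restoring the persisted value at startup and forwarding late proposal timestamps), using simplified integer timestamps and made-up names:

```go
package main

import "fmt"

// minPropTracker is a hypothetical, simplified stand-in for the node-wide
// minimum proposal timestamp (MinPropTS) described above.
type minPropTracker struct {
	minPropTS int64
}

// onStartup restores the persisted MinPropTS and forwards the HLC clock past
// it (plus a safety interval) so that no new proposal can be assigned a
// timestamp below a previously closed one.
func (t *minPropTracker) onStartup(persistedMinPropTS, safeInterval int64, clock *int64) {
	t.minPropTS = persistedMinPropTS
	if *clock < persistedMinPropTS+safeInterval {
		*clock = persistedMinPropTS + safeInterval
	}
}

// forwardProposalTS is applied at command evaluation: a proposal whose
// timestamp is below MinPropTS is forwarded to MinPropTS.
func (t *minPropTracker) forwardProposalTS(proposalTS int64) int64 {
	if proposalTS < t.minPropTS {
		return t.minPropTS
	}
	return proposalTS
}

func main() {
	var clock int64 = 90
	tr := &minPropTracker{}
	tr.onStartup(100, 10, &clock)
	fmt.Println(clock)                    // 110: clock forwarded past persisted MinPropTS
	fmt.Println(tr.forwardProposalTS(95)) // 100: late proposal is forwarded
}
```
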

@tbg (Member, Author) commented Jun 22, 2018

Maybe I'm missing something fundamental here, but I think we should make the following changes that drastically simplify the design (and impl):

  1. rename "quiescent" to "frozen" when referring to closed timestamps (for
    clarification; the concept actually vanishes later in this list)
  2. on receipt of a CT update, check whether the origin store's gossiped liveness has the claimed epoch and an expiration strictly greater than the closed timestamp (note that it doesn't matter if the origin store is still live at some "current" timestamp). If not, the update is treated as a missed sequence number (we could tell the origin store about this in the future).
  3. ranges don't explicitly freeze, but they are basically always frozen when an MLAI for them is received. That is, a range which is mentioned in one update and not in a subsequent one is frozen in that latter update. In effect, all ranges are frozen all the time. When a range is explicitly mentioned, then it's just to increase the MLAI. (After an update is missed, or initially before updates are received, conceptually the MLAI is infinity, and a follower read attempt goes to the leaseholder triggering the transmission of an MLAI).
  4. the promise about unfreezing ranges simply becomes the promise to not omit MLAIs when they are necessary, which is already a promise required without the concept of freezing.
  5. there are no separate code paths for frozen/unfrozen ranges. In effect, the concept of freezing goes away.
  6. there is no state that sits on the replica (at least there doesn't have to be, we should probably still keep a confirmed timestamp); the closed timestamp status is just a view into a store-level map keyed on the origin store whose values get updated (or invalidated) on receipt of store heartbeats. To serve a follower read, the follower's lease epoch must match the closed timestamp epoch and the desired read timestamp must be less than or equal to the closed timestamp.

This works as long as the node keeps its lease (and promises). If the node loses its lease, then either

  • it loses it because it became non-live. The earliest conflicting leaseholder's timestamp is after any liveness expiration observed from the original leaseholder, so the check in 2) prevents stale reads.
  • it transfers the lease away. A lease transfer is different from a lease request in that it requires a valid lease applied index. So if a leaseholder proposes a lease transfer that has any chance of succeeding, it also increments the MLAI in the process. That means that a closed timestamp sent by the original leaseholder (which must always be less than the timestamp of any attempted lease transfer) for any timestamp that the new leaseholder would be responsible for also forces the follower to first apply the transfer itself, at which point it knows about the new lease.
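
To make point 2 of the list above concrete, here is a sketch of the recipient-side liveness check; the types and integer timestamps are hypothetical simplifications:

```go
package main

import "fmt"

// liveness is a simplified stand-in for a gossiped node liveness record.
type liveness struct {
	Epoch      int64
	Expiration int64
}

// acceptCTUpdate implements the check described in point 2: the update is
// only usable if the origin store's gossiped liveness carries the claimed
// epoch and an expiration strictly greater than the closed timestamp.
// Otherwise the update is treated like a missed sequence number.
func acceptCTUpdate(gossiped liveness, claimedEpoch, closedTS int64) bool {
	return gossiped.Epoch == claimedEpoch && gossiped.Expiration > closedTS
}

func main() {
	l := liveness{Epoch: 5, Expiration: 120}
	fmt.Println(acceptCTUpdate(l, 5, 100)) // true
	fmt.Println(acceptCTUpdate(l, 4, 100)) // false: stale epoch
	fmt.Println(acceptCTUpdate(l, 5, 120)) // false: expiration not strictly greater
}
```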

@tbg (Member, Author) commented Jun 25, 2018

Reminder to self: it's not enough to teach DistSender about the existence of followers that can serve historical reads; we also need to make changes in

func (it *spanResolverIterator) ReplicaInfo(ctx context.Context) (kv.ReplicaInfo, error) {

@bdarnell (Contributor)

The new proposal makes sense to me. I think the next step is to produce an updated version of the RFC with this proposal so we can think about it more directly (instead of thinking about changes from the original version to the new one).

it loses it because it became non-live. The earliest conflicting leaseholder's timestamp is after any liveness expiration observed from the original leaseholder, so the check in 2) prevents stale reads.

As a corollary to this, nodes should never send closed timestamps that are higher than their confirmed liveness expiration. As long as leaseholders never send timestamps that might exceed their liveness expiration, the recipient-side check here is redundant.

it transfers the lease away. A lease transfer is different from a lease request in that it requires a valid lease applied index. So if a leaseholder proposes a lease transfer that has any chance of succeeding, it also increments the MLAI in the process. That means that a closed timestamp sent by the original leaseholder (which must always be less than the timestamp of any attempted lease transfer) for any timestamp that the new leaseholder would be responsible for also forces the follower to first apply the transfer itself, at which point it knows about the new lease.

Relying on the fact that there is a new MLAI pointing to a new lease seems subtle to me. I think it might be better to send an explicit flag indicating that the lease is being transferred away.

@tbg tbg force-pushed the rfc/follower-reads branch from ac6c3da to 8368ee2 on June 28, 2018 14:10
tbg added a commit to tbg/cockroach that referenced this pull request Jun 28, 2018
Incorporates the ideas in [this comment], amounting to a complete rewrite.

Release note: None

[this comment]: cockroachdb#26362 (comment)

@tbg (Member, Author) commented Jun 28, 2018

@bdarnell rewrote the thing and pushed a big update; not sure if it even makes sense to look at it as a diff. It will need some more polish but I hope the main ideas (and particularly correctness) are somewhat more intuitive now.

@bdarnell (Contributor)

:lgtm:


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 205 at r3 (raw file):

We opt for the first mechanism because

1. the payload is essentially a varint for each range, which takes no more than

Isn't it at least two varints (rangeID and MLAI)? (Maybe a few more bytes if it's a proto map instead of two parallel arrays). Still workable.


docs/RFCS/20180603_follower_reads.md, line 206 at r3 (raw file):

1. the payload is essentially a varint for each range, which takes no more than
   10 bytes on the wire, adding up to a 500kb payload at 50000 leaseholders.

s/leaseholders/leased replicas per store/


docs/RFCS/20180603_follower_reads.md, line 217 at r3 (raw file):

If a range is not included in the initial update, it is because of one of the
following:

Is there supposed to be something else here?


docs/RFCS/20180603_follower_reads.md, line 222 at r3 (raw file):

To get in the right mindset of this, consider the simplified situation of a
`Store without any pending or (near) future write activity, that is, there are

Missing backtick after "Store"


docs/RFCS/20180603_follower_reads.md, line 231 at r3 (raw file):

2. Tracking an *MLAI* for each replica (for which the lease for the epoch is held).

The first requirement is roughly equivalent to bumping the high water mark of

s/high/low/


docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):

not). Such transfers necessarily imply that the expiration (or transfer
timestamp) of the old lease precedes the start of the new one, and so the MPT
of the new leaseholder is strictly larger than that of the old one.

I don't think this is guaranteed today. The timestamp of a lease transfer is not carried in the request header so I think it would be possible (without additional safeguards) to propose a lease transfer that takes effect before the MPT.


docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):

is added to the next round of outgoing updates.

For merges, no action is necessary as the right hand side cases to exist while

s/cases/ceases/

Don't we need to take the greater of the MPTs of the two ranges or something like that?


docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):

## Unresolved questions

Can we just send the full heartbeat every time? Per the computation above, we don't expect more than

Sending the full heartbeat would have similar impact to disabling quiescence, which we've shown to have significant performance impact.


Comments from Reviewable

@tbg tbg force-pushed the rfc/follower-reads branch from 8368ee2 to ba68743 on July 2, 2018 16:29

@tbg (Member, Author) commented Jul 2, 2018

TFTR @bdarnell.

This RFC is now entering the final comment period.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 14 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/of a/of/

Done.


docs/RFCS/20180603_follower_reads.md, line 52 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would be nice to quickly elaborate on what these guarantees/properties are (e.g. monotonically increasing, not bound to a term, etc.).

Done.


docs/RFCS/20180603_follower_reads.md, line 61 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/./,/

Done.


docs/RFCS/20180603_follower_reads.md, line 83 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think in practice clients that want follower reads will (should) be very conservative in their timestamp choices, perhaps using double the closed timestamp interval to ensure that they're almost certain to get a follower read, instead of trying to get as close as possible to the threshold.

Done.


docs/RFCS/20180603_follower_reads.md, line 91 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm realizing now that this means we can't use follower reads to improve range lookups. That's a shame.

Done.


docs/RFCS/20180603_follower_reads.md, line 94 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/If the replica is not the leaseholder/If the CT conditions are not met/

Done.


docs/RFCS/20180603_follower_reads.md, line 164 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd phrase this differently, to make it clear that this is a consequence of an optimization instead of a fundamental property of the system:

As an optimization, updated closed timestamps are propagated to quiesced replicas lazily, as reads are attempted. This includes validation of the CT heartbeat's liveness epoch. As a result, if the liveness epoch is incremented before any reads have been served, we must discard the closed timestamp since we cannot tell whether the heartbeat was valid when it was received.

I added a new section about checking the leases and wove that facet in there, with an example.


docs/RFCS/20180603_follower_reads.md, line 374 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

started before the CT?

Reworded and removed that particular wording (you can see intents before and after).


docs/RFCS/20180603_follower_reads.md, line 538 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Decreasing this duration will be problematic because clients will expect it to be in effect immediately when determining whether to send to followers, while Ranges will necessarily take longer to react to the change (at least oldDur - newDur). Without any extra logic, this will lead to a spike in incorrect follower read attempts and subsequent redirects. We could avoid this by maintaining a history of the setting's value and examining it on the client. This isn't worth worrying about at this time.

Added comment.


docs/RFCS/20180603_follower_reads.md, line 205 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Isn't it at least two varints (rangeID and MLAI)? (Maybe a few more bytes if it's a proto map instead of two parallel arrays). Still workable.

Fixed.


docs/RFCS/20180603_follower_reads.md, line 206 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/leaseholders/leased replicas per store/

I'm not sure that the term "lease replica" is clear enough. Went with "leaseholder replicas".


docs/RFCS/20180603_follower_reads.md, line 217 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is there supposed to be something else here?

A discussion of not including ranges here. Moved elsewhere.


docs/RFCS/20180603_follower_reads.md, line 222 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Missing backtick after "Store"

Done.


docs/RFCS/20180603_follower_reads.md, line 231 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/high/low/

Done.


docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think this is guaranteed today. The timestamp of a lease transfer is not carried in the request header so I think it would be possible (without additional safeguards) to propose a lease transfer that takes effect before the MPT.

This section was copied in from the original tech note, but I think the argument isn't necessary any more. For leases that were taken over, the start is after the expiration. For transfers, the old leaseholder won't propose anything above the proposed ts itself, and it will force the followers to see the lease transfer (via MLAI).


docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):

s/cases/ceases/

Done.

Don't we need to take the greater of the MPTs of the two ranges or something like that?

no, the merge command acts like a write that absorbs all of the data formerly on the RHS.


docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Sending the full heartbeat would have similar impact to disabling quiescence, which we've shown to have significant performance impact.

Did we ever pin the perf impact on the network i/o? If it's not network I/O but something in-process, then the situation for follower reads could be different (basically because we can incrementally update the state locally as described in the next paragraph).


Comments from Reviewable

@nvanbenschoten (Member)

Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 53 at r4 (raw file):

The closed timestamp mechanism provides between each pair of stores a regular
(on the order of seconds) exchange of information to that end. At a high level,
the these updates contain what one might intuitively expect:

"the these" choose one :)


docs/RFCS/20180603_follower_reads.md, line 68 at r4 (raw file):

associated closed timestamp on that replica.

Providing the information when there has been write activity on a given range

s/information when/information only when/


docs/RFCS/20180603_follower_reads.md, line 85 at r4 (raw file):

promise of the form

> no more proposals writing to timestamps below `T` are going to apply after log

below or equal to? Did this change?


docs/RFCS/20180603_follower_reads.md, line 114 at r4 (raw file):

overwriting existing *MLAI*s

The current code doesn't throw away existing MLAIs. Instead, it discards new ones until the old one is resolved (achieved? reached? what's the word for this?). There's a trade-off here that depends on whether the range expects continual load or expects the load to eventually stop. Overwriting existing MLAIs ensures that we won't miss the high-water mark of closed timestamps after load stops on a range and the follower fully catches up. However, overwriting existing MLAIs might also mean that we never move a follower's closed timestamp forward because the MLAI is continually overwritten before we get to it. Picture a finish line in a race that keeps moving forward such that it's never reachable.

This is the same problem I was discussing in the fourth comment of #21056 (comment), and there are pretty simple structures we could use with tunable space upper bounds to optimize for both scenarios.
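
A rough sketch of the "discard new MLAIs until the pending one is reached" behavior described here, with made-up names:

```go
package main

import "fmt"

// pendingMLAI tracks, per range, a single outstanding MLAI target. New
// targets are ignored until the current one has been reached, which is the
// behavior described above (as opposed to always overwriting, which risks a
// finish line that keeps moving out of reach).
type pendingMLAI struct {
	target uint64
	set    bool
}

// offer proposes a new MLAI target; it is dropped if one is still pending.
func (p *pendingMLAI) offer(mlai uint64) {
	if !p.set {
		p.target, p.set = mlai, true
	}
}

// applied reports whether the pending target (if any) has been reached and,
// if so, clears it so a future target can be accepted.
func (p *pendingMLAI) applied(lai uint64) bool {
	if p.set && lai >= p.target {
		p.set = false
		return true
	}
	return false
}

func main() {
	var p pendingMLAI
	p.offer(10)
	p.offer(20)                // ignored: 10 is still pending
	fmt.Println(p.applied(9))  // false
	fmt.Println(p.applied(12)) // true: target 10 reached
}
```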


docs/RFCS/20180603_follower_reads.md, line 140 at r4 (raw file):

1. looks up the CT state known for this node and epoch.
1. checks whether the read timestamp `T` is less than or equal to the closed timestamp.
1. checks whether its *Lease Applied Index* matches or exceeds the *MLAI* for the range (in the absence of an *MLAI*, this check fails by default).

I still think we should move to a more reactive approach to closing out timestamps on followers and move this off the read path. A timestamp can be closed out in exactly two places:

  1. when the CT update is received
  2. when the LAI is bumped beneath Raft

Moving the updates to these two locations is easier to understand in my opinion. The read path can then grab a read-lock on the CT state to check whether follower reads are possible without needing to worry about updating it.

Also, FWIW we'll need this for RangeFeeds.
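
As an illustration of what this reactive shape could look like, here is a sketch with made-up names and simplified timestamps; updates happen only in the two places named above, and the read path only takes a read lock:

```go
package main

import (
	"fmt"
	"sync"
)

// replicaCTState is a hypothetical per-replica view of the closed timestamp,
// updated only from the two places named above: receipt of a CT update and
// application of a command (LAI bump) beneath Raft.
type replicaCTState struct {
	mu          sync.RWMutex
	closedTS    int64 // highest timestamp known closed for this replica
	pendingTS   int64 // timestamp that becomes closed once pendingMLAI is reached
	pendingMLAI uint64
	appliedLAI  uint64
}

// onCTUpdate records a new (closed timestamp, MLAI) pair and resolves it
// immediately if the replica is already caught up.
func (s *replicaCTState) onCTUpdate(ts int64, mlai uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pendingTS, s.pendingMLAI = ts, mlai
	if s.appliedLAI >= s.pendingMLAI && s.pendingTS > s.closedTS {
		s.closedTS = s.pendingTS
	}
}

// onLAIApplied is called beneath Raft when a command applies.
func (s *replicaCTState) onLAIApplied(lai uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.appliedLAI = lai
	if s.appliedLAI >= s.pendingMLAI && s.pendingTS > s.closedTS {
		s.closedTS = s.pendingTS
	}
}

// canServe is the read path: a read-lock and a comparison, with no updates.
func (s *replicaCTState) canServe(readTS int64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return readTS <= s.closedTS
}

func main() {
	var s replicaCTState
	s.onCTUpdate(100, 7)
	fmt.Println(s.canServe(90)) // false: LAI 7 not yet applied
	s.onLAIApplied(7)
	fmt.Println(s.canServe(90)) // true
}
```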


docs/RFCS/20180603_follower_reads.md, line 212 at r4 (raw file):

   20 bytes on the wire, adding up to a 1mb payload at 50000 leaseholder replicas
   (but likely much less in practice).
   Even with 10x as many, an rare enough 10mb payload seems unproblematic,

s/an/a/


docs/RFCS/20180603_follower_reads.md, line 226 at r4 (raw file):

To get in the right mindset of this, consider the simplified situation of a
`Store` without any pending or (near) future write activity, that is, there are
(and will be ) no in-flight Raft proposals. Now, we want to send an initial CT

major nit: extra space after "be"


docs/RFCS/20180603_follower_reads.md, line 235 at r4 (raw file):

The first requirement is roughly equivalent to bumping the low water mark of
the timestamp cache to one logical tick above the desired closed timestamp.

It's interesting to make this analogy. Doing so helped me when I was getting familiar with the initial proposal of follower reads. Do note though that actually doing this would be terribly inefficient with the current skiplist implementation of the timestamp cache because it doesn't collapse adjacent spans with equal timestamps.


docs/RFCS/20180603_follower_reads.md, line 254 at r4 (raw file):

in-flight whereas another two have cleared the command queue but are still
evaluating. Presumably the in-flight proposals were assigned to *Lease Applied
Indexes* 13 through 16, and the ones being evaluated will receive 14 and 15

nit: through 15

and the ones being evaluated will receive 16 and 17


docs/RFCS/20180603_follower_reads.md, line 288 at r4 (raw file):

The MPT consists of the previously emitted closed timestamp (zero initially) and
a prospective next closed timestamp aptly named `next` (always strictly larger
than `closed`) below which new writes are not accepted. It also contains two ref

below or equal to?


docs/RFCS/20180603_follower_reads.md, line 490 at r4 (raw file):

Note that for the CDC use case, this closed timestamp mechanism is a necessary,
but not sufficient, solution. In particular, a CDC consumer must find (or track)
and resolve all intents at timestamps below a given closed timestamp first.

all intents that can commit. Aborted intents are fine to ignore.


Comments from Reviewable

@tbg tbg force-pushed the rfc/follower-reads branch from ba68743 to c72e434 on July 4, 2018 10:48

@tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 205 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Fixed.

Done.


docs/RFCS/20180603_follower_reads.md, line 206 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm not sure that the term "lease replica" is clear enough. Went with "leaseholder replicas".

Done.


docs/RFCS/20180603_follower_reads.md, line 217 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

A discussion of not including ranges here. Moved elsewhere.

Done.


docs/RFCS/20180603_follower_reads.md, line 53 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"the these" choose one :)

Done.


docs/RFCS/20180603_follower_reads.md, line 68 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/information when/information only when/

Done.


docs/RFCS/20180603_follower_reads.md, line 85 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

below or equal to? Did this change?

Should be <=, but I had missed this one. Done.


docs/RFCS/20180603_follower_reads.md, line 114 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
overwriting existing *MLAI*s

The current code doesn't throw away existing MLAIs. Instead, it discards new ones until the old one is resolved (achieved? reached? what's the word for this?). There's a trade-off here that depends on whether the range expects continual load or expects the load to eventually stop. Overwriting existing MLAIs ensures that we won't miss the high-water mark of closed timestamps after load stops on a range and the follower fully catches up. However, overwriting existing MLAIs might also mean that we never move a follower's closed timestamp forward because the MLAI is continually overwritten before we get to it. Picture a finish line in a race that keeps moving forward such that it's never reachable.

This is the same problem I was discussing in the fourth comment of #21056 (comment), and there are pretty simple structures we could use with tunable space upper bounds to optimize for both scenarios.

To reflect our conversation, I added a "Further work" section below which touches on these points and also mentions RangeFeeds.


docs/RFCS/20180603_follower_reads.md, line 140 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I still think we should move to a more reactive approach to closing out timestamps on followers and move this off the read path. A timestamp can be closed out in exactly two places:

  1. when the CT update is received
  2. when the LAI is bumped beneath Raft

Moving the updates to these two locations is easier to understand in my opinion. The read path can then grab a read-lock on the CT state to check whether follower reads are possible without needing to worry about updating it.

Also, FWIW we'll need this for RangeFeeds.

This isn't exactly mentioned in the "Further work" section, but what is mentioned is that there is a spectrum of ways to address this that includes your suggestion.


docs/RFCS/20180603_follower_reads.md, line 212 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/an/a/

Done.


docs/RFCS/20180603_follower_reads.md, line 226 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

major nit: extra space after "be"

Done.


docs/RFCS/20180603_follower_reads.md, line 235 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's interesting to make this analogy. Doing so helped me when I was getting familiar with the initial proposal of follower reads. Do note though that actually doing this would be terribly inefficient with the current skiplist implementation of the timestamp cache because it doesn't collapse adjacent spans with equal timestamps.

Done.


docs/RFCS/20180603_follower_reads.md, line 288 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

below or equal to?

Done.


docs/RFCS/20180603_follower_reads.md, line 490 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

all intents that can commit. Aborted intents are fine to ignore.

An intent by definition has PENDING status, and resolving it commits or removes it. There's no such thing as an aborted intent (it's simply the absence of an intent). Maybe I'm misunderstanding your suggestion?

tbg added a commit to tbg/cockroach that referenced this pull request Jul 5, 2018
Implement the minimum proposal timestamp tracker outlined in the
[follower reads RFC].

The implementation will likely need follow-up work to reduce lock
contention. The intended usage will place a call to Track and to
the returned closure into the write path.

Touches cockroachdb#16593.

Release note: None

[follower reads RFC]: cockroachdb#26362

@bdarnell (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This section was copied in from the original tech note, but I think the argument isn't necessary any more. For leases that were taken over, the start is after the expiration. For transfers, the old leaseholder won't propose anything above the proposed ts itself, and it will force the followers to see the lease transfer (via MLAI).

I think we still need some discussion and possible safeguards around transfers.

Lease transfers are always proposed at the current time, and timestamps are closed when they are in the past, so a situation in which a lease transfer is proposed at a timestamp prior to the current MPT is unlikely, but I think it is still theoretically possible because lease proposals go through a slightly different path and don't (IIRC) have their timestamp in the header so they wouldn't be affected by the usual rules that enforce the MPT.

A conflict in the other direction is more plausible: A lease transfer is proposed at time T1 and LAI 1. It applies, and then we close timestamp T2 with MLAI 1 (even guaranteeing this much may require some additional checks, as above: since the lease transfer timestamp is not in the header, we must make sure that proposing a transfer will prevent older timestamps from closing). But applying LAI1 is not sufficient to serve reads at T2: you need a closed timestamp from the new leaseholder.


docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/cases/ceases/

Done.

Don't we need to take the greater of the MPTs of the two ranges or something like that?

no, the merge command acts like a write that absorbs all of the data formerly on the RHS.

If the RHS has a greater closed timestamp than the LHS, I think we have a problem on merge similar to lease transfers. The RHS's promise not to serve writes below some time must be propagated to the merged range. I think if the GetSnapshotForMerge operation counts as a write this probably happens naturally, but I'm not sure that that's the case. Anything that's not a normal txn has a chance of messing this up.


docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Did we ever pin the perf impact on the network i/o? If it's not network I/O but something in-process, then the situation for follower reads could be different (basically because we can incrementally update the state locally as described in the next paragraph).

No, we never pinned it on the network I/O (and in fact it looks like the CPU cost is likely more significant).

So to be clear, is this paragraph arguing that it might be better to send full updates every time than to spend more CPU tracking which replicas are up to date (and potentially more expensive recovery from missed heartbeats)?


docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):

But this strategy can miss necessary updates as leases get transferred to
otherwise inactive ranges. To guard against these rare cases, the second
strategy serves as a fallback.

What is the mechanism for this fallback? +1 to having a fallback instead of "either everything's perfect or it all falls apart" (which we're too prone to), but this is a late addition that I think deserves more discussion.

tbg added a commit to tbg/cockroach that referenced this pull request Jul 10, 2018
Implement the minimum proposal timestamp tracker outlined in the
[follower reads RFC].

The implementation will likely need follow-up work to reduce lock
contention. The intended usage will place a call to Track and to
the returned closure into the write path.

Touches cockroachdb#16593.

Release note: None

[follower reads RFC]: cockroachdb#26362

@tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):

Lease transfers are always proposed at the current time, and timestamps are closed when they are in the past, so a situation in which a lease transfer is proposed at a timestamp prior to the current MPT is unlikely, but I think it is still theoretically possible because lease proposals go through a slightly different path and don't (IIRC) have their timestamp in the header so they wouldn't be affected by the usual rules that enforce the MPT.

I think this is mentioned in the "implied guarantees" section, though I made sure to mention that the guarantee is something we need to make sure holds:

  • the origin store won't (ever) initiate a lease transfer that would allow
    another node to write at or below the closed timestamp. In other words, in the
    case of a lease transfer the next lease would start at a timestamp greater than
    the closed timestamp. This is likely impossible in practice since the transfer
    timestamps and proposed closed timestamps are taken from the same hybrid logical
    clock, but an explicit safeguard will be added just in case.

A conflict in the other direction is more plausible: A lease transfer is proposed at time T1 and LAI 1. It applies, and then we close timestamp T2 with MLAI 1 (even guaranteeing this much may require some additional checks, as above: since the lease transfer timestamp is not in the header, we must make sure that proposing a transfer will prevent older timestamps from closing). But applying LAI1 is not sufficient to serve reads at T2: you need a closed timestamp from the new leaseholder.

I could see your point if we threw all updates from various stores into the same pot and then tried to serve correct reads from that, but the proof always involves the lease known to the replica trying to serve the follower read. Let's say replica1 transfers its lease to replica2 with a start timestamp of 100 at LAI 33, and the closed timestamp (on replica1) is 99 with an MLAI of 32.

While the transfer is in progress and even when the new leaseholder has started serving reads, it's safe for the old one to use (99@32) for follower reads (forever). Now replica1's store closes out a higher timestamp, say (101@33). Note how this has to contain the lease transfer's LAI. You can't use this until caught up to 33. But at 33, your lease is for replica2, and so you're going to have to look for its closed timestamp state instead to prove that you can serve.

Stores only send out information for epochs they know they are live in. But they don't try to send only heartbeats for replicas that they hold the lease for at the closed timestamp. That check is much easier to do on the receiving end.

If you think there is a way to throw all updates into a pot, please let me know. That would be interesting for the recovery use case: right now we'll have to persist a few past leases or we won't be able to look at a data directory including persisted CT state and figure out which reads are possible.


docs/RFCS/20180603_follower_reads.md, line 476 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the RHS has a greater closed timestamp than the LHS, I think we have a problem on merge similar to lease transfers. The RHS's promise not to serve writes below some time must be propagated to the merged range. I think if the GetSnapshotForMerge operation counts as a write this probably happens naturally, but I'm not sure that that's the case. Anything that's not a normal txn has a chance of messing this up.

That's a really good point, I was stuck thinking about the old ways with LHS and RHS' leaseholders colocated. I updated the description. @nvanbenschoten, initially I'm planning to use the timestamp cache here as it's the simplest way of solving this without introducing additional structure. Do you think the cost of doing this relative to the merge operation as a whole is going to matter right off the bat?


docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

No, we never pinned it on the network I/O (and in fact it looks like the CPU cost is likely more significant).

So to be clear, is this paragraph arguing that it might be better to send full updates every time than to spend more CPU tracking which replicas are up to date (and potentially more expensive recovery from missed heartbeats)?

Each node itself also has to keep track of its own aggregated update (well, it doesn't strictly have to, but it makes a lot of sense that it would keep the information it gives to other nodes itself as well, and it's required for recovery). Once you have that, sending the full state every time becomes cheap iff the bandwidth is cheap enough.

I don't think tracking which replicas are up to date is very expensive. We send a full state initially and then we don't really track anything, we just blindly relay updates (and react to requests by peers).

Since we're expecting payloads of multiple megabytes with very very large clusters (or small max_size) and possibly a relatively frequent update cadence (CDC), I think it's not reasonable to send the full state all the time, especially since the implementation doesn't seem that different. But for many clusters it would be perfectly fine and it would make things a little easier, for example by obviating the second update mechanism and reducing the chances of introducing correctness bugs regarding inactive ranges.

Not sure what to do with this section: I personally would just remove it.


docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What is the mechanism for this fallback? +1 to having a fallback instead of "either everything's perfect or it all falls apart" (which we're too prone to), but this is a late addition that I think deserves more discussion.

The communication between origin and recipient store is bidirectional, so the recipient store just asks for additional RangeIDs and receives MLAIs in the next update. I added a paragraph.
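
A sketch of the bidirectional fallback described here; the message shapes and names are hypothetical, not the actual RPC:

```go
package main

import "fmt"

// ctRequest is what the recipient store sends back: the ranges it needs
// fresh MLAIs for because they were not covered by previous updates.
type ctRequest struct {
	RangeIDs []int64
}

// ctUpdate is the next outgoing update from the origin store; requested
// ranges are folded into it alongside the regular entries.
type ctUpdate struct {
	ClosedTS int64
	MLAI     map[int64]uint64
}

// serveRequest folds the requested RangeIDs into the next update.
func serveRequest(req ctRequest, next *ctUpdate, currentLAI func(int64) uint64) {
	for _, id := range req.RangeIDs {
		next.MLAI[id] = currentLAI(id)
	}
}

func main() {
	next := &ctUpdate{ClosedTS: 100, MLAI: map[int64]uint64{}}
	serveRequest(ctRequest{RangeIDs: []int64{7, 9}}, next, func(int64) uint64 { return 42 })
	fmt.Println(next.MLAI) // map[7:42 9:42]
}
```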

@tbg tbg force-pushed the rfc/follower-reads branch from c72e434 to 51d70f9 on July 10, 2018 20:29
@tbg tbg force-pushed the rfc/follower-reads branch from 51d70f9 to 312c0d3 on July 10, 2018 20:31
tbg added a commit to tbg/cockroach that referenced this pull request Jul 11, 2018
Implement the minimum proposal timestamp tracker outlined in the
[follower reads RFC].

The implementation will likely need follow-up work to reduce lock
contention. The intended usage will place a call to Track and to
the returned closure into the write path.

Touches cockroachdb#16593.

Release note: None

[follower reads RFC]: cockroachdb#26362
tbg added a commit to tbg/cockroach that referenced this pull request Jul 11, 2018
Implement the minimum proposal timestamp tracker outlined in the
[follower reads RFC].

The implementation will likely need follow-up work to reduce lock
contention. The intended usage will place a call to Track and to
the returned closure into the write path.

Touches cockroachdb#16593.

Release note: None

[follower reads RFC]: cockroachdb#26362

@bdarnell (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Lease transfers are always proposed at the current time, and timestamps are closed when they are in the past, so a situation in which a lease transfer is proposed at a timestamp prior to the current MPT is unlikely, but I think it is still theoretically possible because lease proposals go through a slightly different path and don't (IIRC) have their timestamp in the header so they wouldn't be affected by the usual rules that enforce the MPT.

I think this is mentioned in the "implied guarantees" section, though I made sure to mention that the guarantee is something we need to make sure holds:

  • the origin store won't (ever) initiate a lease transfer that would allow
    another node to write at or below the closed timestamp. In other words, in the
    case of a lease transfer the next lease would start at a timestamp greater than
    the closed timestamp. This is likely impossible in practice since the transfer
    timestamps and proposed closed timestamps are taken from the same hybrid logical
    clock, but an explicit safeguard will be added just in case.

A conflict in the other direction is more plausible: A lease transfer is proposed at time T1 and LAI 1. It applies, and then we close timestamp T2 with MLAI 1 (even guaranteeing this much may require some additional checks, as above: since the lease transfer timestamp is not in the header, we must make sure that proposing a transfer will prevent older timestamps from closing). But applying LAI1 is not sufficient to serve reads at T2: you need a closed timestamp from the new leaseholder.

I could see your point if we threw all updates from various stores into the same pot and then tried to serve correct reads from that, but the proof always involves the lease known to the replica trying to serve the follower read. Let's say replica1 transfers its lease to replica2 with a start timestamp of 100 at LAI 33, and the closed timestamp (on replica1) is 99 with an MLAI of 32.

While the transfer is in progress and even when the new leaseholder has started serving reads, it's safe for the old one to use (99@32) for follower reads (forever). Now replica1's store closes out a higher timestamp, say (101@33). Note how this has to contain the lease transfer's LAI. You can't use this until caught up to 33. But at 33, your lease is for replica2, and so you're going to have to look for its closed timestamp state instead to prove that you can serve.

Stores only send out information for epochs they know they are live in. But they don't try to send only heartbeats for replicas that they hold the lease for at the closed timestamp. That check is much easier to do on the receiving end.

If you think there is a way to throw all updates into a pot, please let me know. That would be interesting for the recovery use case: right now we'll have to persist a few past leases or we won't be able to look at a data directory including persisted CT state and figure out which reads are possible.

Ack, SGTM. It might be worth mentioning that the lease validity check is performed when the MLAI is reached, not when the CT is received.


docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Each node itself also has to keep track of its own aggregated update (well, it doesn't strictly have to, but it makes a lot of sense that it would keep the information it gives to other nodes itself as well, and it's required for recovery). Once you have that, sending the full state every time becomes cheap iff the bandwidth is cheap enough.

I don't think tracking which replicas are up to date is very expensive. We send a full state initially and then we don't really track anything, we just blindly relay updates (and react to requests by peers).

Since we're expecting payloads of multiple megabytes with very very large clusters (or small max_size) and possibly a relatively frequent update cadence (CDC), I think it's not reasonable to send the full state all the time, especially since the implementation doesn't seem that different. But for many clusters it would be perfectly fine and it would make things a little easier, for example by obviating the second update mechanism and reducing the chances of introducing correctness bugs regarding inactive ranges.

Not sure what to do with this section: I personally would just remove it.

I'm fine with removing this section.


docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The communication between origin and recipient store is bidirectional, so the recipient store just asks for additional RangeIDs and receives MLAIs in the next update. I added a paragraph.

OK, so this is a new RPC instead of piggybacking on raft heartbeats?

Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads. Historical reads include both `AS OF SYSTEM TIME` queries
as well as transactions with a read timestamp sufficiently in the past (for
example long-running analytics queries).

The key enabling technology is the exchange of **closed timestamp updates**
between stores. A closed timestamp update (*CT update*) is a store-wide
timestamp together with (sparse) per-range information on the Raft progress.
Follower replicas use local state built up from successive received CT updates
to ascertain that they have all the state necessary to serve consistent reads
at and below the leaseholder store's closed timestamp.

Follower reads are only possible for epoch-based leases, which includes all user
ranges but excludes some system ranges (such as the addressing metadata ranges).

Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.

Release note: None

@tbg tbg force-pushed the rfc/follower-reads branch from 312c0d3 to b0297b6 on July 18, 2018 15:29

@tbg (Member, Author) left a comment

bors r=bdarnell,nvanbenschoten

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


docs/RFCS/20180603_follower_reads.md, line 464 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Ack, SGTM. It might be worth mentioning that the lease validity check is performed when the MLAI is reached, not when the CT is received.

Done.


docs/RFCS/20180603_follower_reads.md, line 507 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm fine with removing this section.

Done.


docs/RFCS/20180603_follower_reads.md, line 220 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

OK, so this is a new RPC instead of piggybacking on raft heartbeats?

Yep. Trying to introduce most of the code in separate packages and interface it cleanly with the storage package where it needs to touch. See #27595.

craig bot pushed a commit that referenced this pull request Jul 18, 2018
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf

NB: this is extracted from #21056; please don't add new commentary on the
tech note there.

----

Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.

The key enabling technology is the propagation of **closed timestamp
heartbeats** from the range leaseholder to followers. The closed timestamp
heartbeat (CT heartbeat) is more than just a timestamp. It is a set of
conditions, that if met, guarantee that a follower replica has all state
necessary to satisfy reads at or before the CT.

Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for [reference-like tables](#26301) aimed at cutting
down on foreign key check latencies particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for [Change Data
Capture](#25229).

Release note: None

27699: storage: fix stopper race in compactor r=petermattis a=tschottdorf

Starting workers without a surrounding task is unfortunately often not
the right thing to do when the worker accesses other state that might
become invalidated once the stopper begins to stop. In this particular
case, the compactor might end up accessing the engine even though it
had already been closed.

I wasn't able to repro this failure in the first place, but pretty sure
this:
Fixes #27232.

Release note: None

27704: issues: fix email fallback r=petermattis a=tschottdorf

This was not my email address.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>

@craig (craig bot) commented Jul 18, 2018

Build succeeded

@craig craig bot merged commit b0297b6 into cockroachdb:master Jul 18, 2018
@tbg tbg deleted the rfc/follower-reads branch July 26, 2018 09:49