
read from a follower with timestamp bound #16593

Closed
tbg opened this issue Jun 19, 2017 · 15 comments · Fixed by #33478
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Jun 19, 2017

I ended up thinking about this tonight due to a related problem, so here are
some notes. The difficulty is making this zone configurable. Might've missed
something.

Goals

  • assuming a read timestamp far enough in the "past", (usually) be able to read
    from any replica. (think: analytics, time travel queries, backups, queries
    that can't or don't need to pay the latency to a far-away lease holder).
  • configurable on the level of zone configs (i.e. table)

Sketch of implementation

Add a field max_write_age to the zone configs (a value of zero behaves like
MaxUint64). The idea is that the timestamp caches of the affected ranges keep
a low water mark that never trails now - max_write_age. Note that this
effectively limits how long transactions can keep writing to approximately
max_write_age. In turn, when running a read-only transaction, once the
current HLC timestamp has passed read_timestamp + max_write_age + max_offset,
any replica can serve reads.

  1. add a field max_write_age to the lease proto.
  2. whenever a lease is proposed, max_write_age is populated with the value
    the proposer believes is current.
  3. lease extensions must not alter max_write_age. If a lease holder realizes
    that the ZoneConfig's max_write_age has changed, it must request a new lease
    (in practice, it only has to do this in case max_write_age increases) and let
    the old one expire (or transfer its lease away). There is room for optimization
    here: the replica could extend the lease with the new max_write_age, but all
    members must enforce the smaller max_write_ages for as long as the "old"
    version is not expired.
  4. Make DistSender aware of max_write_age. When considering a read-only
    BatchRequest with a timestamp eligible for a follower-served read, consider
    followers, prioritizing those in close proximity.
  5. A follower which receives a read-only batch first checks if the current
    lease is active (not whether it holds the lease itself). If not, it behaves as
    it would today (requests the lease). Otherwise, if it is not the lease holder,
    it checks if the batch timestamp is eligible for a follower-served read based
    on the information in the lease and the current timestamp. If so, it serves it
    (it does not need to update the timestamp cache).
  6. on writes that violate now - max_write_age < write_ts, behave as if there
    were a timestamp cache entry at now.
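
To make the eligibility rule concrete, here is a minimal Go sketch of steps 5 and 6, assuming a wall-time-only timestamp and hypothetical type/helper names (this is not CockroachDB's actual API):

```go
package followerreads

import "time"

// Timestamp is a simplified stand-in for an HLC timestamp (wall time only).
type Timestamp struct {
	WallTime int64 // nanoseconds
}

// Lease carries the max_write_age that was in effect when the lease was
// proposed (step 2 above). The shape is illustrative only.
type Lease struct {
	MaxWriteAge time.Duration
}

// canServeFollowerRead mirrors step 5: a follower may serve a read-only batch
// if the batch timestamp is old enough that, under the max_write_age rule, no
// future write can invalidate it, even accounting for clock offset.
func canServeFollowerRead(batchTS, now Timestamp, lease Lease, maxOffset time.Duration) bool {
	if lease.MaxWriteAge == 0 {
		// A zero max_write_age behaves like MaxUint64: followers never qualify.
		return false
	}
	cutoff := batchTS.WallTime + lease.MaxWriteAge.Nanoseconds() + maxOffset.Nanoseconds()
	return now.WallTime > cutoff
}

// writeAllowed mirrors step 6: a write below now - max_write_age behaves as if
// the timestamp cache had an entry at now, i.e. it must be pushed.
func writeAllowed(writeTS, now Timestamp, lease Lease) bool {
	if lease.MaxWriteAge == 0 {
		return true // no write-age limit in effect
	}
	return writeTS.WallTime >= now.WallTime-lease.MaxWriteAge.Nanoseconds()
}
```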

An interesting observation is that this can also be modified to allow serving
read queries when Raft completely breaks down (think all inter-DC connections
fail): a replica can always serve what is "safe" based on the last known lease.
There is much more work to do to get these replicas to agree on a timestamp,
though. The resulting syntax could be something along the lines of

SELECT (...) AS OF SYSTEM TIME STALE

and DistSender would consult its cache to find the minimal timestamp covered
by all leases (but even that timestamp may not work).

Caveats

  • This relies on clocks and thus on MaxOffset, plus on not having goroutines
    stalled in inconvenient locations (such a stall would violate MaxOffset too,
    but would be very unlikely to be caught): if a write passes the check but then
    gets delayed until the check no longer holds, followers may serve reads that
    are then invalidated once the delayed write proceeds. (This does not seem more
    fragile than what we already have with our read leases, though.)
  • If a Range isn't split along ZoneConfig boundaries, the more restrictive
    max_write_age will be in effect.
@tbg
Member Author

tbg commented Jun 19, 2017

cc @arjunravinarayan

@bdarnell
Contributor

For expiration-based leases, knowing the current lease is confirmation that you're reasonably up-to-date. But for epoch-based leases (i.e. for all regular tables), it doesn't tell you much. A replica could be arbitrarily far behind (or even be removed from the range!) and still pass the test in step 5.

Let's ditch the current time from this protocol altogether. Instead, each replica tracks a max_write_timestamp, updated as commands are applied. Leaders also track a max_proposed_write_timestamp, and guarantee not to propose a write with a timestamp less than max_proposed_write_timestamp - max_write_age (TODO: how does this interact with reproposals?). Followers can serve reads with timestamps less than max_write_timestamp - max_write_age. Reads newer than this time would get a NotLeaseHolderError as usual.

When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.
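
A minimal sketch of the replica-side rule described here, with hypothetical field and error names (nanosecond wall-clock timestamps; not the actual CockroachDB types):

```go
package followerreads

import (
	"errors"
	"time"
)

// errNotLeaseHolder stands in for the NotLeaseHolderError mentioned above.
var errNotLeaseHolder = errors.New("read too new for this follower; retry at the lease holder")

// replica is a hypothetical per-replica view of the tracked state: maxWriteTS
// advances as commands apply, and the leader separately promises never to
// propose a write below maxProposedWriteTS - maxWriteAge.
type replica struct {
	isLeaseHolder bool
	maxWriteTS    int64 // highest applied write timestamp, in nanoseconds
	maxWriteAge   time.Duration
}

// checkRead applies the rule: followers may serve reads with timestamps below
// maxWriteTS - maxWriteAge; anything newer is redirected to the lease holder.
func (r *replica) checkRead(readTS int64) error {
	if r.isLeaseHolder {
		return nil // normal lease holder read path
	}
	if readTS < r.maxWriteTS-r.maxWriteAge.Nanoseconds() {
		return nil // safe: no future proposal can land at or below readTS
	}
	return errNotLeaseHolder
}
```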

@tbg
Member Author

tbg commented Jun 19, 2017

You're right, what I was proposing gives you only consistent, not up-to-date data. I was hoping to avoid having a steady stream of Raft traffic when there isn't any write activity, but that's not possible without something like quorum leases.

@a-robinson
Contributor

Thanks for writing this up. I've also been thinking about this lately as a side project, but hadn't gotten nearly as far into the details.

Do we have any sense of how far in the past our timestamp cache low water mark is in typical deployments? Or how far in the past the users who have asked about this are OK with their reads being? We'd want to have some idea before going too far with this.

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

@rjnn
Contributor

rjnn commented Jun 20, 2017

The informal thinking around the watermark is to trail it at ~10 seconds, but that number isn't particularly well rationalized. We do not raise the watermark eagerly, since there is currently no need to do so, and keeping it as far back as possible retains the maximum flexibility. Certain future use cases (e.g. Naiad) have the opposite incentive: Naiad wants the watermark raised as aggressively as possible, since materialization of materialized views is delayed until the watermark has passed on all ranges. We could add a debug flag to show empirically where our watermarks are - do you think that information would be useful to know?

@tbg
Member Author

tbg commented Jun 20, 2017

The low watermark field isn't the effective low watermark. For example, the low watermark could be t1 while an all-encompassing span sits at t2. I don't think it's easy to measure, and it depends a lot on the workload and the cache size.

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Replicas will just refuse what they can't serve, and if we use the eager path Ben suggested, you should "usually" not get refused, assuming the timestamp you chose is safe based on your local HLC and MaxOffset. For example, if a request ends up at the lease holder but should have been safe to serve from a follower (and perhaps was tried there first), the lease holder will, for the next X seconds, eagerly bump max_write_timestamp.

We need to somehow limit the number of proposals we're sending due to this. If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce. Since this information doesn't necessarily have to flow through Raft, we could hoist it up to the node level and, say, send (inverted) bloom filters of RangeIDs which were bumped to a common timestamp. I imagine that'll get us further than reality could go for a while.
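
A toy illustration of that node-level batching idea, using an explicit RangeID list in place of the bloom filter (names and shapes are assumptions, not an existing CockroachDB API):

```go
package followerreads

// closedTimestampUpdate is a hypothetical node-level message that bumps many
// ranges' max_write_timestamp to a common value at once, instead of issuing
// one Raft proposal per range. In practice the range set could be an
// (inverted) bloom filter rather than an explicit list.
type closedTimestampUpdate struct {
	ClosedAt int64   // common timestamp (ns) below which no new writes will be proposed
	RangeIDs []int64 // ranges covered by this update
}

// apply records the update on the receiving node, keeping the maximum closed
// timestamp seen per range.
func apply(u closedTimestampUpdate, closed map[int64]int64) {
	for _, id := range u.RangeIDs {
		if u.ClosedAt > closed[id] {
			closed[id] = u.ClosedAt
		}
	}
}
```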

@bdarnell
Contributor

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.

If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce.

That's true if all ranges are receiving read traffic via multiple replicas. But that's not necessarily true. There's a hierarchy of range activity:

  1. Write traffic: consensus with every operation
  2. Read traffic spread throughout the cluster: requires either A) high-latency remote reads from the lease holder, B) periodic (but potentially frequent) consensus updates to spread timestamps, or C) quorum leases
  3. Read traffic via a single node: rebalance the lease holder there and operate in read-only/quiesced mode
  4. Inactive: fully quiesced

@tbg
Member Author

tbg commented Jun 23, 2017

@spencerkimball, curious to hear your proposal on this.

@a-robinson
Contributor

a-robinson commented Jun 23, 2017 via email

@petermattis
Collaborator

The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.

FYI (in case you missed it), TimestampCache.lowWater is only used for testing. The timestamp cache is now shared by all replicas and we update the low water for a replica by using a span which encompasses the entire range.

When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.

Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.

TODO: how does this interact with reproposals?

This seems to be the biggest hole in this approach.

@bdarnell
Contributor

Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.

We need to guarantee that any future lease holder will know about this timestamp. Writes accomplish this since they are replicated via the raft log. Quiesce messages do not since there is no guarantee that they will reach a quorum. We could poll a quorum without going all the way through the raft log, but that would need to be a new mechanism instead of piggybacking on quiescence.

@tbg
Member Author

tbg commented Jul 5, 2017

The Storage-Level Change Feed Primitive has strong connections with follower reads. But in particular, there is one requirement that hasn't been discussed here (quote below by @bdarnell):

For transactional writes, the change feed event is emitted when the intent is resolved, instead of when the original write happens. For follower reads, we've discussed implementing close timestamps via the timestamp cache or something like it, with the semantics that "no more writes with a timestamp less than T will be accepted". However, there could still be unresolved intents with timestamps less than T. For change feeds, we require that "no new events with timestamp less than T will be emitted", so we must resolve all intents before we can advance the close timestamp.

The interesting case here is the one in which a timestamp is made available for follower reads, but an intent is still visible at that timestamp. This is fine for follower reads, though a bit awkward, as intent resolution must be carried out first and that takes time. Avoiding this situation would (mostly) create parity with the needs of the change feed primitive, but it is awkward in its own right, since we wouldn't be able to raise the safe timestamp until all intents below it are gone, and that is hard to accomplish given that we don't know where the intents are.
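
As a rough sketch of the change-feed constraint quoted above (illustrative only, not the actual closed timestamp logic):

```go
package followerreads

// closableTimestamp holds the closed ("safe") timestamp just below any
// unresolved intent: advancing past an intent would allow an event for it to
// later appear below the closed timestamp.
func closableTimestamp(candidate int64, unresolvedIntentTS []int64) int64 {
	closed := candidate
	for _, ts := range unresolvedIntentTS {
		if ts <= closed {
			closed = ts - 1
		}
	}
	return closed
}
```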

@danhhz danhhz added this to the 2.1 milestone Mar 5, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg tbg assigned tbg and unassigned spencerkimball May 24, 2018
@tomholub

Just wanted to chip in with a bit of encouragement. The option to read from non-lease holders is super useful - please keep this up.

@a-robinson
Contributor

Thanks @tomholub. Work has continued (#21056) and we expect to include a version of this feature in this fall's 2.1 release.

tbg added a commit to tbg/cockroach that referenced this issue Jul 11, 2018
Implement the minimum proposal timestamp tracker outlined in the
[follower reads RFC].

The implementation will likely need follow-up work to reduce lock
contention. The intended usage will place a call to Track and to
the returned closure into the write path.

Touches cockroachdb#16593.

Release note: None

[follower reads RFC]: cockroachdb#26362
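
For readers following along, here is a much-simplified sketch of the Track-and-closure pattern the commit message describes; all names and signatures are illustrative, not the actual package API:

```go
package minprop

import "sync"

// Tracker is a toy minimum-proposal-timestamp tracker: the write path calls
// Track before proposing and invokes the returned closure when the proposal
// has been handed off, so the tracker knows the lowest timestamp still in flight.
type Tracker struct {
	mu       sync.Mutex
	inFlight map[int64]int // number of in-flight proposals per timestamp (ns)
}

func NewTracker() *Tracker {
	return &Tracker{inFlight: make(map[int64]int)}
}

// Track registers a proposal at ts and returns the closure to call once the
// proposal is done.
func (t *Tracker) Track(ts int64) (done func()) {
	t.mu.Lock()
	t.inFlight[ts]++
	t.mu.Unlock()
	return func() {
		t.mu.Lock()
		t.inFlight[ts]--
		if t.inFlight[ts] == 0 {
			delete(t.inFlight, ts)
		}
		t.mu.Unlock()
	}
}

// CanClose reports whether no proposals remain in flight below the candidate
// timestamp, i.e. whether it is safe to publish it as closed to followers.
func (t *Tracker) CanClose(candidate int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for ts := range t.inFlight {
		if ts < candidate {
			return false
		}
	}
	return true
}
```

Usage on the write path would then look roughly like `done := tracker.Track(ts); defer done()` around the proposal.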
craig bot pushed a commit that referenced this issue Jul 11, 2018
26941: storage: add min proposal timestamp tracker r=nvanbenschoten a=tschottdorf

This extracts the code for the min proposal timestamp used in the
prototype for follower reads into a new package with associated
testing and commentary.

Touches #16593.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tim-o
Contributor

tim-o commented Aug 12, 2018

Zendesk ticket #2720 has been linked to this issue.

craig bot pushed a commit that referenced this issue Aug 13, 2018
28091: storage: serve reads based on closed timestamps r=nvanbenschoten a=tschottdorf

This "finishes" hooking up the storage side of closed timestamps. After
checking their lease and deciding that the lease does not allow serving
a read, replicas now check whether they can serve the batch as a
follower read. This requires that the range is epoch based, that
appropriate information is stored in the closed timestamp subsystem, and
finally that the cluster setting to enable this is set.

Added a test that verifies that a test server will serve follower reads
(directly from the replicas, without routing through DistSender).

Introducing machinery at the distributed sender to actually consider
routing reads to follower replicas is the next step.

TODO: take perf numbers before/after this change to verify that there
isn't a noticeable regression.

Touches #16593.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
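
Schematically, the gate described in this PR reduces to something like the following (field and function names are illustrative; the real check lives in the replica read path and consults the closed timestamp subsystem and a cluster setting):

```go
package followerreads

// followerReadAllowed sketches the conditions listed above: the cluster
// setting must be enabled, the range must use an epoch-based lease, and the
// batch timestamp must be at or below the closed timestamp known for the range.
func followerReadAllowed(settingEnabled, epochBasedLease bool, batchTS, closedTS int64) bool {
	return settingEnabled && epochBasedLease && batchTS <= closedTS
}
```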
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@ajwerner ajwerner self-assigned this Jan 3, 2019
craig bot pushed a commit that referenced this issue Jan 30, 2019
33474: docs: add RFC to expose follower reads to clients r=ajwerner a=ajwerner

Relates to #16593.
Release note: None

34399: storage: fix NPE while printing trivial truncateDecision r=bdarnell a=tbg

Fixes #34398.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
craig bot pushed a commit that referenced this issue Feb 5, 2019
33478: sql,kv,followerreadsccl: enable follower reads for historical queries r=ajwerner a=ajwerner

Follower reads are reads which can be served from any replica as opposed to just
the current lease holder. The foundation for this change was laid with the work
to introduce closed timestamps and to support follower reads at the replica
level. This change adds the required support to the sql and kv layers and
additionally exposes a new syntax to ease client adoption of the functionality.

The change adds the followerreadsccl package with logic to check when follower
reads are safe and to inject the functionality so that it can be packaged as an
enterprise feature.

Modifies `AS OF SYSTEM TIME` semantics to allow for the evaluation of a new
builtin tentatively called `follower_read_timestamp()` in addition to constant
expressions. This new builtin ensures that an enterprise license exists and then
returns a time that can likely be used to read from a follower.

The change abstracts (and renames to the more appropriate replicaoracle) the
existing leaseHolderOracle in the distsqlplan package to allow a follower read
aware policy to be injected.

Lastly, the change adds to kv a site to inject a function for checking whether
follower reads are safe and allowed given a cluster, settings, and batch request.

This change includes a high-level roachtest which validates the observable behavior
of performing follower reads by examining read latencies in a geo-replicated
setting.

Implements #33474 
Fixes #16593

Release note (enterprise change): Add support for performing sufficiently old
historical reads against the closest replicas rather than leaseholders. A new
builtin function `follower_read_timestamp()` can be used with `AS OF SYSTEM
TIME` clauses to generate a timestamp which is likely to be safe for reads from
a follower.

Co-authored-by: Andrew Werner <[email protected]>
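
As a rough picture of what a builtin like `follower_read_timestamp()` has to compute, consider this sketch; the parameter names and the exact slack are assumptions, not the shipped constants:

```go
package followerreads

import "time"

// canUseFollowerReads reports whether a read timestamp lags the present by
// enough (the closed timestamp target interval plus some slack) that the read
// is likely to be servable by the nearest replica rather than the lease holder.
func canUseFollowerReads(readTS, now time.Time, closedTSTarget, slack time.Duration) bool {
	return now.Sub(readTS) > closedTSTarget+slack
}

// A builtin would then return roughly now - (closedTSTarget + slack) as the
// suggested timestamp for an AS OF SYSTEM TIME clause.
func followerReadTimestamp(now time.Time, closedTSTarget, slack time.Duration) time.Time {
	return now.Add(-(closedTSTarget + slack))
}
```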
@craig craig bot closed this as completed in #33478 Feb 5, 2019