
read from a follower with timestamp bound #16593

Closed
tbg opened this issue Jun 19, 2017 · 15 comments · Fixed by #33478
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Jun 19, 2017

I ended up thinking about this tonight due to a related problem, so here are
some notes. The difficulty is making this zone configurable. Might've missed
something.

Goals

  • assuming a read timestamp far enough in the "past", (usually) be able to read
    from any replica. (think: analytics, time travel queries, backups, queries
    that can't or don't need to pay the latency to a far-away lease holder).
  • configurable on the level of zone configs (i.e. table)

Sketch of implementation

Add a field max_write_age to the zone configs (a value of zero behaves like
MaxUint64). The idea is that the timestamp caches of the affected ranges keep
a low water mark that never trails now - max_write_age. Note that this
effectively limits how long transactions can keep writing to approximately
max_write_age. In turn, when running a read-only transaction, once the
current HLC timestamp has passed read_timestamp + max_write_age + max_offset,
any replica can serve reads.

  1. add a field max_write_age to the lease proto.
  2. whenever a lease is proposed, max_write_age is populated with the value
    the proposer believes is current.
  3. lease extensions must not alter max_write_age. If a lease holder realizes
    that the ZoneConfig's max_write_age has changed, it must request a new lease
    (in practice, it only has to do this in case max_write_age increases) and let
    the old one expire (or transfer its lease away). There is room for optimization
    here: the replica could extend the lease with the new max_write_age, but all
    members must enforce the smaller max_write_ages for as long as the "old"
    version is not expired.
  4. Make DistSender aware of max_write_age. When considering a read-only
    BatchRequest with a timestamp eligible for a follower-served read, consider
    followers, prioritizing those in close proximity.
  5. A follower which receives a read-only batch first checks if the current
    lease is active (not whether it holds the lease itself). If not, it behaves as
    it would today (requests the lease). Otherwise, if it is not the lease holder,
    it checks if the batch timestamp is eligible for a follower-served read based
    on the information in the lease and the current timestamp. If so, it serves it
    (it does not need to update the timestamp cache).
  6. on writes that violate now - max_write_age < write_ts, behave as if there
    were a timestamp cache entry at now.
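
To make the eligibility rule concrete, here is a minimal Go sketch of steps 5 and 6, assuming a wall-time-only timestamp and hypothetical type/helper names (this is not CockroachDB's actual API):

```go
package followerreads

import "time"

// Timestamp is a simplified stand-in for an HLC timestamp (wall time only).
type Timestamp struct {
	WallTime int64 // nanoseconds
}

// Lease carries the max_write_age that was in effect when the lease was
// proposed (step 2 above). The shape is illustrative only.
type Lease struct {
	MaxWriteAge time.Duration
}

// canServeFollowerRead mirrors step 5: a follower may serve a read-only batch
// if the batch timestamp is old enough that, under the max_write_age rule, no
// future write can invalidate it, even accounting for clock offset.
func canServeFollowerRead(batchTS, now Timestamp, lease Lease, maxOffset time.Duration) bool {
	if lease.MaxWriteAge == 0 {
		// A zero max_write_age behaves like MaxUint64: followers never qualify.
		return false
	}
	cutoff := batchTS.WallTime + lease.MaxWriteAge.Nanoseconds() + maxOffset.Nanoseconds()
	return now.WallTime > cutoff
}

// writeAllowed mirrors step 6: a write below now - max_write_age behaves as if
// the timestamp cache had an entry at now, i.e. it must be pushed.
func writeAllowed(writeTS, now Timestamp, lease Lease) bool {
	if lease.MaxWriteAge == 0 {
		return true // no write-age limit in effect
	}
	return writeTS.WallTime >= now.WallTime-lease.MaxWriteAge.Nanoseconds()
}
```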

An interesting observation is that this can also be modified to allow serving
read queries when Raft completely breaks down (think all inter-DC connections
fail): a replica can always serve what is "safe" based on the last known lease.
There is much more work to do to get these replicas to agree on a timestamp,
though. The resulting syntax could be something along the lines of

SELECT (...) AS OF SYSTEM TIME STALE

and DistSender would consult its cache to find the minimal timestamp covered
by all leases (but even that timestamp may not work).

Caveats

  • This relies on clocks and thus on MaxOffset, plus on not having goroutines
    stalled in inconvenient locations (such a stall would violate MaxOffset too,
    but would be very unlikely to be caught): if a write passes the check but then
    gets delayed until the check no longer holds, followers may serve reads that
    are then invalidated once the delayed write proceeds. (This does not seem more
    fragile than what we already have with our read leases, though.)
  • If a Range isn't split along ZoneConfig boundaries, the more restrictive
    max_write_age will be in effect.
@tbg
Member Author

tbg commented Jun 19, 2017

cc @arjunravinarayan

@bdarnell
Contributor

For expiration-based leases, knowing the current lease is confirmation that you're reasonably up-to-date. But for epoch-based leases (i.e. for all regular tables), it doesn't tell you much. A replica could be arbitrarily far behind (or even be removed from the range!) and still pass the test in step 5.

Let's ditch the current time from this protocol altogether. Instead, each replica tracks a max_write_timestamp, updated as commands are applied. Leaders also track a max_proposed_write_timestamp, and guarantee not to propose a write with a timestamp less than max_proposed_write_timestamp - max_write_age (TODO: how does this interact with reproposals?). Followers can serve reads with timestamps less than max_write_timestamp - max_write_age. Reads newer than this time would get a NotLeaseHolderError as usual.

When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.
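
A minimal sketch of the replica-side rule described here, with hypothetical field and error names (nanosecond wall-clock timestamps; not the actual CockroachDB types):

```go
package followerreads

import (
	"errors"
	"time"
)

// errNotLeaseHolder stands in for the NotLeaseHolderError mentioned above.
var errNotLeaseHolder = errors.New("read too new for this follower; retry at the lease holder")

// replica is a hypothetical per-replica view of the tracked state: maxWriteTS
// advances as commands apply, and the leader separately promises never to
// propose a write below maxProposedWriteTS - maxWriteAge.
type replica struct {
	isLeaseHolder bool
	maxWriteTS    int64 // highest applied write timestamp, in nanoseconds
	maxWriteAge   time.Duration
}

// checkRead applies the rule: followers may serve reads with timestamps below
// maxWriteTS - maxWriteAge; anything newer is redirected to the lease holder.
func (r *replica) checkRead(readTS int64) error {
	if r.isLeaseHolder {
		return nil // normal lease holder read path
	}
	if readTS < r.maxWriteTS-r.maxWriteAge.Nanoseconds() {
		return nil // safe: no future proposal can land at or below readTS
	}
	return errNotLeaseHolder
}
```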

@tbg
Member Author

tbg commented Jun 19, 2017

You're right, what I was proposing gives you only consistent, not up-to-date data. I was hoping to avoid having a steady stream of Raft traffic when there isn't any write activity, but that's not possible without something like quorum leases.

@a-robinson
Contributor

Thanks for writing this up. I've also been thinking about this lately as a side project, but hadn't gotten nearly as far into the details.

Do we have any sense of how far in the past our timestamp cache low water mark is in typical deployments? Or how far in the past the users who have asked about this are OK with their reads being? We'd want to have some idea before going too far with this.

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

@rjnn
Contributor

rjnn commented Jun 20, 2017

The informal thinking around the watermark is to trail it at ~10 seconds, but that number isn't particularly well rationalized. We do not raise the watermark eagerly, since there is currently no need to do so, and keeping it as far back as possible retains the maximum flexibility. Certain future use cases (e.g. Naiad) have the opposite incentive: Naiad wants the watermark raised as aggressively as possible, since materialization of materialized views is delayed until the watermark has passed on all ranges. We could add a debug flag to show empirically where our watermarks are - do you think that information would be useful to know?

@tbg
Member Author

tbg commented Jun 20, 2017

The low watermark field isn't the effective low watermark. For example, the low watermark could be t1 while an all-encompassing span sits at t2. I don't think it's easy to measure, and it depends a lot on the workload and the cache size.

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Replicas will just refuse what they can't serve, and if we use the eager path Ben suggested, you should "usually" not get refused, assuming the timestamp you chose is safe based on your local HLC and MaxOffset. For example, if a request ends up at the lease holder but should have been safe to serve from a follower (and perhaps was tried there first), the lease holder will, for the next X seconds, eagerly bump max_write_timestamp.

We need to somehow limit the number of proposals we're sending due to this. If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce. Since this information doesn't necessarily have to flow through Raft, we could hoist it up to the node level and, say, send (inverted) bloom filters of RangeIDs which were bumped to a common timestamp. I imagine that'll get us further than reality could go for a while.
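
A toy illustration of that node-level batching idea, using an explicit RangeID list in place of the bloom filter (names and shapes are assumptions, not an existing CockroachDB API):

```go
package followerreads

// closedTimestampUpdate is a hypothetical node-level message that bumps many
// ranges' max_write_timestamp to a common value at once, instead of issuing
// one Raft proposal per range. In practice the range set could be an
// (inverted) bloom filter rather than an explicit list.
type closedTimestampUpdate struct {
	ClosedAt int64   // common timestamp (ns) below which no new writes will be proposed
	RangeIDs []int64 // ranges covered by this update
}

// apply records the update on the receiving node, keeping the maximum closed
// timestamp seen per range.
func apply(u closedTimestampUpdate, closed map[int64]int64) {
	for _, id := range u.RangeIDs {
		if u.ClosedAt > closed[id] {
			closed[id] = u.ClosedAt
		}
	}
}
```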

@bdarnell
Contributor

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.

If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce.

That's true if all ranges are receiving read traffic via multiple replicas. But that's not necessarily true. There's a hierarchy of range activity:

  1. Write traffic: consensus with every operation
  2. Read traffic spread throughout the cluster: requires either A) high-latency remote reads from the lease holder, B) periodic (but potentially frequent) consensus updates to spread timestamps, or C) quorum leases
  3. Read traffic via a single node: rebalance the lease holder there and operate in read-only/quiesced mode
  4. Inactive: fully quiesced

@tbg
Member Author

tbg commented Jun 23, 2017

@spencerkimball, curious to hear your proposal on this.

@a-robinson
Contributor

a-robinson commented Jun 23, 2017 via email

@petermattis
Collaborator

The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.

FYI (in case you missed it), TimestampCache.lowWater is only used for testing. The timestamp cache is now shared by all replicas and we update the low water for a replica by using a span which encompasses the entire range.

When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.

Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.

TODO: how does this interact with reproposals?

This seems to be the biggest hole in this approach.

@bdarnell
Contributor

Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.

We need to guarantee that any future lease holder will know about this timestamp. Writes accomplish this since they are replicated via the raft log. Quiesce messages do not since there is no guarantee that they will reach a quorum. We could poll a quorum without going all the way through the raft log, but that would need to be a new mechanism instead of piggybacking on quiescence.

@tbg
Member Author

tbg commented Jul 5, 2017

The Storage-Level Change Feed Primitive has strong connections with follower reads. But in particular, there is one requirement that hasn't been discussed here (quote below by @bdarnell):

For transactional writes, the change feed event is emitted when the intent is resolved, instead of when the original write happens. For follower reads, we've discussed implementing close timestamps via the timestamp cache or something like it, with the semantics that "no more writes with a timestamp less than T will be accepted". However, there could still be unresolved intents with timestamps less than T. For change feeds, we require that "no new events with timestamp less than T will be emitted", so we must resolve all intents before we can advance the close timestamp.

The interesting case here is the one in which a timestamp is made available for follower reads, but an intent is still visible at that timestamp. This is fine for follower reads, though a bit awkward, as intent resolution must be carried out first and that takes time. Avoiding this situation would (mostly) create parity with the needs of the change feed primitive, but it is awkward in its own right, since we wouldn't be able to raise the safe timestamp until all intents below it are gone, and that is hard to accomplish given that we don't know where the intents are.
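
As a rough sketch of the change-feed constraint quoted above (illustrative only, not the actual closed timestamp logic):

```go
package followerreads

// closableTimestamp holds the closed ("safe") timestamp just below any
// unresolved intent: advancing past an intent would allow an event for it to
// later appear below the closed timestamp.
func closableTimestamp(candidate int64, unresolvedIntentTS []int64) int64 {
	closed := candidate
	for _, ts := range unresolvedIntentTS {
		if ts <= closed {
			closed = ts - 1
		}
	}
	return closed
}
```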

@danhhz danhhz added this to the 2.1 milestone Mar 5, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg tbg assigned tbg and unassigned spencerkimball May 24, 2018
@tomholub

Just wanted to chip in with a bit of encouragement. The option to read from non-lease holders is super useful - please keep this up.

@a-robinson
Contributor

Thanks @tomholub. Work has continued (#21056) and we expect to include a version of this feature in this fall's 2.1 release.

tbg added a commit to tbg/cockroach that referenced this issue Jul 11, 2018
Implement the minimum proposal timestamp tracker outlined in the
[follower reads RFC].

The implementation will likely need follow-up work to reduce lock
contention. The intended usage will place a call to Track and to
the returned closure into the write path.

Touches cockroachdb#16593.

Release note: None

[follower reads RFC]: cockroachdb#26362
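
For readers following along, here is a much-simplified sketch of the Track-and-closure pattern the commit message describes; all names and signatures are illustrative, not the actual package API:

```go
package minprop

import "sync"

// Tracker is a toy minimum-proposal-timestamp tracker: the write path calls
// Track before proposing and invokes the returned closure when the proposal
// has been handed off, so the tracker knows the lowest timestamp still in flight.
type Tracker struct {
	mu       sync.Mutex
	inFlight map[int64]int // number of in-flight proposals per timestamp (ns)
}

func NewTracker() *Tracker {
	return &Tracker{inFlight: make(map[int64]int)}
}

// Track registers a proposal at ts and returns the closure to call once the
// proposal is done.
func (t *Tracker) Track(ts int64) (done func()) {
	t.mu.Lock()
	t.inFlight[ts]++
	t.mu.Unlock()
	return func() {
		t.mu.Lock()
		t.inFlight[ts]--
		if t.inFlight[ts] == 0 {
			delete(t.inFlight, ts)
		}
		t.mu.Unlock()
	}
}

// CanClose reports whether no proposals remain in flight below the candidate
// timestamp, i.e. whether it is safe to publish it as closed to followers.
func (t *Tracker) CanClose(candidate int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for ts := range t.inFlight {
		if ts < candidate {
			return false
		}
	}
	return true
}
```

Usage on the write path would then look roughly like `done := tracker.Track(ts); defer done()` around the proposal.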
craig bot pushed a commit that referenced this issue Jul 11, 2018
26941: storage: add min proposal timestamp tracker r=nvanbenschoten a=tschottdorf

This extracts the code for the min proposal timestamp used in the
prototype for follower reads into a new package with associated
testing and commentary.

Touches #16593.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tim-o
Contributor

tim-o commented Aug 12, 2018

Zendesk ticket #2720 has been linked to this issue.

craig bot pushed a commit that referenced this issue Aug 13, 2018
28091: storage: serve reads based on closed timestamps r=nvanbenschoten a=tschottdorf

This "finishes" hooking up the storage side of closed timestamps. After
checking their lease and deciding that the lease does not allow serving
a read, replicas now check whether they can serve the batch as a
follower read. This requires that the range is epoch based, that
appropriate information is stored in the closed timestamp subsystem, and
finally that the cluster setting to enable this is set.

Added a test that verifies that a test server will serve follower reads
(directly from the replicas, without routing through DistSender).

Introducing machinery at the distributed sender to actually consider
routing reads to follower replicas is the next step.

TODO: take perf numbers before/after this change to verify that there
isn't a noticeable regression.

Touches #16593.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
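
Schematically, the gate described in this PR reduces to something like the following (field and function names are illustrative; the real check lives in the replica read path and consults the closed timestamp subsystem and a cluster setting):

```go
package followerreads

// followerReadAllowed sketches the conditions listed above: the cluster
// setting must be enabled, the range must use an epoch-based lease, and the
// batch timestamp must be at or below the closed timestamp known for the range.
func followerReadAllowed(settingEnabled, epochBasedLease bool, batchTS, closedTS int64) bool {
	return settingEnabled && epochBasedLease && batchTS <= closedTS
}
```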
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@ajwerner ajwerner self-assigned this Jan 3, 2019
craig bot pushed a commit that referenced this issue Jan 30, 2019
33474: docs: add RFC to expose follower reads to clients r=ajwerner a=ajwerner

Relates to #16593.
Release note: None

34399: storage: fix NPE while printing trivial truncateDecision r=bdarnell a=tbg

Fixes #34398.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
craig bot pushed a commit that referenced this issue Feb 5, 2019
33478: sql,kv,followerreadsccl: enable follower reads for historical queries r=ajwerner a=ajwerner

Follower reads are reads which can be served from any replica as opposed to just
the current lease holder. The foundation for this change was laid with the work
to introduce closed timestamps and to support follower reads at the replica
level. This change adds the required support to the sql and kv layers and
additionally exposes a new syntax to ease client adoption of the functionality.

The change adds the followerreadsccl package with logic to check when follower
reads are safe and to inject the functionality so that it can be packaged as an
enterprise feature.

Modifies `AS OF SYSTEM TIME` semantics to allow for the evaluation of a new
builtin tentatively called `follower_read_timestamp()` in addition to constant
expressions. This new builtin ensures that an enterprise license exists and then
returns a time that can likely be used to read from a follower.

The change abstracts (and renames to the more appropriate replicaoracle) the
existing leaseHolderOracle in the distsqlplan package to allow a follower read
aware policy to be injected.

Lastly, the change adds to kv a site to inject a function for checking whether
follower reads are safe and allowed given a cluster, settings, and batch request.

This change includes a high-level roachtest which validates the observable behavior
of performing follower reads by examining read latencies in a geo-replicated
setting.

Implements #33474 
Fixes #16593

Release note (enterprise change): Add support for performing sufficiently old
historical reads against the closest replicas rather than leaseholders. A new
builtin function `follower_read_timestamp()` can be used with `AS OF SYSTEM
TIME` clauses to generate a timestamp which is likely to be safe for reads from
a follower.

Co-authored-by: Andrew Werner <[email protected]>
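
As a rough picture of what a builtin like `follower_read_timestamp()` has to compute, consider this sketch; the parameter names and the exact slack are assumptions, not the shipped constants:

```go
package followerreads

import "time"

// canUseFollowerReads reports whether a read timestamp lags the present by
// enough (the closed timestamp target interval plus some slack) that the read
// is likely to be servable by the nearest replica rather than the lease holder.
func canUseFollowerReads(readTS, now time.Time, closedTSTarget, slack time.Duration) bool {
	return now.Sub(readTS) > closedTSTarget+slack
}

// A builtin would then return roughly now - (closedTSTarget + slack) as the
// suggested timestamp for an AS OF SYSTEM TIME clause.
func followerReadTimestamp(now time.Time, closedTSTarget, slack time.Duration) time.Time {
	return now.Add(-(closedTSTarget + slack))
}
```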
@craig craig bot closed this as completed in #33478 Feb 5, 2019