read from a follower with timestamp bound #16593
Comments
cc @arjunravinarayan
For expiration-based leases, knowing the current lease is confirmation that you're reasonably up-to-date. But for epoch-based leases (i.e. for all regular tables), it doesn't tell you much. A replica could be arbitrarily far behind (or even be removed from the range!) and still pass the test in step 5. Let's ditch the current time from this protocol altogether. Instead, each replica tracks a `max_write_timestamp`. When serving a read older than `max_write_timestamp`, the replica can respond itself; anything newer it refuses, and the read has to go to the lease holder.
You're right, what I was proposing gives you only consistent, not up-to-date data. I was hoping to avoid having a steady stream of Raft traffic when there isn't any write activity, but that's not possible without something like quorum leases.
Thanks for writing this up, I've been thinking about this lately as well as a side project but hadn't gotten nearly as far into the details. Do we have any sense of how far in the past our timestamp cache low watermark is in typical deployments? Or how far in the past users that have asked about this are ok with having their reads be? We'd want to have some idea before going too far with this. Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the `max_write_timestamp`-per-replica approach.
The informal thinking around the watermark is to trail it at ~10 seconds, but that number isn't particularly well rationalized. We do not raise the watermark eagerly, since there is currently no need to do so, and keeping it as far back as possible retains the maximum flexibility. But there are certain future use-cases (e.g. Naiad) that have the opposing incentive: Naiad wants the watermark raised as aggressively as possible, since the watermark duration delays materialization of materialized views until the watermark has passed on all ranges. We could add a debug flag to show empirically where our watermarks actually are - do you think that information would be useful to know?
The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.

> Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Replicas will just refuse what they can't serve, and if we use the eager path as Ben suggested you should "usually" not get refused, assuming the timestamp you chose should be safe based on your local HLC and MaxOffset. For example, if a request ends up at the lease holder but is one that should be safe from the follower (and perhaps was tried there first), the lease holder will, for the next X seconds, eagerly bump `max_write_timestamp` so that followers can serve such reads.

We need to somehow limit the amount of proposals we're sending due to this. If you have a large number of ranges and very little write traffic, bumping `max_write_timestamp` on all of them creates a steady stream of Raft traffic.
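To make the refuse-or-serve check concrete, here is a minimal sketch of the idea under discussion. The type and method names (`followerState`, `Bump`, `CanServe`) are made up for illustration and are not the actual CockroachDB code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// followerState is a hypothetical per-replica view of the highest timestamp
// at or below which no new writes will be accepted (the "max_write_timestamp"
// discussed above). The lease holder advances it and the new value reaches
// followers via replication.
type followerState struct {
	mu                sync.Mutex
	maxWriteTimestamp time.Time
}

// Bump is invoked when the replica learns that the lease holder promised not
// to accept writes at or below ts.
func (f *followerState) Bump(ts time.Time) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if ts.After(f.maxWriteTimestamp) {
		f.maxWriteTimestamp = ts
	}
}

// CanServe reports whether a read at readTS can be served locally. Reads at
// or below max_write_timestamp are safe; anything newer is refused and must
// be retried at the lease holder.
func (f *followerState) CanServe(readTS time.Time) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return !readTS.After(f.maxWriteTimestamp)
}

func main() {
	var f followerState
	f.Bump(time.Now().Add(-5 * time.Second)) // lease holder closed out writes older than ~5s
	fmt.Println(f.CanServe(time.Now().Add(-10 * time.Second))) // true: older than the promise
	fmt.Println(f.CanServe(time.Now()))                        // false: refuse, go to the lease holder
}
```

The "eager path" above would amount to calling `Bump` more aggressively on ranges that recently saw follower-read attempts.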
Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.
That's true if all ranges are receiving read traffic via multiple replicas, but that's not necessarily the case. There's a hierarchy of range activity:
@spencerkimball, curious to hear your proposal on this.
> Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.
We can fall back to the leaseholder, but occasionally needing to fall back adds a lot of variability in response time, which in turn makes it tough for apps to deliver reliable response times. It's great for averages, but the tail behavior could make the feature unusable for certain customers.
FYI (in case you missed it),
Do we need to do this with a dummy write? Perhaps we could also advance it as part of the quiesce messages? This seems to be the biggest hole in this approach.
We need to guarantee that any future lease holder will know about this timestamp. Writes accomplish this since they are replicated via the raft log. Quiesce messages do not since there is no guarantee that they will reach a quorum. We could poll a quorum without going all the way through the raft log, but that would need to be a new mechanism instead of piggybacking on quiescence.
The Storage-Level Change Feed Primitive has strong connections with follower reads. But in particular, there is one requirement that hasn't been discussed here (quote below by @bdarnell):
The interesting case here is that in which a timestamp is made available for follower reads, but there is still an intent visible at that timestamp. This is fine for follower reads, though a bit awkward: intent resolution must be carried out, and that takes time. Avoiding this situation would (mostly) create parity with the needs of the change feed primitive, but it can be awkward since we won't be able to raise the safe timestamp until all intents are gone, and that is hard to accomplish, seeing that we don't know where the intents are.
Just wanted to chip in with a bit of encouragement. The option to read from non-lease holders is super useful - please keep this up.
Implement the minimum proposal timestamp tracker outlined in the [follower reads RFC]. The implementation will likely need follow-up work to reduce lock contention. The intended usage will place a call to Track and to the returned closure into the write path. Touches cockroachdb#16593. Release note: None [follower reads RFC]: cockroachdb#26362
26941: storage: add min proposal timestamp tracker r=nvanbenschoten a=tschottdorf This extracts the code for the min proposal timestamp used in the prototype for follower reads into a new package with associated testing and commentary. Touches #16593. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
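As a rough illustration of the Track-plus-closure pattern these commits describe, here is a simplified tracker sketch. The names and signatures are hypothetical; the real minprop package handles epochs, batching, and lock contention quite differently:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker is a toy "minimum proposal timestamp" tracker: the closed timestamp
// may only advance past timestamps for which no write proposal is in flight.
type tracker struct {
	mu       sync.Mutex
	inFlight map[int64]int // proposal timestamp (unix nanos) -> refcount
}

func newTracker() *tracker {
	return &tracker{inFlight: make(map[int64]int)}
}

// Track registers an in-flight write proposal at ts and returns a closure the
// write path must call once the proposal has been evaluated.
func (t *tracker) Track(ts time.Time) (done func()) {
	key := ts.UnixNano()
	t.mu.Lock()
	t.inFlight[key]++
	t.mu.Unlock()
	return func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		t.inFlight[key]--
		if t.inFlight[key] == 0 {
			delete(t.inFlight, key)
		}
	}
}

// ClosedTimestamp returns a timestamp strictly below every tracked proposal
// (and below target if nothing is in flight).
func (t *tracker) ClosedTimestamp(target time.Time) time.Time {
	t.mu.Lock()
	defer t.mu.Unlock()
	min := target.UnixNano()
	for ts := range t.inFlight {
		if ts < min {
			min = ts
		}
	}
	return time.Unix(0, min-1)
}

func main() {
	tr := newTracker()
	done := tr.Track(time.Now())                // write path: register an in-flight proposal
	fmt.Println(tr.ClosedTimestamp(time.Now())) // held back just below the in-flight write
	done()                                      // proposal evaluated; release it
	fmt.Println(tr.ClosedTimestamp(time.Now())) // can now advance close to the target
}
```

The key invariant is that the closed timestamp never advances past a timestamp for which a tracked write is still outstanding.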
This "finishes" hooking up the storage side of closed timestamps. After checking their lease and deciding that the lease does not allow serving a read, replicas now check whether they can serve the batch as a follower read. This requires that the range is epoch based, that appropriate information is stored in the closed timestamp subsystem, and finally that the cluster setting to enable this is set. Added a test that verifies that a test server will serve follower reads (directly from the replicas, without routing through DistSender). Introducing machinery at the distributed sender to actually consider routing reads to follower replicas is the next step. TODO: take perf numbers before/after this change to verify that there isn't a noticeable regression. Touches cockroachdb#16593. Release note: None
This "finishes" hooking up the storage side of closed timestamps. After checking their lease and deciding that the lease does not allow serving a read, replicas now check whether they can serve the batch as a follower read. This requires that the range is epoch based, that appropriate information is stored in the closed timestamp subsystem, and finally that the cluster setting to enable this is set. Added a test that verifies that a test server will serve follower reads (directly from the replicas, without routing through DistSender). Introducing machinery at the distributed sender to actually consider routing reads to follower replicas is the next step. TODO: take perf numbers before/after this change to verify that there isn't a noticeable regression. Touches cockroachdb#16593. Release note: None
Zendesk ticket #2720 has been linked to this issue.
This "finishes" hooking up the storage side of closed timestamps. After checking their lease and deciding that the lease does not allow serving a read, replicas now check whether they can serve the batch as a follower read. This requires that the range is epoch based, that appropriate information is stored in the closed timestamp subsystem, and finally that the cluster setting to enable this is set. Added a test that verifies that a test server will serve follower reads (directly from the replicas, without routing through DistSender). Introducing machinery at the distributed sender to actually consider routing reads to follower replicas is the next step. TODO: take perf numbers before/after this change to verify that there isn't a noticeable regression. Touches cockroachdb#16593. Release note: None
This "finishes" hooking up the storage side of closed timestamps. After checking their lease and deciding that the lease does not allow serving a read, replicas now check whether they can serve the batch as a follower read. This requires that the range is epoch based, that appropriate information is stored in the closed timestamp subsystem, and finally that the cluster setting to enable this is set. Added a test that verifies that a test server will serve follower reads (directly from the replicas, without routing through DistSender). Introducing machinery at the distributed sender to actually consider routing reads to follower replicas is the next step. TODO: take perf numbers before/after this change to verify that there isn't a noticeable regression. Touches cockroachdb#16593. Release note: None
28091: storage: serve reads based on closed timestamps r=nvanbenschoten a=tschottdorf This "finishes" hooking up the storage side of closed timestamps. After checking their lease and deciding that the lease does not allow serving a read, replicas now check whether they can serve the batch as a follower read. This requires that the range is epoch based, that appropriate information is stored in the closed timestamp subsystem, and finally that the cluster setting to enable this is set. Added a test that verifies that a test server will serve follower reads (directly from the replicas, without routing through DistSender). Introducing machinery at the distributed sender to actually consider routing reads to follower replicas is the next step. TODO: take perf numbers before/after this change to verify that there isn't a noticeable regression. Touches #16593. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
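In outline, the decision this PR describes looks roughly like the following. The types and fields here (`batch`, `replica`, `maxClosedTimestamp`, and so on) are illustrative stand-ins rather than the real storage package API:

```go
package main

import "fmt"

// batch is a stand-in for a request that a replica must decide how to serve.
type batch struct {
	Timestamp int64 // request timestamp (nanos)
	ReadOnly  bool
}

// replica holds the pieces of state the checks below depend on.
type replica struct {
	leaseCoversTimestamp func(ts int64) bool // does our own lease let us serve this?
	leaseIsEpochBased    bool
	followerReadsEnabled bool  // cluster setting
	maxClosedTimestamp   int64 // from the closed timestamp subsystem
}

// canServe mirrors the order of checks described above: first the lease
// check, and only if that fails, the follower-read check.
func (r *replica) canServe(b batch) bool {
	if r.leaseCoversTimestamp(b.Timestamp) {
		return true // ordinary lease holder path
	}
	if !b.ReadOnly || !r.followerReadsEnabled || !r.leaseIsEpochBased {
		return false // must redirect to the lease holder
	}
	// Follower read: only allowed at or below the closed timestamp.
	return b.Timestamp <= r.maxClosedTimestamp
}

func main() {
	r := &replica{
		leaseCoversTimestamp: func(int64) bool { return false }, // we are a follower
		leaseIsEpochBased:    true,
		followerReadsEnabled: true,
		maxClosedTimestamp:   1000,
	}
	fmt.Println(r.canServe(batch{Timestamp: 900, ReadOnly: true}))  // true: follower read
	fmt.Println(r.canServe(batch{Timestamp: 1100, ReadOnly: true})) // false: too new
}
```

As the PR text notes, routing is untouched at this point: DistSender still sends to the lease holder, so follower replicas only see this traffic once the next step lands.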
33474: docs: add RFC to expose follower reads to clients r=ajwerner a=ajwerner Relates to #16593. Release note: None 34399: storage: fix NPE while printing trivial truncateDecision r=bdarnell a=tbg Fixes #34398. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Tobias Schottdorf <[email protected]>
33478: sql,kv,followerreadsccl: enable follower reads for historical queries r=ajwerner a=ajwerner Follower reads are reads which can be served from any replica as opposed to just the current lease holder. The foundation for this change was laid with the work to introduce closed timestamps and to support follower reads at the replica level. This change adds the required support to the sql and kv layers and additionally exposes new syntax to ease client adoption of the functionality. The change adds the followerreadsccl package with logic to check when follower reads are safe and to inject the functionality so that it can be packaged as an enterprise feature. It modifies AS OF SYSTEM TIME semantics to allow for the evaluation of a new builtin tentatively called follower_read_timestamp() in addition to constant expressions. This new builtin ensures that an enterprise license exists and then returns a time that can likely be used to read from a follower. The change abstracts (and renames to the more appropriate replicaoracle) the existing leaseHolderOracle in the distsqlplan package to allow a follower-read-aware policy to be injected. Lastly, the change adds to kv a site to inject a function for checking if follower reads are safe and allowed given a cluster, settings, and batch request. This change includes a high-level roachtest which validates observable behavior of performing follower reads by examining latencies for reads in a geo-replicated setting. Implements #33474 Fixes #16593 Release note (enterprise change): Add support for performing sufficiently old historical reads against closest replicas rather than leaseholders. A new builtin function follower_read_timestamp() can be used with AS OF SYSTEM TIME clauses to generate a timestamp which is likely to be safe for reads from a follower. Co-authored-by: Andrew Werner <[email protected]>
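From a client's perspective, the feature described above is exercised with an ordinary SQL query. A sketch in Go using database/sql follows; the connection string, the accounts table, and the choice of the lib/pq driver are assumptions for illustration, and per the PR the builtin requires an enterprise license:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Connection string and table are placeholders for this sketch.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Read at a timestamp old enough to (likely) be served by a nearby
	// follower instead of the lease holder.
	rows, err := db.Query(
		`SELECT id, name FROM accounts AS OF SYSTEM TIME follower_read_timestamp()`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```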
I ended up thinking about this tonight due to a related problem, so here are some notes. The difficulty is making this zone configurable. Might've missed something.

Goals

- Serve sufficiently old reads from any replica (think: analytics, time travel queries, backups, queries that can't or don't need to pay the latency to a far-away lease holder).
Sketch of implementation

- Add a field `max_write_age` to the zone configs (a value of zero behaves like `MaxUint64`). The idea is that the timestamp caches of the affected ranges have a low watermark that does not trail `(now - max_write_age)`. Note that this effectively limits how long transactions can write to approximately `max_write_age`. In turn, when running a read-only transaction, once the current HLC timestamp has passed `read_timestamp + max_write_age + max_offset`, any replica can serve reads.
- Add `max_write_age` to the lease proto. `max_write_age` is populated with the value the proposer believes is current.
- Lease holders enforce `max_write_age`. If a lease holder realizes that the ZoneConfig's `max_write_age` has changed, it must request a new lease (in practice, it only has to do this in case `max_write_age` increases) and let the old one expire (or transfer its lease away). There is room for optimization here: the replica could extend the lease with the new `max_write_age`, but all members must enforce the smaller `max_write_age`s for as long as the "old" version is not expired.
- `DistSender` learns about `max_write_age`. When considering a read-only `BatchRequest` with a timestamp eligible for a follower-served read, it considers followers, prioritizing those in close proximity.
- A replica receiving such a request checks whether a lease is active (not whether it holds the lease itself). If not, it behaves as it would today (requests the lease). Otherwise, if it is not the lease holder, it checks if the batch timestamp is eligible for a follower-served read based on the information in the lease and the current timestamp. If so, it serves it (it does not need to update the timestamp cache).
- On the write path, when evaluating a write at `write_ts` for which `now - max_write_age < write_ts` does not hold, behave as if there were a timestamp cache entry at `now`. (A sketch of this guard and of the read-eligibility rule from the first bullet follows this list.)
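As referenced in the list above, here is a small sketch of the read-eligibility check and the write-path guard. The function names are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// followerReadEligible illustrates the rule from the first bullet: a read at
// readTS may be served by any replica once the current HLC time has passed
// readTS + max_write_age + max_offset.
func followerReadEligible(now, readTS time.Time, maxWriteAge, maxOffset time.Duration) bool {
	return now.After(readTS.Add(maxWriteAge).Add(maxOffset))
}

// effectiveTimestampCacheLowWater illustrates the last bullet: writes below
// now - max_write_age are treated as if the timestamp cache had an entry at
// now, forcing them to a higher timestamp.
func effectiveTimestampCacheLowWater(now time.Time, maxWriteAge time.Duration) time.Time {
	return now.Add(-maxWriteAge)
}

func main() {
	const (
		maxWriteAge = 10 * time.Second
		maxOffset   = 500 * time.Millisecond
	)
	now := time.Now()

	oldRead := now.Add(-15 * time.Second)
	freshRead := now.Add(-2 * time.Second)
	fmt.Println(followerReadEligible(now, oldRead, maxWriteAge, maxOffset))   // true: any replica
	fmt.Println(followerReadEligible(now, freshRead, maxWriteAge, maxOffset)) // false: lease holder only

	writeTS := now.Add(-12 * time.Second)
	if writeTS.Before(effectiveTimestampCacheLowWater(now, maxWriteAge)) {
		fmt.Println("write would be pushed above", now) // behaves like a ts cache entry at now
	}
}
```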
An interesting observation is that this can also be modified to allow serving read queries when Raft completely breaks down (think all inter-DC connections fail): a replica can always serve what is "safe" based on the last known lease. There is much more work to do to get these replicas to agree on a timestamp, though. The resulting syntax could be something along the lines of `SELECT (...) AS OF SYSTEM TIME STALE`, and `DistSender` would consult its cache to find the minimal timestamp covered by all leases (but even that timestamp may not work).
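A sketch of how a `DistSender`-style cache could derive such a minimal covered timestamp; the `cachedLease` shape and the `maxServable` field are invented for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// cachedLease is a stand-in for what a DistSender-style cache might remember
// about each range's last known lease; maxServable is the newest timestamp
// that lease provably allows to be served.
type cachedLease struct {
	RangeID     int
	maxServable time.Time
}

// minCoveredTimestamp returns the largest timestamp that every cached lease
// still covers, i.e. a candidate timestamp for a "STALE" read touching all of
// these ranges. ok is false if the cache is empty.
func minCoveredTimestamp(leases []cachedLease) (ts time.Time, ok bool) {
	if len(leases) == 0 {
		return time.Time{}, false
	}
	min := leases[0].maxServable
	for _, l := range leases[1:] {
		if l.maxServable.Before(min) {
			min = l.maxServable
		}
	}
	return min, true
}

func main() {
	now := time.Now()
	leases := []cachedLease{
		{RangeID: 1, maxServable: now.Add(-8 * time.Second)},
		{RangeID: 2, maxServable: now.Add(-12 * time.Second)},
		{RangeID: 3, maxServable: now.Add(-9 * time.Second)},
	}
	if ts, ok := minCoveredTimestamp(leases); ok {
		fmt.Println("read everything AS OF SYSTEM TIME", ts) // limited by range 2
	}
}
```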
Caveats
- The write path must not get stalled in inconvenient locations (such a stall would violate MaxOffset too, but would be very unlikely to be caught): if a write passes the check but then gets delayed until it doesn't hold any more, followers may serve reads that are then invalidated once the write proceeds. (This does not seem more fragile than what we already have with our read leases, though.)
- After a ZoneConfig change, it can take until the old lease expires before the new `max_write_age` will be in effect.