RFCs: Max safe timestamp for follower reads and change feeds #19222
so excited to see this get started! Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):
when i talked to alex/nathan about this yesterday, i was a bit confused about what "no new writes" meant. as i recall from the explanation, it means even intents won't be resolved on that replica (and why this is important), which is not what i could have guessed from this phrasing. this could be that i'm not up on my core terminology, but making it more explicit couldn't hurt update: ah, i see this is explained in more detail below. maybe move a bit of that context up here docs/RFCS/00000000_max_safe_timestamp.md, line 49 at r1 (raw file):
can you also include how max clock offset plays into the below? docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file):
what do we imagine to be the range of typical values for max_write_age? docs/RFCS/00000000_max_safe_timestamp.md, line 151 at r1 (raw file):
not sure it would be enough to make any of these decisions differently, but I just want to make sure y'all know about Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed. docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Yeah, that's the important distinction that we're trying to draw here. "No new writes" in the sense of follower reads means that no transactions can be committed with timestamps lower than some timestamp threshold. This means that if a scan occurs at the timestamp threshold, the only intents that can be seen will be those for already committed transactions (which can be cleaned up by the scan, if necessary). Since a follower won't be updating the Closing out a timestamp as is needed by CDC is a superset of this requirement that also requires that all intents for transactions, even those already committed, must also be resolved by that timestamp threshold. This means that if a scan occurs at the timestamp threshold, it will never see any intents. The reason we think these should be decoupled is that the latter requires significantly more bookkeeping to guarantee. Based on past discussions with Alex and Toby, I don't think it's possible to provide the second guarantee efficiently without retaining some kind of in-memory or persistent data structure that tracks pending intents. However, as the rest of this RFC describes, the first guarantee can be provided without nearly as much bookkeeping. In fact, other than the dummy writes (which if you squint are lease renewals), the first guarantee can be provided without any real performance concern. One of our focuses throughout this discussion was to avoid paying for what isn't being used. If a range isn't using follower reads, it shouldn't pay the cost of these dummy writes. Likewise, if a range isn't monitoring a change feed, it shouldn't pay for the cost of tracking intents. By splitting docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
It really depends on the use cases we want to support. 10 seconds has been thrown around in the past, but I think that was just a holdover from arbitrary limitations we previously placed on the docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file):
If the transaction associated with these intents has a timestamp below docs/RFCS/00000000_max_safe_timestamp.md, line 83 at r1 (raw file):
Make a note of why this isn't an issue if the range is seeing any writes somewhere. docs/RFCS/00000000_max_safe_timestamp.md, line 99 at r1 (raw file):
nit: docs/RFCS/00000000_max_safe_timestamp.md, line 106 at r1 (raw file):
nit: These numbers got messed up. docs/RFCS/00000000_max_safe_timestamp.md, line 114 at r1 (raw file):
This also prevents ranges from quiescing. Comments from Reviewable |
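The two guarantees distinguished in the line 31 thread above can be sketched as two predicates over the intents a scan might encounter. This is only an illustration in Python, not CockroachDB code: `Intent`, `TxnStatus`, `safe_for_follower_reads`, and `closed_out` are hypothetical names, timestamps are plain integers, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class TxnStatus(Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    ABORTED = "aborted"


@dataclass(frozen=True)
class Intent:
    key: str
    timestamp: int      # hypothetical integer timestamp
    status: TxnStatus   # status of the transaction that wrote the intent


def safe_for_follower_reads(intents, threshold):
    """Weaker guarantee: every intent at or below the threshold belongs to a
    transaction that is no longer pending, so nothing new can commit there."""
    return all(
        i.status is not TxnStatus.PENDING
        for i in intents
        if i.timestamp <= threshold
    )


def closed_out(intents, threshold):
    """Stronger guarantee (the CDC case): no intents at all remain at or
    below the threshold, so a scan there never sees one."""
    return not any(i.timestamp <= threshold for i in intents)
```

With one committed-but-unresolved intent at timestamp 5 and a pending one at 12, a threshold of 10 satisfies the first predicate but not the second, which is the gap in bookkeeping cost the comment describes.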
189efe9 to a5a9d55
Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed. docs/RFCS/00000000_max_safe_timestamp.md, line 49 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I don't think it needs to play into the below, actually. At least not as far as follower reads are concerned. All timestamps are going through a single leaseholder except when the lease changes hands, so as long as we consider I've added a new bullet point about this. I'd love to be corrected now, though, if you can think of something I haven't. docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'd expect we'll put it in the low tens of seconds by default. Added a little note to the text. docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Improved the wording. docs/RFCS/00000000_max_safe_timestamp.md, line 83 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. docs/RFCS/00000000_max_safe_timestamp.md, line 99 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. docs/RFCS/00000000_max_safe_timestamp.md, line 106 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did they? They still rendered fine. I've cleaned them up so they look better in the non-rendered version, though. docs/RFCS/00000000_max_safe_timestamp.md, line 114 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. docs/RFCS/00000000_max_safe_timestamp.md, line 151 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Yeah, thanks for pointing that out. I had actually been wondering about whether something like that was possible when thinking things through. It operates on all of the store's data, though, whereas change feeds may only be operating on a small fraction of a store's ranges. It'd also be tough to tune how frequently we ran it, needing to balance timeliness of close notifications with the cost of running it. Added to the alternatives section. Comments from Reviewable |
Excited as well! Reviewed 1 of 1 files at r2. docs/RFCS/00000000_max_safe_timestamp.md, line 49 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I agree with Alex that it doesn't come into play. docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I could imagine this being lowered to low single digit seconds if you're using materialized views heavily, so the design shouldn't be so heavyweight as to preclude that setting. Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful. docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):
"already finalized transactions", right? The txns can be committed or aborted as long as they're not pending. docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file): Previously, arjunravinarayan (Arjun Narayan) wrote…
Or if you're using follower reads, not just materialized views. I think setting this down to a few seconds will be common (if we can support it) docs/RFCS/00000000_max_safe_timestamp.md, line 51 at r2 (raw file):
Does "observed" mean "appended to the raft log" or "applied"? docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file):
What do you mean by "lingering"? The write timestamp won't change, so is this referring to real time? docs/RFCS/00000000_max_safe_timestamp.md, line 74 at r2 (raw file):
How are the transactions aborted? The transaction record may live on another range; how is this synchronization managed? Comments from Reviewable |
options here:

1. Have the leaseholder periodically propose an empty command to raft with the
   current timestamp if no other writes have come through lately.
There could be another option to have the leaseholder periodically propose an empty command to raft with a future timestamp and disallow writes to that range until the future timestamp. This is like giving out a lease to the followers until a certain timestamp to allow reads at the current timestamp from followers at the expense of slowing down writes considerably
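Option 1 from the quoted RFC text (an empty command proposed only when the range has been idle) can be sketched roughly as follows. This is a toy model in Python, not the actual implementation: `IdleRangeHeartbeat`, `idle_interval`, and the in-memory `log` standing in for raft proposals are all assumptions made for illustration.

```python
class IdleRangeHeartbeat:
    """Toy leaseholder that proposes an empty command when no write has
    been proposed for at least `idle_interval` time units."""

    def __init__(self, idle_interval):
        self.idle_interval = idle_interval
        self.last_proposal = None   # time of the most recent proposal, if any
        self.log = []               # stand-in for the raft log: (kind, timestamp)

    def record_write(self, now):
        # A real write already carries a fresh timestamp, so it resets the timer.
        self.log.append(("write", now))
        self.last_proposal = now

    def tick(self, now):
        """Propose an empty command if the range has been idle; return whether one was sent."""
        idle = (
            self.last_proposal is None
            or now - self.last_proposal >= self.idle_interval
        )
        if idle:
            self.log.append(("empty", now))
            self.last_proposal = now
        return idle
```

A range with steady write traffic never sends the empty command in this model, which matches the goal of not paying for the mechanism when real writes already advance the timestamp.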
5a4ad7b to 95aa53e
After having a pretty interesting discussion with @andreimatei and @bdarnell about this forum post, I decided to add a new alternative. The idea is only half serious but provides a nice generalization of |
Didn't give the CDC portion a close read yet because I had some basic questions about the first part. Reviewed 1 of 1 files at r3. docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Nathan, why do they have to be aborted? I agree that they should (because we don't want to wait). If they have to we'd have a problem because the intent may as well have been committed. Side note: when serving reads from many followers, this will likely mean multiple (x #nodes) racing intent resolutions (we deduplicate somewhat in the intent resolver, but only on each node). May not be a huge problem, but we should definitely try to avoid intents there in the first place. An automatically managed backoff would be interesting. Whenever an intent is hit, the lease holder adjusts (doubles?) the gap to the write frontier. docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Unrelated, but this might be a better alternative than waiting for the reproposal (or triggering a reproposal of potentially many old in-flight things) in general. Currently we repropose wildly, but we could send a "canary" which is potentially cheaper. We also don't currently actively invalidate commands once we've observed a higher lease applied index (i.e. if we proposed 3 a while ago but then we see 4 apply, we don't eagerly notify 3, but we probably should, and can gate the check on observing a gap). Might be worth filing, though hopefully we don't need it in practice. Related: I'm not sure I understand how this mechanism is supposed to work:
So essentially correctness seems to hinge on being able to control how much time passes between proposal and application, or to stop accepting writes. That seems obviously broken so I must be misunderstanding something fundamental. I pictured that the leaseholder would piggyback a safe timestamp on its proposals (or just send empty proposals when it needs to) that basically carry docs/RFCS/00000000_max_safe_timestamp.md, line 98 at r2 (raw file): Previously, vivekmenezes wrote…
I don't think we should ever have this mechanism interfere with write traffic. In fact, that should be one of the goals stated in the RFC. docs/RFCS/00000000_max_safe_timestamp.md, line 84 at r3 (raw file):
Just pointing out that you should try to consider (and perhaps accommodate, if possible) the case docs/RFCS/00000000_max_safe_timestamp.md, line 108 at r3 (raw file):
It seems that you should be able to remove Comments from Reviewable |
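The automatically managed backoff floated in the line 73 thread above (widen the gap to the write frontier whenever a follower read hits an intent) might look something like this. It is a sketch under assumptions: `FollowerReadGap` is a hypothetical name, the doubling factor is the one tentatively suggested in the comment, and the `max_gap` cap is an addition for illustration that the thread does not mention.

```python
class FollowerReadGap:
    """Tracks how far behind the write frontier follower reads are served."""

    def __init__(self, initial_gap, max_gap):
        self.gap = initial_gap
        self.max_gap = max_gap   # assumption: an upper bound so the gap cannot grow forever

    def on_intent_hit(self):
        # A follower read ran into an intent: back off by doubling the gap.
        self.gap = min(self.gap * 2, self.max_gap)

    def safe_timestamp(self, now):
        # Reads at or below this timestamp are the ones offered to followers.
        return now - self.gap
```

The thread does not say how the gap would shrink again once intents stop being hit; a real policy would presumably need some decay, which is left out here.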
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. docs/RFCS/00000000_max_safe_timestamp.md, line 238 at r3 (raw file):
Alex just pointed out to me that the tscache is no longer a per-range structure - it's per store. So we should qualify here that we'd be sending the tscache entries overlapping the range in question. docs/RFCS/00000000_max_safe_timestamp.md, line 267 at r3 (raw file):
We've discussed more than is written here, and I think it'd be cool to write all of it up: a generalization of the generalization is that followers' requests don't always (usually?) need to wait for a Raft proposal. The follower needs the leaseholder's tscache to be updated, and it needs to make sure that it is up to date enough with the Raft log to serve the read. So, it could tell the leaseholder what its current raft index is; the leaseholder's tscache could be augmented with a raft index number for each element -> meaning what index a follower needs to be able to serve reads that fall within that tscache entry. Now, when the leaseholder gets a request from a follower, it will:
If the requests from followers go through the command queue (to synchronize with in-flight writes), then I think no raft proposal is necessary. Comments from Reviewable |
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ah, silly me. The steady stream of writes would immediately surpass the straggler's lease applied index. I'm still not sure this mechanism is easy to use for the CDC use case, but I understand how the basic version here checks out. Comments from Reviewable |
95aa53e to e6bd937
TFTRs! Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions. docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It seems possible to run into an intent for a transaction that hasn't been committed or aborted, but the transaction would be abort-able at that point. docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Right, I don't believe that they'd necessarily have to be aborted. docs/RFCS/00000000_max_safe_timestamp.md, line 51 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Great question. I think that appended to the raft log is sufficient. docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yeah, "lingering" was referring to real time. I've tried to clarify this, since it was definitely a bit terse. docs/RFCS/00000000_max_safe_timestamp.md, line 74 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Why would any synchronization be needed beyond the normal synchronous intent resolution that gets done for reads? docs/RFCS/00000000_max_safe_timestamp.md, line 98 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yeah, I don't know when someone would want to make that extreme of a tradeoff. Added a short list of goals, including not interfering with writes. docs/RFCS/00000000_max_safe_timestamp.md, line 84 at r3 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I'm a bit out of touch with our clockless reads. Do you mind explaining how clockless mode would come into play given that we're basing things on lease expiration timestamps? docs/RFCS/00000000_max_safe_timestamp.md, line 108 at r3 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Good point, changed. docs/RFCS/00000000_max_safe_timestamp.md, line 238 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Done. docs/RFCS/00000000_max_safe_timestamp.md, line 269 at r3 (raw file):
Sending it through raft probably would be due to all the extra disk writes that would be involved, but sending this stuff through a separate RPC might not be totally crazy. If you'd like to do an experiment to estimate how much bandwidth/serialization cost there'd be, we could consider this more seriously. Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed. docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I think it has to be aborted, not just abortable. We have to know that the transaction will never commit, and I don't think we can know that without actually aborting it (as long as it's pending, the commit could be in flight in the raft log). docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
If a real command gets delayed going into the raft log, what's stopping these dummy commands from doing the same thing? I think we have to rely solely on timestamps that have appeared in the raft log. The "heartbeats" described below to keep docs/RFCS/00000000_max_safe_timestamp.md, line 74 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Ah, so you're talking about the normal post-read intent resolution. That's fine, but it means that this "max safe timestamp" is no longer a guarantee that local reads will be successful - you may have to push the transaction remotely, and after pushing you can't tell when the local replica has caught up with the intent resolution so you have to read remotely as well. I was assuming that we wouldn't be advancing the follower read timestamp until all transactions were finalized and all intents resolved (since that's what CDC will require). We can do this simpler version for follower reads although leaving unresolved intents may mean it performs worse than expected. docs/RFCS/00000000_max_safe_timestamp.md, line 65 at r5 (raw file):
It looks like we require that docs/RFCS/00000000_max_safe_timestamp.md, line 128 at r5 (raw file):
What does this mean for epoch-based leases? Leases have start times; can we use that instead? Comments from Reviewable |
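Ben's point in the line 74 thread above, that with the simpler scheme a read below the max safe timestamp can still be forced off the follower by an unresolved intent, can be sketched as a routing decision. This is an illustrative Python sketch only: `route_read` and its parameters are hypothetical, timestamps are plain integers, and the intents are reduced to a list of their timestamps within the requested span.

```python
def route_read(read_ts, max_safe_ts, intent_timestamps):
    """Decide where a read at `read_ts` should be served.

    `intent_timestamps` holds the timestamps of unresolved intents the
    follower sees in the requested key span (hypothetical representation).
    """
    if read_ts > max_safe_ts:
        # Above the max safe timestamp the follower cannot rule out new writes.
        return "leaseholder"
    if any(ts <= read_ts for ts in intent_timestamps):
        # An intent is visible to this read. The transaction would have to be
        # pushed remotely, and the follower cannot tell when it has caught up
        # with the resolution, so the read goes remote as well.
        return "leaseholder"
    return "local"
```

Under the stronger scheme Ben describes, where the timestamp only advances once all intents are resolved, the second branch would never be taken.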
This RFC is incomplete. It is merged in `draft` status. This RFC proposes a mechanism for subscribing to changes to a set of key ranges starting at an initial timestamp. It is for use by Cockroach-internal higher-level systems such as distributed SQL, change data capture, or a Kafka producer endpoint. A "sister RFC" detailing these higher-level systems is in cockroachdb#17535. See also cockroachdb#19222 for a related follower-reads RFC.
Mostly just a more verbose transcription of the notes I took during
my discussion with Nathan yesterday. Still a WIP compared to the
expected format for RFCs.
Really just a companion document to #17535 and #16838, but sending out for feedback.
@nvanbenschoten @tschottdorf @bdarnell @petermattis