
RFCs: Max safe timestamp for follower reads and change feeds #19222

Closed
wants to merge 2 commits

Conversation

a-robinson
Contributor

Mostly just a more verbose transcription of the notes I took during
my discussion with Nathan yesterday. Still a WIP compared to the
expected format for RFCs.

Really just a companion document to #17535 and #16838, but sending out for feedback.

@nvanbenschoten @tschottdorf @bdarnell @petermattis

@a-robinson a-robinson requested a review from a team as a code owner October 12, 2017 18:33

@danhhz
Contributor

danhhz commented Oct 12, 2017

so excited to see this get started!


Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):

non-leaseholder replicas need to know the range of timestamps at which it is
safe to serve reads without coordinating with the leaseholder, and change feeds
require what are known as "close notifications", which indicate no new writes

when i talked to alex/nathan about this yesterday, i was a bit confused about what "no new writes" meant. as i recall from the explanation, it means even intents won't be resolved on that replica (and why this is important), which is not what i could have guessed from this phrasing. this could be that i'm not up on my core terminology, but making it more explicit couldn't hurt

update: ah, i see this is explained in more detail below. maybe move a bit of that context up here


docs/RFCS/00000000_max_safe_timestamp.md, line 49 at r1 (raw file):

the
issue](https://github.com/cockroachdb/cockroach/issues/16593#issuecomment-309549461).
The high-level breakdown of how it works is:

can you also include how max clock offset plays into the below?


docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file):

* The leaseholder will also track the `max_proposed_write_timestamp`.
* The leaseholder promises not to propose a write with timestamp less than
  `max_proposed_write_timestamp - max_write_age`, where `max_write_age` is a

what do we imagine to be the range of typical values for max_write_age?


docs/RFCS/00000000_max_safe_timestamp.md, line 151 at r1 (raw file):

* We can reuse the `max_write_timestamp` tracking from follower reads, but more
  logic is needed to ensure no old intents are left unresolved.
* We have to track unresolved intents somewhere - scanning the entire range to

not sure it would be enough to make any of these decisions differently, but I just want to make sure y'all know about engine.NewTimeBoundIterator which is given a range of our mvcc timestamps and uses that to skip any sstables that don't contain those times (we store the mvcc timestamps contained in an sstable's metadata every time we write one). these intents would be very new, which maximizes the chance that you'd be scanning only small sstables
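To make the sstable-skipping idea concrete, here's a toy Go sketch. This is not CockroachDB's actual `engine.NewTimeBoundIterator` code; `sstableMeta` and `tablesToScan` are invented names, and the assumption is simply that each sstable's metadata records the min/max MVCC timestamps it contains:

```go
package main

import "fmt"

// hlcTimestamp is a simplified stand-in for an HLC timestamp (wall time
// only; the real type also carries a logical component).
type hlcTimestamp int64

// sstableMeta mimics the per-sstable metadata that records the range of
// MVCC timestamps the table contains.
type sstableMeta struct {
	name         string
	minTS, maxTS hlcTimestamp
}

// tablesToScan returns only the sstables whose timestamp range overlaps
// the queried window [lo, hi] — the core idea behind a time-bound
// iterator: tables entirely outside the window are never opened.
func tablesToScan(tables []sstableMeta, lo, hi hlcTimestamp) []sstableMeta {
	var out []sstableMeta
	for _, t := range tables {
		if t.maxTS >= lo && t.minTS <= hi {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	tables := []sstableMeta{
		{"old-big", 0, 100},
		{"recent-small", 95, 120},
		{"newest", 118, 130},
	}
	// Scanning for very recent intents only touches the small, new tables.
	for _, t := range tablesToScan(tables, 119, 130) {
		fmt.Println(t.name) // prints "recent-small" then "newest"
	}
}
```

Since recent intents live in recently written (hence small) sstables, the scan cost tracks the volume of recent writes rather than the size of the range.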


Comments from Reviewable

@nvanbenschoten
Member

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

when i talked to alex/nathan about this yesterday, i was a bit confused about what "no new writes" meant. as i recall from the explanation, it means even intents won't be resolved on that replica (and why this is important), which is not what i could have guessed from this phrasing. this could be that i'm not up on my core terminology, but making it more explicit couldn't hurt

update: ah, i see this is explained in more detail below. maybe move a bit of that context up here

Yeah, that's the important distinction that we're trying to draw here. "No new writes" in the sense of follower reads means that no transactions can be committed with timestamps lower than some timestamp threshold. This means that if a scan occurs at the timestamp threshold, the only intents that can be seen will be those for already committed transactions (which can be cleaned up by the scan, if necessary). Since a follower won't be updating the TimestampCache, this is necessary to provide serializable reads on followers.

Closing out a timestamp as is needed by CDC is a superset of this requirement that also requires that all intents for transactions, even those already committed, must also be resolved by that timestamp threshold. This means that if a scan occurs at the timestamp threshold, it will never see any intents.

The reason we think these should be decoupled is that the latter requires significantly more bookkeeping to guarantee. Based on past discussions with Alex and Toby, I don't think it's possible to provide the second guarantee efficiently without retaining some kind of in-memory or persistent data structure that tracks pending intents. However, as the rest of this RFC describes, the first guarantee can be provided without nearly as much bookkeeping. In fact, other than the dummy writes (which if you squint are lease renewals), the first guarantee can be provided without any real performance concern.

One of our focuses throughout this discussion was to avoid paying for what isn't being used. If a range isn't using follower reads, it shouldn't pay the cost of these dummy writes. Likewise, if a range isn't monitoring a change feed, it shouldn't pay for the cost of tracking intents. By splitting max_safe_timestamp and max_safe_intent_timestamp (or whatever it's called) we can provide follower reads without paying the full cost required by CDC.


docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

what do we imagine to be the range of typical values for max_write_age?

It really depends on the use cases we want to support. 10 seconds has been thrown around in the past, but I think that was just a holdover from arbitrary limitations we previously placed on the TimestampCache. Making this easily configurable is important because I could easily see this ranging from seconds to hours, depending on the workload of an application and the properties it's desiring from Cockroach.


docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file):

    add some additional latency to such reads, but maintains correctness. Any
    intents left by transactions that are still pending could reasonably have
    their transactions aborted at this point.

If the transaction associated with these intents has a timestamp below max_write_timestamp - max_write_age then they necessarily will have to be aborted (or at least pushed to a later timestamp).
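That invariant can be sketched in a few lines of Go (the names and integer timestamps here are invented for illustration, not the real implementation):

```go
package main

import "fmt"

// hlcTimestamp is a simplified stand-in for an HLC timestamp.
type hlcTimestamp int64

// intentMustBePushed reports whether an intent encountered by a follower
// read necessarily has to be pushed to a later timestamp or have its
// transaction aborted: this is the case exactly when the intent's timestamp
// falls below max_write_timestamp - max_write_age, the window the
// leaseholder has promised never to write into again.
func intentMustBePushed(intentTS, maxWriteTS, maxWriteAge hlcTimestamp) bool {
	return intentTS < maxWriteTS-maxWriteAge
}

func main() {
	// With max_write_timestamp=100 and max_write_age=10, the leaseholder
	// has promised not to propose writes below timestamp 90.
	fmt.Println(intentMustBePushed(85, 100, 10)) // true: below the floor, must push/abort
	fmt.Println(intentMustBePushed(95, 100, 10)) // false: may still commit in place
}
```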


docs/RFCS/00000000_max_safe_timestamp.md, line 83 at r1 (raw file):

There are still a couple big questions left.

If no write has been proposed for a while, the followers' `max_write_timestamp`

Make a note of why this isn't an issue if the range is seeing any writes somewhere.


docs/RFCS/00000000_max_safe_timestamp.md, line 99 at r1 (raw file):

extra cautious when it takes over the lease. For example, the previous
leaseholder couldn't have proposed a write with a timestamp greater than
`time.Now() + maxOffset`, so new leaseholders could start out by setting

nit: clock.Now() since we're dealing with HLC timestamps.


docs/RFCS/00000000_max_safe_timestamp.md, line 106 at r1 (raw file):

a few options here:

1. Always enabled. Leaseholders always have to do the work of keeping

nit: These numbers got messed up.


docs/RFCS/00000000_max_safe_timestamp.md, line 114 at r1 (raw file):

Option 1 is a bit of a non-starter due to the extra work required, since we
shouldn't be slowing down users' clusters for features that they aren't using.

This also prevents ranges from quiescing.



@a-robinson
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 49 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

can you also include how max clock offset plays into the below?

I don't think it needs to play into the below, actually. At least not as far as follower reads are concerned. All timestamps are going through a single leaseholder except when the lease changes hands, so as long as we consider maxOffset as described below when a new lease is acquired, the leaseholder can just use HLC timestamps directly when comparing incoming writes with its max_proposed_write_timestamp or when updating its max_proposed_write_timestamp.

I've added a new bullet point about this. I'd love to be corrected now, though, if you can think of something I haven't.


docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It really depends on the use cases we want to support. 10 seconds has been thrown around in the past, but I think that was just a holdover from arbitrary limitations we previously placed on the TimestampCache. Making this easily configurable is important because I could easily see this ranging from seconds to hours, depending on the workload of an application and the properties it's desiring from Cockroach.

I'd expect we'll put it in the low tens of seconds by default. Added a little note to the text.


docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If the transaction associated with these intents has a timestamp below max_write_timestamp - max_write_age then they necessarily will have to be aborted (or at least pushed to a later timestamp).

Improved the wording.


docs/RFCS/00000000_max_safe_timestamp.md, line 83 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Make a note of why this isn't an issue if the range is seeing any writes somewhere.

Done.


docs/RFCS/00000000_max_safe_timestamp.md, line 99 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: clock.Now() since we're dealing with HLC timestamps.

Done.


docs/RFCS/00000000_max_safe_timestamp.md, line 106 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: These numbers got messed up.

Did they? They still rendered fine. I've cleaned them up so they look better in the non-rendered version, though.


docs/RFCS/00000000_max_safe_timestamp.md, line 114 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This also prevents ranges from quiescing.

Done.


docs/RFCS/00000000_max_safe_timestamp.md, line 151 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

not sure it would be enough to make any of these decisions differently, but I just want to make sure y'all know about engine.NewTimeBoundIterator which is given a range of our mvcc timestamps and uses that to skip any sstables that don't contain those times (we store the mvcc timestamps contained in an sstable's metadata every time we write one). these intents would be very new, which maximizes the chance that you'd be scanning only small sstables

Yeah, thanks for pointing that out. I had actually been wondering about whether something like that was possible when thinking things through. It operates on all of the store's data, though, whereas change feeds may only be operating on a small fraction of a store's ranges. It'd also be tough to tune how frequently we ran it, needing to balance timeliness of close notifications with the cost of running it.

Added to the alternatives section.



@rjnn
Contributor

rjnn commented Oct 16, 2017

Excited as well!


Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/00000000_max_safe_timestamp.md, line 49 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I don't think it needs to play into the below, actually. At least not as far as follower reads are concerned. All timestamps are going through a single leaseholder except when the lease changes hands, so as long as we consider maxOffset as described below when a new lease is acquired, the leaseholder can just use HLC timestamps directly when comparing incoming writes with its max_proposed_write_timestamp or when updating its max_proposed_write_timestamp.

I've added a new bullet point about this. I'd love to be corrected now, though, if you can think of something I haven't.

I agree with Alex that it doesn't come into play.


docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'd expect we'll put it in the low tens of seconds by default. Added a little note to the text.

I could imagine this being lowered to low single digit seconds if you're using materialized views heavily, so the design shouldn't be so heavyweight as to preclude that setting.



@bdarnell
Contributor

Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):

the only intents that can be seen will be those for already committed transactions

"already finalized transactions", right? The txns can be committed or aborted as long as they're not pending.


docs/RFCS/00000000_max_safe_timestamp.md, line 55 at r1 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

I could imagine this being lowered to low single digit seconds if you're using materialized views heavily, so the design shouldn't be so heavyweight as to preclude that setting.

Or if you're using follower reads, not just materialized views. I think setting this down to a few seconds will be common (if we can support it)


docs/RFCS/00000000_max_safe_timestamp.md, line 51 at r2 (raw file):

The high-level breakdown of how it works is:

* All replicas track the highest write timestamp they have observed, which

Does "observed" mean "appended to the raft log" or "applied"?


docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file):

    to make sure that writes which have been proposed but have not yet committed
    will fail the `LeaseAppliedIndex` / `MaxLeaseIndex` check if they do
    eventually commit. This would be to prevent a write from lingering for too

What do you mean by "lingering"? The write timestamp won't change, so is this referring to real time?


docs/RFCS/00000000_max_safe_timestamp.md, line 74 at r2 (raw file):

    add some additional latency to such reads, but maintains correctness. Any
    intents left by transactions that are still pending should have their
    transactions aborted at this point.

How are the transactions aborted? The transaction record may live on another range; how is this synchronization managed?



options here:

1. Have the leaseholder periodically propose an empty command to raft with the
current timestamp if no other writes have come through lately.
@vivekmenezes
Contributor

There could be another option: have the leaseholder periodically propose an empty command to raft with a future timestamp and disallow writes to that range until the future timestamp. This is like giving out a lease to the followers until a certain timestamp to allow reads at the current timestamp from followers, at the expense of slowing down writes considerably.

@nvanbenschoten
Member

After having a pretty interesting discussion with @andreimatei and @bdarnell about this forum post, I decided to add a new alternative. The idea is only half serious but provides a nice generalization of max safe timestamp idea that I think gives some valuable insight. It suggests that we could actually just ship the entire TimestampCache to replicas through Raft and these replicas could serve any reads locally that have a timestamp that is equal to or less than the minimum timestamp of all overlapping spans in the TimestampCache.
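A toy Go sketch of the follower-side check under this alternative (the types and function names are invented for illustration, not the actual TimestampCache implementation):

```go
package main

import "fmt"

type hlcTimestamp int64

// span is a simplified key span [start, end).
type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

// tsCacheEntry mimics one entry of a shipped timestamp cache: a key span
// and the timestamp recorded for it on the leaseholder.
type tsCacheEntry struct {
	sp span
	ts hlcTimestamp
}

// canServeFollowerRead reports whether a follower holding a shipped copy of
// the leaseholder's timestamp cache could serve a read locally: the read
// timestamp must be at or below the minimum timestamp of every cache entry
// overlapping the read span.
func canServeFollowerRead(cache []tsCacheEntry, readSpan span, readTS hlcTimestamp) bool {
	for _, e := range cache {
		if overlaps(e.sp, readSpan) && readTS > e.ts {
			return false
		}
	}
	return true
}

func main() {
	cache := []tsCacheEntry{
		{span{"a", "c"}, 100},
		{span{"c", "f"}, 50},
	}
	fmt.Println(canServeFollowerRead(cache, span{"a", "b"}, 90)) // true
	fmt.Println(canServeFollowerRead(cache, span{"b", "d"}, 90)) // false: overlaps ["c","f") at 50
}
```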

@tbg
Member

tbg commented Oct 18, 2017

Didn't give the CDC portion a close read yet because I had some basic questions about the first part.


Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Improved the wording.

Nathan, why do they have to be aborted? I agree that they should be (because we don't want to wait). If they had to be, we'd have a problem, because the intent may as well have been committed.

Side note: when serving reads from many followers, this will likely mean multiple (x #nodes) racing intent resolutions (we deduplicate somewhat in intent resolver, but only on each node). May not be a huge problem, but should definitely try to avoid intents there in the first place.

An automatically managed backoff would be interesting. Whenever an intent is hit, the lease holder adjusts (doubles?) the gap to the write frontier.


docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What do you mean by "lingering"? The write timestamp won't change, so is this referring to real time?

Unrelated, but this might be a better alternative than waiting for the reproposal (or triggering a reproposal of potentially many old in-flight things) in general. Currently we repropose wildly, but we could send a "canary" which is potentially cheaper. We also don't currently actively invalidate commands once we've observed a higher lease applied index (i.e. if we proposed 3 a while ago but then we see 4 apply, we don't eagerly notify 3, but we probably should (and can gate the check on observing a gap)). Might be worth filing, though hopefully we don't need it in practice.

Related: I'm not sure I understand how this mechanism is supposed to work:

  1. leaseholder proposes a write at timestamp 10
  2. write gets dropped but it will be reproposed randomly later
  3. but there's a steady stream of writes that gets through just fine
  4. around 9 seconds later, the leaseholder gets really nervous and tries to propose a new command to "cancel" the inflight
  5. that proposal also gets stuck so nothing happens. Leaseholder can choose: stall all writes, or risk stale reads. Let's do stale reads because example.
  6. at second 10, the followers happily start dishing out "stale" reads
  7. the write finally commits, but basically in violation of the (hypothetical) timestamp cache on the followers

So essentially correctness seems to hinge on being able to control how much time passes between proposal and application, or to stop accepting writes.

That seems obviously broken so I must be misunderstanding something fundamental. I pictured that the leaseholder would piggyback a safe timestamp on its proposals (or just send empty proposals when it needs to) that basically carry max_safe_timestamp. Followers would then allow reads below that timestamp. (And nothing mandates that we'd even need to eat the Raft overhead - it could just be a new RPC). Would be nice if the mechanism here worked because it has a tiny bit less overhead, but having to stall writes seems super dangerous, and would probably happen especially if we want to get really close behind the write frontier (which would be a huge plus I imagine). The "auto-bumping" mechanism also seems at odds with the CDC use case for which the close notification may end up lagging.
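The piggybacking idea could be sketched roughly like this (invented types and names; in practice the safe timestamp would ride on Raft proposals or a dedicated RPC):

```go
package main

import "fmt"

type hlcTimestamp int64

// proposal sketches a Raft command carrying, besides the write itself, a
// piggybacked "safe" timestamp below which the leaseholder promises no
// further writes will be proposed.
type proposal struct {
	writeTS hlcTimestamp
	safeTS  hlcTimestamp
}

// follower tracks the highest safe timestamp it has seen in the log.
type follower struct{ maxSafe hlcTimestamp }

func (f *follower) applyFromLog(p proposal) {
	if p.safeTS > f.maxSafe {
		f.maxSafe = p.safeTS
	}
}

// canServeRead reports whether the follower may serve a read at ts without
// coordinating with the leaseholder.
func (f *follower) canServeRead(ts hlcTimestamp) bool { return ts <= f.maxSafe }

func main() {
	var f follower
	f.applyFromLog(proposal{writeTS: 20, safeTS: 10})
	f.applyFromLog(proposal{writeTS: 25, safeTS: 15}) // empty proposals work the same way
	fmt.Println(f.canServeRead(12)) // true: below the advertised safe timestamp
	fmt.Println(f.canServeRead(18)) // false: must go to the leaseholder
}
```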


docs/RFCS/00000000_max_safe_timestamp.md, line 98 at r2 (raw file):

Previously, vivekmenezes wrote…

There could be another option to have the leaseholder periodically propose an empty command to raft with a future timestamp and disallow writes to that range until the future timestamp. This is like giving out a lease to the followers until a certain timestamp to allow reads at the current timestamp from followers at the expense of slowing down writes considerably

I don't think we should ever have this mechanism interfere with write traffic. In fact, that should be one of the goals stated in the RFC.


docs/RFCS/00000000_max_safe_timestamp.md, line 84 at r3 (raw file):

  operation under a single range lease, because all timestamps for writes are
  getting verified by the same leaseholder. `maxOffset` only needs to come in
  play when the lease for a range changes hands, as described below.

Just pointing out that you should try to consider (and perhaps accommodate, if possible) the case maxOffset = infinity (clockless reads).


docs/RFCS/00000000_max_safe_timestamp.md, line 108 at r3 (raw file):

leaseholder couldn't have proposed a write with a timestamp greater than
`clock.Now() + maxOffset`, so new leaseholders could start out by setting
`max_proposed_write_timestamp` to `clock.Now() + maxOffset`.

It seems that you should be able to remove maxOffset from the equation. You know the previous lease (well, at least at some point in time you knew it) which has an expiration, so just use that as the floor.
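A minimal sketch of that substitution (hypothetical helper name, timestamps simplified to integers, with the original `clock.Now() + maxOffset` bound as a fallback when the previous lease is unknown):

```go
package main

import "fmt"

type hlcTimestamp int64

// initialMaxProposedWriteTS computes a safe starting value for a new
// leaseholder's max_proposed_write_timestamp. If the previous lease's
// expiration is known, it is a tighter floor than now + maxOffset, since
// the old leaseholder could not have proposed writes past its lease.
func initialMaxProposedWriteTS(prevLeaseExpiration, now, maxOffset hlcTimestamp) hlcTimestamp {
	if prevLeaseExpiration != 0 {
		return prevLeaseExpiration
	}
	// Fallback: no previous leaseholder could have proposed a write above
	// now + maxOffset.
	return now + maxOffset
}

func main() {
	fmt.Println(initialMaxProposedWriteTS(105, 100, 500)) // prints 105
	fmt.Println(initialMaxProposedWriteTS(0, 100, 500))   // prints 600
}
```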


Comments from Reviewable

@andreimatei
Contributor

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 238 at r3 (raw file):

"max safe timestamp" that provides more insight into the concept. Instead of
proposing dummy write commands to update the `max_write_timestamp` through Raft,
the leaseholder could instead periodically bump its `TimestampCache` low water

Alex just pointed out to me that the tscache is no longer a per-range structure - it's per store. So we should qualify here that we'd be sending the tscache entries overlapping the range in question.


docs/RFCS/00000000_max_safe_timestamp.md, line 267 at r3 (raw file):

shipping the bookkeeping required to allow a follower read. The origins of this
idea came from benefit 4 in the original comment of [this forum
post](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360).

We've discussed more than written here, and I think it'd be cool to write all of it up: a generalization of the generalization is that followers' requests don't always (usually?) need to wait for a Raft proposal. The follower needs the leaseholder's tscache to be updated, and it needs to make sure that it is up to date enough with the Raft log to serve the read. So, it could tell the leaseholder what its current raft index is; the leaseholder's tscache could be augmented with a raft index number for each element, meaning what index a follower needs to be at to serve reads that fall within that tscache entry. Now, when the leaseholder gets a request from a follower, it will:

  1. check its tscache, update it if necessary, and observe that the follower is already up to date enough to serve the read
  2. if the follower is not up to date enough, then it needs to tell the follower what index it needs to wait for. This will be the current index if the tscache had to be bumped, otherwise it's the raft index indicated by the tscache

If the requests from followers go through the command queue (to synchronize with in-flight writes), then I think no raft proposal is necessary.
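The leaseholder-side decision in steps 1 and 2 could be sketched as follows (hypothetical function and simplified types, not a real CockroachDB API):

```go
package main

import "fmt"

// checkFollowerRead sketches the leaseholder's side of the protocol: given
// the follower's applied raft index, the raft index recorded on the
// overlapping tscache entry, and whether the cache had to be bumped for
// this read, either clear the read immediately or return the index the
// follower must wait for before serving it.
func checkFollowerRead(followerIndex, entryIndex, currentIndex uint64, cacheBumped bool) (ok bool, waitFor uint64) {
	need := entryIndex
	if cacheBumped {
		// The cache was just bumped, so the follower must catch up to the
		// leaseholder's current log position.
		need = currentIndex
	}
	if followerIndex >= need {
		return true, 0
	}
	return false, need
}

func main() {
	ok, wait := checkFollowerRead(50, 40, 55, false)
	fmt.Println(ok, wait) // true 0: follower is already past the entry's index
	ok, wait = checkFollowerRead(50, 40, 55, true)
	fmt.Println(ok, wait) // false 55: must catch up to the bumped index
}
```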



@tbg
Member

tbg commented Oct 19, 2017

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Unrelated, but this might be a better alternative than waiting for the reproposal (or triggering a reproposal of potentially many old in-flight things) in general. Currently we repropose wildly, but we could send a "canary" which is potentially cheaper. We also don't currently actively invalidate commands once we've observed a higher lease applied index (i.e. if we proposed 3 a while ago but then we see 4 apply, we don't eagerly notify 3, but we probably should (and can gate the check on observing a gap)). Might be worth filing, though hopefully we don't need it in practice.

Related: I'm not sure I understand how this mechanism is supposed to work:

  1. leaseholder proposes a write at timestamp 10
  2. write gets dropped but it will be reproposed randomly later
  3. but there's a steady stream of writes that gets through just fine
  4. around 9 seconds later, the leaseholder gets really nervous and tries to propose a new command to "cancel" the inflight
  5. that proposal also gets stuck so nothing happens. Leaseholder can choose: stall all writes, or risk stale reads. Let's do stale reads because example.
  6. at second 10, the followers happily start dishing out "stale" reads
  7. the write finally commits, but basically in violation of the (hypothetical) timestamp cache on the followers

So essentially correctness seems to hinge on being able to control how much time passes between proposal and application, or to stop accepting writes.

That seems obviously broken so I must be misunderstanding something fundamental. I pictured that the leaseholder would piggyback a safe timestamp on its proposals (or just send empty proposals when it needs to) that basically carry max_safe_timestamp. Followers would then allow reads below that timestamp. (And nothing mandates that we'd even need to eat the Raft overhead - it could just be a new RPC). Would be nice if the mechanism here worked because it has a tiny bit less overhead, but having to stall writes seems super dangerous, and would probably happen especially if we want to get really close behind the write frontier (which would be a huge plus I imagine). The "auto-bumping" mechanism also seems at odds with the CDC use case for which the close notification may end up lagging.

Ah, silly me. The steady stream of writes would immediately surpass the straggler's lease applied index. I'm still not sure this mechanism is easy to use for the CDC use case, but I understand how the basic version here checks out.



a-robinson and others added 2 commits October 29, 2017 17:32
Mostly just a more verbose transcription of the notes I took during
my discussion with Nathan yesterday. Still a WIP compared to the
expected format for RFCs.
@a-robinson
Contributor Author

TFTRs!


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions.


docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

the only intents that can be seen will be those for already committed transactions

"already finalized transactions", right? The txns can be committed or aborted as long as they're not pending.

It seems possible to run into an intent for a transaction that hasn't been committed or aborted, but the transaction would be abort-able at that point.


docs/RFCS/00000000_max_safe_timestamp.md, line 73 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Nathan, why do they have to be aborted? I agree that they should (because we don't want to wait). If they have to we'd have a problem because the intent may as well have be committed.

Side note: when serving reads from many followers, this will likely mean multiple (x #nodes) racing intent resolutions (we deduplicate somewhat in intent resolver, but only on each node). May not be a huge problem, but should definitely try to avoid intents there in the first place.

An automatically managed backoff would be interesting. Whenever an intent is hit, the lease holder adjusts (doubles?) the gap to the write frontier.

Right, I don't believe that they'd necessarily have to be aborted.


docs/RFCS/00000000_max_safe_timestamp.md, line 51 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Does "observed" mean "appended to the raft log" or "applied"?

Great question. I think that appended to the raft log is sufficient.


docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ah, silly me. The steady stream of writes would immediately surpass the straggler's lease applied index. I'm still not sure this mechanism is easy to use for the CDC use case, but I understand how the basic version here checks out.

Yeah, "lingering" was referring to real time. I've tried to clarify this, since it was definitely a bit terse.


docs/RFCS/00000000_max_safe_timestamp.md, line 74 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How are the transactions aborted? The transaction record may live on another range; how is this synchronization managed?

Why would any synchronization be needed beyond the normal synchronous intent resolution that gets done for reads?


docs/RFCS/00000000_max_safe_timestamp.md, line 98 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I don't think we should ever have this mechanism interfere with write traffic. In fact, that should be one of the goals stated in the RFC.

Yeah, I don't know when someone would want to make that extreme of a tradeoff. Added a short list of goals, including not interfering with writes.


docs/RFCS/00000000_max_safe_timestamp.md, line 84 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just pointing out that you should try to consider (and perhaps accommodate, if possible) the case maxOffset = infinity (clockless reads).

I'm a bit out of touch with our clockless reads. Do you mind explaining how clockless mode would come into play given that we're basing things on lease expiration timestamps?


docs/RFCS/00000000_max_safe_timestamp.md, line 108 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It seems that you should be able to remove maxOffset from the equation. You know the previous lease (well, at least at some point in time you knew it) which has an expiration, so just use that as the floor.

Good point, changed.


docs/RFCS/00000000_max_safe_timestamp.md, line 238 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Alex just pointed out to me that the tscache is no longer a per-range structure - it's per store. So we should qualify here that we'd be sending the tscache entries overlapping the range in question.

Done.


docs/RFCS/00000000_max_safe_timestamp.md, line 269 at r3 (raw file):

post](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360).

Of course, shipping the `TimestampCache` through Raft would probably be too

Sending it through raft probably would be, due to all the extra disk writes that would be involved, but sending this stuff through a separate RPC might not be totally crazy. If you'd like to do an experiment to estimate how much bandwidth/serialization cost there'd be, we could consider this more seriously.



@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


docs/RFCS/00000000_max_safe_timestamp.md, line 31 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It seems possible to run into an intent for a transaction that hasn't been committed or aborted, but the transaction would be abort-able at that point.

I think it has to be aborted, not just abortable. We have to know that the transaction will never commit, and I don't think we can know that without actually aborting it (as long as it's pending, the commit could be in flight in the raft log).


docs/RFCS/00000000_max_safe_timestamp.md, line 63 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Yeah, "lingering" was referring to real time. I've tried to clarify this, since it was definitely a bit terse.

If a real command gets delayed going into the raft log, what's stopping these dummy commands from doing the same thing? I think we have to rely solely on timestamps that have appeared in the raft log.

The "heartbeats" described below to keep max_write_timestamp from falling too far behind have the side effect of disqualifying any "lingering" commands, but this is an optimization to advance the timestamps rather than a safety requirement.


docs/RFCS/00000000_max_safe_timestamp.md, line 74 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why would any synchronization be needed beyond the normal synchronous intent resolution that gets done for reads?

Ah, so you're talking about the normal post-read intent resolution. That's fine, but it means that this "max safe timestamp" is no longer a guarantee that local reads will be successful - you may have to push the transaction remotely, and after pushing you can't tell when the local replica has caught up with the intent resolution, so you have to read remotely as well.

I was assuming that we wouldn't be advancing the follower read timestamp until all transactions were finalized and all intents resolved (since that's what CDC will require). We can do this simpler version for follower reads, although leaving unresolved intents may mean it performs worse than expected.


docs/RFCS/00000000_max_safe_timestamp.md, line 65 at r5 (raw file):

* All replicas track the highest write timestamp they have appended to their
  raft log, which we'll call the `max_write_timestamp`.
* The leaseholder will also track the `max_proposed_write_timestamp`.

It looks like we require that max_proposed_write_timestamp be monotonic across leaseholders. How is this guaranteed? (Advancing max_write_timestamp on log append instead of apply seems to make this trickier.) It's likely that the existing lease mechanisms solve this, but we need a more complete analysis of how we guarantee that the start time of a new lease will not allow the max safe timestamp to move backwards (do we rely on the lease duration being shorter than max_write_age?)


docs/RFCS/00000000_max_safe_timestamp.md, line 128 at r5 (raw file):

have proposed a write with a timestamp greater than its lease expiration, so new
leaseholders can start out by setting `max_proposed_write_timestamp` to the
previous lease's expiration timestamp.

What does this mean for epoch-based leases? Leases have start times; can we use that instead?


Comments from Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request Dec 6, 2017
This RFC is incomplete. It is merged in `draft` status.

This RFC proposes a mechanism for subscribing to changes to a set of key ranges
starting at an initial timestamp.

It is for use by Cockroach-internal higher-level systems such as distributed
SQL, change data capture, or a Kafka producer endpoint. A "sister RFC" detailing
these higher-level systems is in cockroachdb#17535. See also cockroachdb#19222 for a related
follower-reads RFC.
@a-robinson a-robinson closed this Sep 21, 2018
9 participants