-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: route present-time reads to global_read follower replicas #59571
kv: route present-time reads to global_read follower replicas #59571
Conversation
441a8fc
to
11a7493
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 18 of 18 files at r1, 17 of 17 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 97 at r2 (raw file):
} // closedTimestampLikelySufficient determines if a request at a given timestamp
much clearer
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 265 at r2 (raw file):
// TestOracle tests the OracleFactory exposed by this package. func TestOracleFactory(t *testing.T) {
nit: TestOracle
and comment
d3de04b
to
53860a9
Compare
4117d4c
to
a501736
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 19 files at r3.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 89 at r3 (raw file):
return 0, err } return getFollowerReadLag(st), nil
Should we add a comment here (or near the follower_read_timestamp()
builtin definition) mentioning that AS OF SYSTEM TIME follower_read_timestamp()
will always return a negative offset regardless of whether the range(s) serving the query close timestamps in the past?
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 99 at r3 (raw file):
// closedTimestampLikelySufficient determines if a request at a given timestamp // is likely to be below a follower's closed timestamp and servicable as a
s/servicable/serviceable
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 117 at r3 (raw file):
} expectedClosedTS := clock.Now().Add(offset.Nanoseconds(), 0) return maxObservableTS.LessEq(expectedClosedTS)
Until this point, I was under the impression that historical queries (via AOST
) don't have a notion of uncertainty and that they can observe causal inversion. However, I see that both this and the previous implementation were accounting for the max-offset here. Is this so we preserve linearizability for long running txns?
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 313 at r3 (raw file):
setLatency("3", 80*time.Millisecond) testCases := []struct {
The test is very cool!
pkg/roachpb/data.go, line 932 at r3 (raw file):
// // There is a case where an intent written by a transaction is above the // transactions write timestamp — after a successful intent push. Such cases
minor nit: transaction's
pkg/roachpb/data.go, line 937 at r3 (raw file):
// // WIP: something about how the intent will be present on followers with a // sufficiently high closed timestamp, either in its pre or post push state.
I was really confused about how reading an intent in its post-push state on a follower would be correct until I found:
cockroach/pkg/storage/pebble_mvcc_scanner.go
Lines 441 to 444 in d67299e
// 11. We're reading our own txn's intent at an equal or higher sequence. | |
// Note that we read at the intent timestamp, not at our read timestamp | |
// as the intent timestamp may have been pushed forward by another | |
// transaction. Txn's always need to read their own writes. |
It might just be because I'm unfamiliar with this stuff but do you think we should point to something like this from here?
However, if own-intents are the only reason we're Forward
ing by WriteTimestamp
, then I wonder if we can just do without it. It seems like it only "hurts" us by making follower-reads less likely. Am I misunderstanding something?
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 117 at r3 (raw file):
Oh, I missed this before. The previous implementation seems to be shifting the wrong(?) way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, non-transactional requests are no longer served by followers even if the follower replica detects that the request can be. Previously, non-transactional requests would never be routed intentionally to a follower replica, but Replica.canServeFollowerRead would allow them through if their timestamp was low enough. This change was made in order to keep the client and server logic in-sync and because this does not seem safe for non-transactional requests that get their timestamp assigned on the server.
Why do you say this isn't safe for requests that get their timestamp on the server? I understand that these request don't have an uncertainty interval (so I don't understand how they're linearizable at all) but why does it matter who assigns the timestamp?
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @nvanbenschoten)
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 98 at r3 (raw file):
} // closedTimestampLikelySufficient determines if a request at a given timestamp
I'd make it a point of explaining that the decision is range-specific, and introduce ctPolicy
. Maybe name ctPolicy -> range[Ct]Policy
.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 98 at r3 (raw file):
} // closedTimestampLikelySufficient determines if a request at a given timestamp
Also, the comment talks about :a request at a given ts", but the function takes a "maxObservableTS" which we generally don't call "the request's timestamp". Consider adding some words here, explaining that it's not the "request's timestamp" that matters, but this other thing.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 106 at r3 (raw file):
// canSendToFollower implements the logic for checking whether a batch request // may be sent to a follower. // TODO(aayush): We should try to bind clusterID to the function here, rather
this todo is in the wrong place, but deleting it is just a slap in the face :)
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 117 at r3 (raw file):
Until this point, I was under the impression that historical queries (via AOST) don't have a notion of uncertainty and that they can observe causal inversion
You're still under this impression, right?
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 123 at r3 (raw file):
// may be sent to a follower. func canSendToFollower( clusterID uuid.UUID,
if you're interested, consider taking the opportunity (perhaps different patch) to bind this damn clusterID
pkg/roachpb/data.go, line 912 at r3 (raw file):
func (t Transaction) LastActive() hlc.Timestamp { ts := t.LastHeartbeat // Only forward by the ReadTimestamp if it is a clock timestamp.
nit: this comment is useless now that the todo is gone; delete it
pkg/roachpb/data.go, line 920 at r3 (raw file):
// MaxObservableTimestamp returns the largest timestamp at which the transaction // may observe when performing a read-only operation. This is the maximum of the
nit: "may observe" is weird because it sounds like "transactions observing" is a term of art. Maybe "read values".
pkg/roachpb/data.go, line 923 at r3 (raw file):
// transaction's read timestamp, its write timestamp, and its global uncertainty // limit. func (t Transaction) MaxObservableTimestamp() hlc.Timestamp {
you sure you don't want a pointer receiver on this? And maybe on other methods around, but that can wait.
pkg/roachpb/data.go, line 934 at r3 (raw file):
// transactions write timestamp — after a successful intent push. Such cases // do allow a transaction to operate above its max observable timestamp. // However, this is fine...
don't forget to expand on the "this is fine".
pkg/roachpb/data.go, line 937 at r3 (raw file):
However, if own-intents are the only reason we're Forwarding by WriteTimestamp, then I wonder if we can just do without it. It seems like it only "hurts" us by making follower-reads less likely. Am I misunderstanding something?
Well but how are you suggesting we make sure we read our own writes when going to a follower?
pkg/roachpb/data.go, line 938 at r3 (raw file):
// WIP: something about how the intent will be present on followers with a // sufficiently high closed timestamp, either in its pre or post push state. // Either works. Do we need to change the contract of MaxObservableTimestamp
yeah, MaxObservable
seems to not describe this well (and I personally didn't really like the name to begin with). How about... RequiredFrontier
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random question - do we route batches with QueryIntent
in them to followers?
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, and @nvanbenschoten)
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 117 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Until this point, I was under the impression that historical queries (via AOST) don't have a notion of uncertainty and that they can observe causal inversion
You're still under this impression, right?
Yes. So would it be worth it to plumb a flag down here and disregard the max offset if the request was sent from an AOST
query?
If not, we could just spell out the reasoning for accounting for the max-offset in a comment here.
pkg/roachpb/data.go, line 937 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
However, if own-intents are the only reason we're Forwarding by WriteTimestamp, then I wonder if we can just do without it. It seems like it only "hurts" us by making follower-reads less likely. Am I misunderstanding something?
Well but how are you suggesting we make sure we read our own writes when going to a follower?
OK, I think I understand. Without Forward
ing to the WriteTimestamp
, we can't guarantee that an intent is even present on the follower replica serving the request.
First commit from cockroachdb#59505. This commit updates the kv client routing logic to account for the new `LEAD_FOR_GLOBAL_READS` `RangeClosedTimestampPolicy` added in cockroachdb#59505. In enterprise builds, non-stale read-only requests to ranges with this closed timestamp policy can now be routed to follower replicas, just like stale read-only requests to normal ranges currently are. In addition to this main change, there are a few smaller changes in this PR that were hard to split out, so are included in this commit. First, non-transactional requests are no longer served by followers even if the follower replica detects that the request can be. Previously, non-transactional requests would never be routed intentionally to a follower replica, but `Replica.canServeFollowerRead` would allow them through if their timestamp was low enough. This change was made in order to keep the client and server logic in-sync and because this does not seem safe for non-transactional requests that get their timestamp assigned on the server. We're planning to remove non-transactional requests soon anyway (cockroachdb#58459), so this isn't a big deal. It mostly just caused some testing fallout. Second, transactional requests that had previously written intents are now allowed to have their read-only requests served by followers, as long as those followers have a closed timestamp above the transaction's read *and* write timestamp. Previously, we had avoided this because it seemed to cause issues with read-your-writes. However, it appears safe as long as the write timestamp is below the closed timestamp, because we know all of the transaction's intents are at or below its write timestamp. This is very important for multi-region work, where we want a transaction to be able to write to a REGIONAL table and then later perform local reads (maybe foreign key checks) on GLOBAL tables. Third, a max clock offset shift in `canUseFollowerRead` was removed. It wasn't exactly clear what this was doing. It was introduced in the original 70be833 and seemed to allow a follower read in cases where they otherwise shouldn't be expected to succeed. I thought at first that it was accounting for the fact that the kv client's clock might be leading the kv server's clock and so it was being pessimistic about the expected closed timestamp, but it actually seems to be shifting the other way, so I don't know. I might be missing something. Finally, the concept of a `replicaoracle.OracleFactoryFunc` was removed and the existing `replicaoracle.OracleFactory` takes its place. We no longer need the double-factory approach because the transaction object is now passed directly to `Oracle.ChoosePreferredReplica`. This was a necessary change, because the process of determining whether a follower read can be served now requires transaction information and range information (i.e. the closed timestamp policy), so we need to make it in the Oracle itself instead of in the OracleFactory. This all seems like a simplification anyway. This is still waiting on changes to the closed timestamp system to be able to write end-to-end tests using this new functionality.
Stop using timeutil.Now to create HLC timestamps.
a501736
to
2c972f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Why do you say this isn't safe for requests that get their timestamp on the server? I understand that these request don't have an uncertainty interval (so I don't understand how they're linearizable at all) but why does it matter who assigns the timestamp?
Because if it isn't the leaseholder who is assigning the timestamp then the replica's clock may trail the commit timestamp of a write with a happens-before relation to the read. This is all kind of broken though. See #58459.
do we route batches with QueryIntent in them to followers?
Discussed offline. We currently don't because we take a fairly pessimistic view of what can happen during async consensus. We assume that an in-flight write can be pushed to a higher timestamp between the time that pipelining begins and when the intent first "lands". And so we don't use a similar argument to the one we're using in RequiredFrontier
to guarantee read-your-writes.
I think this is probably too pessimistic of a view and it may be ok to route QueryIntent requests to followers if we loosen its semantics a little bit. But that isn't needed here, so I'll defer that investigation.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @ajwerner, and @andreimatei)
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 89 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Should we add a comment here (or near the
follower_read_timestamp()
builtin definition) mentioning thatAS OF SYSTEM TIME follower_read_timestamp()
will always return a negative offset regardless of whether the range(s) serving the query close timestamps in the past?
Done.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 98 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'd make it a point of explaining that the decision is range-specific, and introduce
ctPolicy
. Maybe namectPolicy -> range[Ct]Policy
.
You don't think the name of the type makes that clear enough?
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 98 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Also, the comment talks about :a request at a given ts", but the function takes a "maxObservableTS" which we generally don't call "the request's timestamp". Consider adding some words here, explaining that it's not the "request's timestamp" that matters, but this other thing.
Done.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 99 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
s/servicable/serviceable
Done.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 106 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this todo is in the wrong place, but deleting it is just a slap in the face :)
👋
I don't see anything better we can do. utilccl.CheckEnterpriseEnabled
requires a cluster ID. Hiding it in some object and then turning these static functions into methods just makes things even less clear.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 117 at r3 (raw file):
A request sent with AOST will have no uncertainty interval, that's correct. That will be reflected in the transaction's GlobalUncertaintyLimit
, which will be equal to its ReadTimestamp
. So we don't need to plumb anything else down here.
The previous implementation seems to be shifting the wrong(?) way.
That's what it looks like. It confused me too.
If not, we could just spell out the reasoning for accounting for the max-offset in a comment here.
We aren't accounting for max offset here anymore. This is now handled in RequiredFrontier
.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 123 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
if you're interested, consider taking the opportunity (perhaps different patch) to bind this damn
clusterID
Bind it to what? I don't love this, but I don't like introducing some pointless object even more. It just adds even more noise. Functions are at least easy to test and this doesn't get plumbed too far.
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 265 at r2 (raw file):
Previously, ajwerner wrote…
nit:
TestOracle
and comment
Done.
pkg/roachpb/data.go, line 912 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: this comment is useless now that the todo is gone; delete it
Done.
pkg/roachpb/data.go, line 920 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: "may observe" is weird because it sounds like "transactions observing" is a term of art. Maybe "read values".
Done.
pkg/roachpb/data.go, line 923 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you sure you don't want a pointer receiver on this? And maybe on other methods around, but that can wait.
Done.
pkg/roachpb/data.go, line 932 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
minor nit:
transaction's
Done.
pkg/roachpb/data.go, line 934 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
don't forget to expand on the "this is fine".
Done.
pkg/roachpb/data.go, line 937 at r3 (raw file):
we can't guarantee that an intent is even present on the follower replica serving the request.
Yes, exactly. Sorry for the confusion, I was lazy by leaving this WIP comment and skipping the most subtle point.
but do you think we should point to something like this from here?
Done.
pkg/roachpb/data.go, line 938 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
yeah,
MaxObservable
seems to not describe this well (and I personally didn't really like the name to begin with). How about...RequiredFrontier
?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, and @nvanbenschoten)
pkg/ccl/kvccl/kvfollowerreadsccl/followerreads.go, line 98 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You don't think the name of the type makes that clear enough?
I guess it does, nvm.
bors r+ |
Build failed (retrying...): |
Build succeeded: |
Fixes cockroachdb#62447. In cockroachdb#62447, Erik found that cockroachdb#59571 had re-ordered the call to `utilccl.CheckEnterpriseEnabled` to occur before checking the batch in `canSendToFollower`, instead of after. This added an error allocation into the hot path of all reads, which showed up in CPU profiles and caused an 8% performance regression on `kv95`. This commit fixes this by moving the enterprise check back out of the hot-path for all non-stale read-only batches. A follow up to this PR would be to make `utilccl.CheckEnterpriseEnabled` cheaper by avoiding the error allocation for callers that don't need an error. This work is not done in this commit.
62377: changefeedccl: allow topic_name parameter for changefeed kafka sinks r=[stevendanna,miretskiy] a=HonoreDB Previously, changes for a table went to a Kafka topic named for that table, with users only able to specify a prefix. Some users, however, need changes to go to a specific topic, including sometimes the same one for more than one table, distinguishing messages using metadata. This patch allows the `?topic_name=foo` parameter to be added to Kafka sink URIs. This will override the per-table topic generation, so that changes for every table will all go to the specified topic. It may be used in conjunction with `topic_prefix`, although the distinction is not meaningful. Release note (enterprise change): Kafka sink URIs now accept the "topic_name" parameter to override per-table topic names. Closes #59300 Closes #58302 62414: workload/schemachange: add SURVIVE syntax r=ajwerner a=otan see individual commits for details 62420: cliccl/load: fix TestLoadShowIncremental typo r=pbardea a=Elliebababa This patch fixs typo of TestLoadShowIncremental. Resolves: #62416 Release note: none 62465: kvccl: re-order enterprise check in canSendToFollower r=nvanbenschoten a=nvanbenschoten Fixes #62447. In #62447, Erik found that #59571 had re-ordered the call to `utilccl.CheckEnterpriseEnabled` to occur before checking the batch in `canSendToFollower`, instead of after. This added an error allocation into the hot path of all reads, which showed up in CPU profiles and caused an 8% performance regression on `kv95`. This commit fixes this by moving the enterprise check back out of the hot-path for all non-stale read-only batches. A follow up to this PR would be to make `utilccl.CheckEnterpriseEnabled` cheaper by avoiding the error allocation for callers that don't need an error. This work is not done in this commit. 62473: kvserver: unskip TestEagerReplication r=lunevalex a=lunevalex PR #61847 fixed the flake in TestEagerReplication but was not rebased with master, so the skipped tag was not properly removed. This PR actually unskips TestEagerReplication. Release note: None Co-authored-by: Aaron Zinger <[email protected]> Co-authored-by: Oliver Tan <[email protected]> Co-authored-by: elliebababa <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Alex Lunev <[email protected]>
Fixes cockroachdb#62447. In cockroachdb#62447, Erik found that cockroachdb#59571 had re-ordered the call to `utilccl.CheckEnterpriseEnabled` to occur before checking the batch in `canSendToFollower`, instead of after. This added an error allocation into the hot path of all reads, which showed up in CPU profiles and caused an 8% performance regression on `kv95`. This commit fixes this by moving the enterprise check back out of the hot-path for all non-stale read-only batches. A follow up to this PR would be to make `utilccl.CheckEnterpriseEnabled` cheaper by avoiding the error allocation for callers that don't need an error. This work is not done in this commit.
This commit updates the kv client routing logic to account for the new
LEAD_FOR_GLOBAL_READS
RangeClosedTimestampPolicy
added in #59505. In enterprise builds, non-stale read-only requests to ranges with this closed timestamp policy can now be routed to follower replicas, just like stale read-only requests to normal ranges currently are.In addition to this main change, there are a few smaller changes in this PR that were hard to split out, so are included in this commit.
First, non-transactional requests are no longer served by followers even if the follower replica detects that the request can be. Previously, non-transactional requests would never be routed intentionally to a follower replica, but
Replica.canServeFollowerRead
would allow them through if their timestamp was low enough. This change was made in order to keep the client and server logic in-sync and because this does not seem safe for non-transactional requests that get their timestamp assigned on the server. We're planning to remove non-transactional requests soon anyway (#58459), so this isn't a big deal. It mostly just caused some testing fallout.Second, transactional requests that had previously written intents are now allowed to have their read-only requests served by followers, as long as those followers have a closed timestamp above the transaction's read and write timestamp. Previously, we had avoided this because it seemed to cause issues with read-your-writes. However, it appears safe as long as the write timestamp is below the closed timestamp, because we know all of the transaction's intents are at or below its write timestamp. This is very important for multi-region work, where we want a transaction to be able to write to a REGIONAL table and then later perform local reads (maybe foreign key checks) on GLOBAL tables.
Third, a max clock offset shift in
canUseFollowerRead
was removed. It wasn't exactly clear what this was doing. It was introduced in the original 70be833 and seemed to allow a follower read in cases where they otherwise shouldn't be expected to succeed. I thought at first that it was accounting for the fact that the kv client's clock might be leading the kv server's clock and so it was being pessimistic about the expected closed timestamp, but it actually seems to be shifting the other way, so I don't know. I might be missing something.Finally, the concept of a
replicaoracle.OracleFactoryFunc
was removed and the existingreplicaoracle.OracleFactory
takes its place. We no longer need the double-factory approach because the transaction object is now passed directly toOracle.ChoosePreferredReplica
. This was a necessary change, because the process of determining whether a follower read can be served now requires transaction information and range information (i.e. the closed timestamp policy), so we need to make it in the Oracle itself instead of in the OracleFactory. This all seems like a simplification anyway.This is still waiting on changes to the closed timestamp system to be able to write end-to-end tests using this new functionality.