LWT routing optimisation: plan contains replicas in ring order #717
Conversation
I don't like that LWT queries become so expensive (at least on the driver side, server-side it might still be a win). AFAIK SimpleStrategy is being deprecated so if someone uses LWT then LWT + NTS combo is going to be a frequent combination for them. I don't agree with the assertion that "LWT are rare" - maybe the LWT users are rare, but LWT queries aren't rare for them.
It's fine if computing replicas is expensive in case of fallback, but it's not if we want to pick the first replica. As I understand, this was the big idea behind the new design of the load balancing module.
If implementing an efficient pick() is still difficult, then we can consider changing the behavior of the optimization and making sure that the rust driver always chooses coordinators in some deterministic order, which can be specific to the driver itself but quicker to compute. Such an implementation should work well if a user only uses the rust driver and nothing else (at least for LWT queries for some particular table); however, I'd rather do it as a last resort and revisit this problem later.
It might be possible to implement ring-wise iteration efficiently. I've got an idea to do as little work as possible:

```rust
struct ReplicasOrderedIter {
    indexes: SmallVec<usize>,
    replica_lists: SmallVec<&[(NodeRef, Token)]>,
    // replica_lists could also be represented like in `ReplicaSet::ChainedNTS`,
    // but care must be taken when handling non-precomputed ones.
}

impl ReplicasOrderedIter {
    fn next(&mut self) -> Self::Item {
        // Using indexes[i], select (replica, token) from the i-th replica_list.
        // Return the replica with the lowest token among the selected ones.
        // Increase the index that pointed to the selected replica.
    }
}
```

@piodul @wprzytula what do you think about this?
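The merge idea above can be made concrete as a small k-way merge over token-sorted per-datacenter replica lists. The following is a simplified, hypothetical sketch (plain `u64` tokens, `&str` node names, `Vec` instead of `SmallVec`), not the driver's actual types:

```rust
// Simplified model: each replica list is sorted by token; the iterator
// repeatedly yields the replica with the lowest token among the list heads.
struct ReplicasOrderedIter<'a> {
    indexes: Vec<usize>,                       // current position in each list
    replica_lists: Vec<&'a [(&'a str, u64)]>,  // (node name, token), sorted by token
}

impl<'a> Iterator for ReplicasOrderedIter<'a> {
    type Item = (&'a str, u64);

    fn next(&mut self) -> Option<Self::Item> {
        // Among the current heads of all lists, pick the one with the lowest token.
        let mut best: Option<(usize, (&'a str, u64))> = None;
        for (i, list) in self.replica_lists.iter().enumerate() {
            if let Some(&candidate) = list.get(self.indexes[i]) {
                match best {
                    Some((_, (_, t))) if t <= candidate.1 => {}
                    _ => best = Some((i, candidate)),
                }
            }
        }
        // Advance the index that pointed to the selected replica.
        best.map(|(i, item)| {
            self.indexes[i] += 1;
            item
        })
    }
}

fn main() {
    let dc1: &[(&str, u64)] = &[("a", 10), ("c", 30)];
    let dc2: &[(&str, u64)] = &[("b", 20), ("d", 40)];
    let iter = ReplicasOrderedIter {
        indexes: vec![0, 0],
        replica_lists: vec![dc1, dc2],
    };
    let merged: Vec<&str> = iter.map(|(node, _)| node).collect();
    assert_eq!(merged, vec!["a", "b", "c", "d"]); // interleaved in token order
    println!("{:?}", merged);
}
```

Each `next()` call is O(number of lists), so no upfront materialisation of the merged order is needed.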
If we keep
Not answering your question yet: do you, @havaker @piodul, consider the cost incurred by LWT queries in the current PR state acceptable for the
After the private conversation with @piodul and @havaker, the conclusion is to:
Force-pushed from 443e19a to 86142a6
v2:
Force-pushed from f88a4e8 to 395b7f4
Force-pushed from 430a748 to 5dd9f74
Looks good. I hoped it would require less code, but it seems this is the best we can do with the current design.
It looks like you need to update the examples in the documentation with respect to the last commit in order for the CI to pass.
```rust
// (in Node only `down_marker` is such, being an AtomicBool).
// https://rust-lang.github.io/rust-clippy/master/index.html#mutable_key_type
#[allow(clippy::mutable_key_type)]
let mut all_replicas: HashSet<&'a Arc<Node>> = HashSet::new();
```
I think that we should just drop the `impl {Eq,Hash} for Node` in the future (not in this PR). Those impls don't take the whole struct into consideration, as `Eq` impls usually do, so it's weird. We should explicitly use a `HashMap` and index by host id.
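A minimal sketch of the suggested direction, assuming a simplified `Node` where a plain `u128` stands in for the driver's host id type. Indexing by host id avoids relying on `Eq`/`Hash` impls over a struct with interior mutability:

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicBool;
use std::sync::Arc;

// Hypothetical, simplified Node: no Eq/Hash impls are provided on the
// struct itself, so it cannot accidentally be used as a HashSet key.
struct Node {
    host_id: u128, // stand-in for the driver's Uuid host id
    down_marker: AtomicBool,
}

// Index replicas explicitly by host id instead of hashing the whole
// (interiorly mutable) Node; duplicates by host id collapse to one entry.
fn collect_replicas(nodes: &[Arc<Node>]) -> HashMap<u128, &Arc<Node>> {
    let mut all_replicas = HashMap::new();
    for node in nodes {
        all_replicas.insert(node.host_id, node);
    }
    all_replicas
}

fn main() {
    let nodes = vec![
        Arc::new(Node { host_id: 1, down_marker: AtomicBool::new(false) }),
        Arc::new(Node { host_id: 2, down_marker: AtomicBool::new(false) }),
        // Same host id as the first node: deduplicated by the map.
        Arc::new(Node { host_id: 1, down_marker: AtomicBool::new(true) }),
    ];
    let replicas = collect_replicas(&nodes);
    assert_eq!(replicas.len(), 2);
}
```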
Force-pushed from 5dd9f74 to fb23dcd
The new name better defines the semantics of this type: we only provide the criteria that we want to be taken into account, without specifying the actual data (the replica's location) to be compared against.
This is a simple refactor to avoid code duplication.
It is clearer to avoid early returns and use the functional style instead.
This is another refactor that aims to reduce code duplication.
As the case of providing a preferred rack without specifying the preferred datacenter is invalid, in the spirit of the principle “Make illegal states unrepresentable”, an enum is introduced to restrict the possible cases. So far, it is only used in DefaultPolicy, but in the next major release DefaultBuilder's API should be changed accordingly to disallow creating such a configuration. It appears that there was a bug which caused incorrect behaviour when a preferred rack was provided without setting the preferred datacenter: instead of ignoring the preferred rack, replicas were filtered by it. Now, the following holds: it is no longer possible to turn on rack awareness without dc awareness; attempts to do so will result in a fallback to non-rack-awareness.
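The “make illegal states unrepresentable” idea from this commit can be sketched as follows; the enum and function names here are illustrative, not necessarily the driver's actual API:

```rust
// A rack preference cannot be expressed without a datacenter preference,
// because no enum variant carries a rack alone.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ReplicaLocationPreference {
    Any,
    Datacenter(String),
    DatacenterAndRack(String, String),
}

// Hypothetical constructor mirroring the described fallback behaviour:
// a rack given without a datacenter is ignored rather than used for filtering.
fn preference(dc: Option<&str>, rack: Option<&str>) -> ReplicaLocationPreference {
    match (dc, rack) {
        (Some(dc), Some(rack)) => {
            ReplicaLocationPreference::DatacenterAndRack(dc.into(), rack.into())
        }
        (Some(dc), None) => ReplicaLocationPreference::Datacenter(dc.into()),
        // Rack without datacenter is the invalid configuration: fall back.
        (None, _) => ReplicaLocationPreference::Any,
    }
}

fn main() {
    // The previously buggy configuration now degrades to non-rack-awareness.
    assert_eq!(preference(None, Some("rack1")), ReplicaLocationPreference::Any);
    assert_eq!(
        preference(Some("dc1"), Some("rack1")),
        ReplicaLocationPreference::DatacenterAndRack("dc1".into(), "rack1".into())
    );
}
```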
It is cleaner when the Criteria itself carries the information required for the filtering, and it is less error-prone (because one has to provide the required DC/rack name). As a bonus, the semantics of the helper fn calls in `pick()` were made saner: `pick_replica()` is called with `ReplicaLocationCriteria::Datacenter` only if the preferred datacenter is actually specified. `fallback()` was modified analogously (with regard to calling `shuffled_replicas()`).
We want to keep regarding pick() as a cheap, happy-path function which does not allocate. In the LWT optimisation, however, only the primary replica can be computed without allocations (due to the specifics of Network Topology Strategy, replicas are not assigned greedily; instead, their distribution across racks is balanced). If a policy (in our case, the default policy) recognises that a picked replica is down, it would try to pick another, and this computation would be expensive for ReplicasOrdered. Instead, having recognised in the LWT-optimised case that the picked replica is down, the policy returns None from pick() to hint that further computation will be expensive. The plan logic is hence altered to call fallback() even if pick() returns None. In the non-LWT case, as subsequent calls to pick() are still cheap, pick() will still try to find further replicas if one is recognised to be down. A test is added which asserts that `fallback()` is called if `pick()` returned None. To make the test possible to write, a convenience constructor is added for `Node` under `cfg(test)`.
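The described plan-logic change can be modelled with a toy policy trait. All names below are illustrative; the real `LoadBalancingPolicy` signatures differ:

```rust
// Simplified policy interface: pick() is the cheap path, fallback() the
// expensive one that materialises the remaining replicas.
trait Policy {
    fn pick(&self) -> Option<&'static str>;
    fn fallback(&self) -> Vec<&'static str>;
}

// Changed plan semantics: pick() returning None no longer means "empty plan",
// it means "the first node cannot be computed cheaply", so fallback() is
// consulted before giving up.
fn plan<P: Policy>(policy: &P) -> Vec<&'static str> {
    match policy.pick() {
        Some(first) => {
            let mut nodes = vec![first];
            nodes.extend(policy.fallback().into_iter().filter(|n| *n != first));
            nodes
        }
        None => policy.fallback(),
    }
}

// Hypothetical LWT-optimised policy whose primary replica is down: it does
// not compute further replicas cheaply, it just hints via None.
struct LwtPolicyWithDownPrimary;
impl Policy for LwtPolicyWithDownPrimary {
    fn pick(&self) -> Option<&'static str> {
        None // primary replica is down; further computation is expensive
    }
    fn fallback(&self) -> Vec<&'static str> {
        vec!["replica2", "replica3"] // expensive: whole ring order at once
    }
}

fn main() {
    // The plan is still non-empty despite pick() returning None.
    assert_eq!(plan(&LwtPolicyWithDownPrimary), vec!["replica2", "replica3"]);
}
```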
In order to be consistent with the LWT optimisation implementation in other drivers, it is mandatory to return a query plan consisting of replicas put in ring order. To accomplish that, the (abstractly unordered) ReplicaSet becomes convertible into a newly defined ReplicasOrdered struct. Iterating over ReplicasOrdered yields replicas in ring order, so policies can depend on it to perform the LWT optimisation correctly. Note that the computation of ReplicasOrdered is lazy and performed in two steps:
- the first step is relatively cheap and free of allocations; it involves finding the primary replica by iterating over the global ring. It is triggered by the first call to ReplicasOrderedIterator::next();
- the second step is relatively expensive and involves allocations. It computes the whole ReplicaArray of all replicas, put in the ring order.
Precomputation is not an option, as it would be very expensive to compute and store all precomputed arrays that would respect the global ring order (as opposed to only the local, per-DC ring order).
For LWT optimisation, we want to request replicas to be given in the ring order. For other cases (non-LWT requests) the order:
- either does not matter (can be arbitrary),
- or is required to be random, but then the replicas are to be shuffled anyway, so before shuffling they can be given, again, in arbitrary order.
This leads to introducing the enum ReplicaOrder with Arbitrary and RingOrder variants.
For strong typing, an enum StatementType is introduced, with Lwt and NonLwt variants. If the executed query is confirmed to be LWT, it will be routed to replicas in ring order.
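A minimal sketch of how the two enums from these commits could interact; the variant names follow the commit messages, while the mapping function and derives are hypothetical:

```rust
// Statement classification: confirmed LWT statements get special routing.
#[derive(Clone, Copy)]
enum StatementType {
    Lwt,
    NonLwt,
}

// Requested replica order: ring order for LWT, arbitrary otherwise
// (arbitrary-order replicas may still be shuffled afterwards).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReplicaOrder {
    Arbitrary,
    RingOrder,
}

// LWT statements request ring order and are never shuffled; for other
// statements the pre-shuffle order is irrelevant, so arbitrary suffices.
fn order_for(statement: StatementType) -> ReplicaOrder {
    match statement {
        StatementType::Lwt => ReplicaOrder::RingOrder,
        StatementType::NonLwt => ReplicaOrder::Arbitrary,
    }
}

fn main() {
    assert_eq!(order_for(StatementType::Lwt), ReplicaOrder::RingOrder);
    assert_eq!(order_for(StatementType::NonLwt), ReplicaOrder::Arbitrary);
}
```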
In the case of executing LWT statements, we expect the policy to return plans in an order exactly matching the ring order. To verify this, a new ExpectedGroup variant is added that is fed with a sequence of nodes put in the expected ring order.
(going) though->through (the ring)
These tests check that when an LWT statement is executed, the policy returns replicas in the fixed ring order. The logger is no longer initialised in the `test_default_policy_with_token_aware_statements` test, in order not to spam the console with (expected) errors about returning an empty plan.
Force-pushed from fb23dcd to ecb4a9f
While routing statements that are confirmed LWTs, replicas are never shuffled but instead attempted in the ring order. This is done according to scylladb/gocql#40.
`ReplicaSet` currently

In case of Network Topology Strategy, the replica iterator we obtain from the `ReplicaLocator` (`ReplicaSetIterator`) does not yield replicas in the ring order.

Meet `ReplicasOrdered`

To support that, a `ReplicasOrdered` struct is added, constructible from `ReplicaSet`. It is also convertible into `ReplicasOrderedIterator`, which does yield replicas in the ring order.

Costs incurred

The first call to `next()` on `ReplicasOrderedIterator` is cheap, because the primary replica can be computed cheaply, and this suits the `pick()` method. The second call to `next()` is expensive, because all the replicas are materialised in a vector (there is no cheap way to compute the second replica and so on). All subsequent calls are cheap (because the replicas are already computed). This suits the `fallback()` method's expected high cost. Because of this characteristic, a semantic change was made to `Plan` (and `LoadBalancingPolicy`, as a result).

Plan semantics change

Before, `pick()` could return `None`, and this signified that the plan is empty, so `fallback()` was not even called in such a case. Now, `pick()` can return `None`, and this signifies that the first node cannot be computed cheaply. `fallback()` is called even then, for it is still possible to compute the first node (and the others) more expensively.

Taken approach

If the first replica computed by `pick()` is believed to be down, then `pick()` doesn't attempt to compute any more replicas, but returns `None`. Then, `fallback()` is called and computes all remaining replicas at once. Thanks to that, `pick()` remains cheap.

Fixes: #672
Pre-review checklist
- [ ] I have provided docstrings for the public items that I want to introduce.
- [ ] I have adjusted the documentation in `./docs/source/`.
- [ ] I added appropriate `Fixes:` annotations to the PR description.