
LWT routing optimisation: plan contains replicas in ring order #717

Merged May 31, 2023 (16 commits)

Conversation

@wprzytula (Collaborator) commented Apr 27, 2023

While routing statements that are confirmed LWTs, replicas are never shuffled but are instead attempted in ring order. This is done in accordance with scylladb/gocql#40.

ReplicaSet currently

In the case of Network Topology Strategy, the replica iterator we obtain from the ReplicaLocator (ReplicaSetIterator) does not yield replicas in ring order.

Meet ReplicasOrdered

To support that, a ReplicasOrdered struct is added, constructible from a ReplicaSet. It is also convertible into a ReplicasOrderedIterator, which does yield replicas in ring order.
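A minimal usage sketch (an illustration, not code from this PR), assuming `ReplicasOrdered` implements `IntoIterator` and using the `into_replicas_ordered` conversion mentioned later in this thread:

// `replica_set: ReplicaSet` is obtained from the ReplicaLocator.
let ordered: ReplicasOrdered = replica_set.into_replicas_ordered();
for replica in ordered {
    // Replicas are yielded in ring order, as the LWT optimisation requires.
    let _ = replica;
}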

Costs incurred

The first call to next() on a ReplicasOrderedIterator is cheap, because the primary replica can be computed cheaply; this suits the pick() method.
The second call to next() is expensive, because all the replicas are materialised in a vector (there is no cheap way to compute the second replica and so on). All subsequent calls are cheap, because the replicas have already been computed. This suits fallback(), whose cost is expected to be high. Because of this characteristic, a semantic change was made to Plan (and to LoadBalancingPolicy, as a result).
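One plausible shape of the iterator's internal state that yields this cost profile (a sketch only, not the actual implementation; NodeRef<'a> is the driver's alias for &'a Arc<Node>):

// Illustrative state machine behind the two-step laziness.
enum ReplicasOrderedIteratorState<'a> {
    // Before the first next(): nothing has been computed yet.
    Fresh,
    // After the first next(): the primary replica was found cheaply by
    // walking the global ring; nothing was allocated.
    PrimaryReturned,
    // After the second next(): all replicas were materialised in ring
    // order, so all subsequent calls are cheap indexed lookups.
    AllComputed { replicas: Vec<NodeRef<'a>>, next_idx: usize },
}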

Plan semantics change

Before, pick() could return None only to signify that the plan is empty, so fallback() was not even called in that case.
Now, pick() returning None signifies that the first node cannot be computed cheaply. fallback() is called even then, for it is still possible to compute the first node (and the others) at a higher cost.

Approach taken

If the first replica computed by pick() is believed to be down, then pick() doesn't attempt to compute any more replicas, but returns None. fallback() is then called and computes all the remaining replicas at once.
Thanks to that, pick() remains cheap. A sketch of the resulting control flow follows.
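A hedged sketch of the new Plan control flow (simplified; the real pick() and fallback() also take routing and cluster arguments, and `try_node` here is a hypothetical consumer):

// None from pick() no longer means "empty plan"; it means "the cheap
// path failed", so the expensive fallback() is still consulted.
match policy.pick(query, cluster) {
    // Cheap happy path: the first node was computed without allocating.
    Some(node) => try_node(node),
    None => {
        // Expensive path: may allocate, but can still yield nodes.
        for node in policy.fallback(query, cluster) {
            try_node(node);
        }
    }
}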

Fixes: #672

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula requested review from havaker and cvybhu on April 27, 2023 15:27
@piodul (Collaborator) left a comment

I don't like that LWT queries become so expensive (at least on the driver side; server-side it might still be a win). AFAIK SimpleStrategy is being deprecated, so if someone uses LWT, then the LWT + NTS combo is going to be frequent for them. I don't agree with the assertion that "LWTs are rare" - maybe LWT users are rare, but LWT queries aren't rare for them.

It's fine for computing replicas to be expensive in the fallback case, but not when we want to pick the first replica. As I understand it, this was the big idea behind the new design of the load balancing module.

If implementing an efficient pick() is still difficult, then we can consider changing the behavior of the optimization: make sure that the rust driver always chooses coordinators in some deterministic order that may be specific to the driver itself but quicker to compute. Such an implementation should work well if a user only uses the rust driver and nothing else (at least for LWT queries on some particular table); however, I'd rather do that as a last resort and revisit this problem later.

@havaker (Contributor) commented Apr 28, 2023

> I don't like that LWT queries become so expensive (at least on the driver side; server-side it might still be a win). AFAIK SimpleStrategy is being deprecated, so if someone uses LWT, then the LWT + NTS combo is going to be frequent for them. I don't agree with the assertion that "LWTs are rare" - maybe LWT users are rare, but LWT queries aren't rare for them.
>
> It's fine for computing replicas to be expensive in the fallback case, but not when we want to pick the first replica. As I understand it, this was the big idea behind the new design of the load balancing module.
>
> If implementing an efficient pick() is still difficult, then we can consider changing the behavior of the optimization: make sure that the rust driver always chooses coordinators in some deterministic order that may be specific to the driver itself but quicker to compute. Such an implementation should work well if a user only uses the rust driver and nothing else (at least for LWT queries on some particular table); however, I'd rather do that as a last resort and revisit this problem later.

It might be possible to implement a ring-wise ReplicasOrdered more efficiently, but it would require bigger changes to the locator module. The problem I see now is that for the ChainedNTS variant of ReplicaSet, we construct ReplicasOrdered eagerly - the whole order is materialized into a Vec each time ReplicaSet::into_replicas_ordered is called. This is a performance problem in pick, because we usually want only the first replica in that order (if there are down nodes, we might need to filter some out).

I've got an idea: do as little as possible on ReplicasOrdered creation, and move the order computation to ReplicasOrderedIter::next. If we could precompute [(NodeRef, Token)] lists instead of [NodeRef] lists, merging a few precomputed lists in ring-wise order could be done lazily:

struct ReplicasOrderedIter<'a> {
    // Per-list cursors: indexes[i] is the current position in replica_lists[i].
    indexes: SmallVec<[usize; 8]>,

    replica_lists: SmallVec<[&'a [(NodeRef<'a>, Token)]; 8]>,
    // replica_lists could also be represented like in `ReplicaSet::ChainedNTS`,
    // but care must be taken when handling non-precomputed ones.
}

impl<'a> Iterator for ReplicasOrderedIter<'a> {
    type Item = NodeRef<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        // Using indexes[i], select (replica, token) from the i-th replica_list.
        // Return the replica with the lowest token among the selected ones.
        // Advance the index that pointed to the selected replica.
        todo!()
    }
}

This way, DefaultPolicy::pick could use ReplicasOrderedIter to quickly find the first (not down) replica in a ring-wise order. Changing [NodeRef] to [(NodeRef, Token)] in ReplicationInfo and PrecomputedReplicas should not be too difficult, but would probably introduce breaking changes to the locator API.

@piodul @wprzytula what do you think about this?

@piodul (Collaborator) commented May 4, 2023

> This way, DefaultPolicy::pick could use ReplicasOrderedIter to quickly find the first (not down) replica in a ring-wise order.
> Changing [NodeRef] to [(NodeRef, Token)] in ReplicationInfo and PrecomputedReplicas should not be too difficult, but would probably introduce breaking changes to the locator API.

If we keep [(NodeRef, Token)] instead of [NodeRef], we would double our metadata size, which is something I would like to avoid. Maybe, instead of iterating over slices of [(NodeRef, Token)], we could compute the replicas for the datacenter on demand? If, in the general case, we only need to choose the first replica, then perhaps we can compute it cheaply.

@wprzytula (Collaborator, Author) commented:

> @piodul @wprzytula what do you think about this?

Not answering your question yet:

do you, @havaker @piodul, consider the cost incurred by LWT queries in the current PR state acceptable for the fallback case?
I mean, do we only have to look for something better in the pick case, keeping the fallback case as-is?

@wprzytula (Collaborator, Author) commented:

After a private conversation with @piodul and @havaker, the conclusion is to:

  • settle on pick() always being a cheap (or, more precisely, non-allocating) function (if replica precomputation is turned on),
  • rework pick() in the case of an LWT request to only return Some(replica) if it can be computed cheaply, else None,
  • change Plan to call fallback() even if pick() returned None,
  • make fallback() do all the expensive computations needed to compute non-primary global replicas in ring order (see the sketch below).
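A hedged sketch of the agreed pick() behaviour for LWT statements (names simplified; `is_down` stands in for the policy's liveness check and is hypothetical):

fn pick_lwt<'a>(mut ordered: ReplicasOrderedIterator<'a>) -> Option<NodeRef<'a>> {
    // Cheap: finding the primary replica walks the ring without allocating.
    let primary = ordered.next()?;
    if is_down(primary) {
        // Any further replica would require materialising the whole ordered
        // set, so return None and let Plan call fallback() instead.
        return None;
    }
    Some(primary)
}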

@wprzytula force-pushed the lwt-optimisation branch 2 times, most recently from 443e19a to 86142a6 on May 12, 2023 16:57
@wprzytula requested a review from piodul on May 12, 2023 17:43
@wprzytula (Collaborator, Author) commented:

v2:

  • addressed @piodul's comments
  • took the approach described above:
    • settle on pick() always being a cheap (or, more precisely, non-allocating) function (if replica precomputation is turned on),
    • rework pick() in the case of an LWT request to only return Some(replica) if it can be computed cheaply, else None,
    • change Plan to call fallback() even if pick() returned None,
    • make fallback() do all the expensive computations needed to compute non-primary global replicas in ring order.

@wprzytula requested a review from havaker on May 18, 2023 08:50
@wprzytula force-pushed the lwt-optimisation branch 2 times, most recently from f88a4e8 to 395b7f4 on May 24, 2023 06:26
@wprzytula added the performance (Improves performance of existing features) label on May 24, 2023
@wprzytula added this to the 0.8.2 milestone on May 24, 2023
@wprzytula force-pushed the lwt-optimisation branch 2 times, most recently from 430a748 to 5dd9f74 on May 28, 2023 17:56
@piodul (Collaborator) left a comment

Looks good. I had hoped it would require less code, but it seems this is the best we can do with the current design.

It looks like you need to update the examples in the documentation with respect to the last commit in order for the CI to pass.

// (in Node only `down_marker` is such, being an AtomicBool).
// https://rust-lang.github.io/rust-clippy/master/index.html#mutable_key_type
#[allow(clippy::mutable_key_type)]
let mut all_replicas: HashSet<&'a Arc<Node>> = HashSet::new();
@piodul (Collaborator) commented:

I think that we should just drop the impl {Eq,Hash} for Node in the future (not in this PR). Those impls don't take the whole struct into consideration, as Eq impls usually do, so it's weird. We should explicitly use a HashMap and index by host id; a sketch of that direction follows.
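A minimal sketch of that suggested direction (an assumption about the shape, not part of this PR; `replica_iter` is a hypothetical source of nodes, and `host_id` is the node's Uuid identifier):

use std::collections::HashMap;
use std::sync::Arc;
use uuid::Uuid;

// Index nodes explicitly by host id instead of relying on Node's
// partial Eq/Hash impls.
let mut all_replicas: HashMap<Uuid, &Arc<Node>> = HashMap::new();
for node in replica_iter {
    all_replicas.insert(node.host_id, node);
}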

wprzytula added 9 commits May 31, 2023 11:14
The new name better defines the semantics of this type: we only provide the criteria that we want taken into account, without specifying the actual data (the replica's location) to be compared against.
This is a simple refactor to avoid code duplication.
It is clearer not to use early returns but to use the functional style instead.
This is another refactor that aims to reduce code duplication.
As the case of providing a preferred rack without specifying the preferred datacenter is invalid, in the spirit of the principle “Make illegal states unrepresentable”, an enum is introduced to restrict the possible cases. So far it is only used in DefaultPolicy, but in the next major release DefaultBuilder's API should be changed accordingly to disallow creating such a configuration.

It appears that there was a bug which involved incorrect behaviour when a preferred rack was provided without setting the preferred datacenter: instead of ignoring the preferred rack, replicas were filtered by it. Now the following is true: it is no longer possible to turn on rack-awareness without dc-awareness; attempts to do so will result in a fallback to non-rack-awareness.
It is cleaner when the Criteria itself carries the information required for filtering, and it is less error-prone (because one has to provide the required DC/rack name).
As a bonus, the semantics of the helper fn calls in `pick()` were made saner: `pick_replica()` is called with `ReplicaLocationCriteria::Datacenter` only if the preferred datacenter is actually specified. `fallback()` was modified analogously (with regard to calling `shuffled_replicas()`).
We want to keep regarding pick() as a cheap, happy-path function which does not allocate. In the LWT optimisation, however, only the primary replica can be computed without allocations (due to the specifics of Network Topology Strategy, replicas are not assigned greedily; instead, their distribution across racks is balanced). If a policy (in our case, the default policy) recognizes that a picked replica is down, it would try to pick another, and this computation would be expensive for ReplicasOrdered. Instead, having recognized in the LWT-optimised case that a picked replica is down, the policy returns None from pick() to hint that further computation will be expensive.
The Plan logic is hence altered to call fallback() even if pick() returns None.
In the non-LWT case, as subsequent calls to pick() are still cheap, pick() will still try to find the next replicas if one is recognized to be down.

A test is added. It asserts that `fallback()` is called if `pick()` returned None.
To make the test possible to write, a convenience constructor is added for `Node` under `cfg(test)`.
In order to be consistent with the LWT optimisation implementation in other drivers, it is mandatory to return a query plan consisting of replicas put in ring order. To accomplish that, the (abstractly unordered) ReplicaSet becomes convertible into a newly defined ReplicasOrdered struct. Iterating over ReplicasOrdered yields replicas in ring order, so policies can depend on it to perform the LWT optimisation correctly.

Note that the computation of ReplicasOrdered is lazy and performed in two steps:
- the first step is relatively cheap and free of allocations; it involves finding the primary replica by iterating over the global ring. It is triggered by the first call to ReplicasOrderedIterator::next().
- the second step is relatively expensive and involves allocations. It computes the whole ReplicaArray of all replicas, put in ring order.

Precomputation is not an option, as it would be very expensive to compute and store all precomputed arrays that would respect the global ring order (as opposed to only the local, per-DC ring order).
wprzytula added 7 commits May 31, 2023 11:14
For the LWT optimisation, we want to request replicas in ring order. For other cases (non-LWT requests) the order:
- either does not matter (it can be arbitrary),
- or is required to be random, but then the replicas are shuffled anyway, so before shuffling they can be given, again, in arbitrary order.

This leads to introducing the enum ReplicaOrder with Arbitrary and RingOrder variants.
For strong typing, an enum StatementType is introduced with Lwt and NonLwt variants. If the executed query is a confirmed LWT, then it will be routed to replicas in ring order (both enums are sketched below).
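A sketch of the two enums named in these commits (variant names taken from the commit messages; everything else is illustrative):

// Order in which replicas are requested from the locator.
enum ReplicaOrder {
    Arbitrary,
    RingOrder,
}

// Whether the routed statement is a confirmed LWT.
enum StatementType {
    Lwt,
    NonLwt,
}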
In the case of executing LWT statements, we expect the policy to return plans in an order exactly matching the ring order. To verify this, a new ExpectedGroup variant is added that is fed with a sequence of nodes put in the expected ring order.
(going) though->through (the ring)
These test that when an LWT statement is executed, the policy returns replicas in the fixed ring order.

The logger is no longer initialised in the `test_default_policy_with_token_aware_statements` test, so as not to spam the console with (expected) errors about returning an empty plan.
@wprzytula requested a review from piodul on May 31, 2023 10:02
@piodul merged commit 76acf4f into scylladb:main on May 31, 2023
@wprzytula deleted the lwt-optimisation branch on March 5, 2024 13:05