
LWT routing optimisation: plan contains replicas in ring order #717

Merged May 31, 2023 (16 commits)

Conversation

@wprzytula (Collaborator) commented Apr 27, 2023

While routing statements that are confirmed LWTs, replicas are never shuffled but are instead attempted in ring order. This is done in accordance with scylladb/gocql#40.

ReplicaSet currently

In the case of Network Topology Strategy, the replica iterator we obtain from the ReplicaLocator (ReplicaSetIterator) does not yield replicas in ring order.

Meet ReplicasOrdered

To support that, a ReplicasOrdered struct is added, constructible from a ReplicaSet. It is also convertible into a ReplicasOrderedIterator, which does yield replicas in ring order.
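A minimal usage sketch (an illustration, not code from this PR), assuming `ReplicasOrdered` implements `IntoIterator` and using the `into_replicas_ordered` conversion mentioned later in this thread:

// `replica_set: ReplicaSet` is obtained from the ReplicaLocator.
let ordered: ReplicasOrdered = replica_set.into_replicas_ordered();
for replica in ordered {
    // Replicas are yielded in ring order, as the LWT optimisation requires.
    let _ = replica;
}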

Costs incurred

The first call to next() on a ReplicasOrderedIterator is cheap, because the primary replica can be computed cheaply; this suits the pick() method.
The second call to next() is expensive, because all the replicas are materialised in a vector (there is no cheap way to compute the second replica and so on). All subsequent calls are cheap, because the replicas have already been computed. This suits fallback(), whose cost is expected to be high. Because of this characteristic, a semantic change was made to Plan (and to LoadBalancingPolicy, as a result).
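One plausible shape of the iterator's internal state that yields this cost profile (a sketch only, not the actual implementation; NodeRef<'a> is the driver's alias for &'a Arc<Node>):

// Illustrative state machine behind the two-step laziness.
enum ReplicasOrderedIteratorState<'a> {
    // Before the first next(): nothing has been computed yet.
    Fresh,
    // After the first next(): the primary replica was found cheaply by
    // walking the global ring; nothing was allocated.
    PrimaryReturned,
    // After the second next(): all replicas were materialised in ring
    // order, so all subsequent calls are cheap indexed lookups.
    AllComputed { replicas: Vec<NodeRef<'a>>, next_idx: usize },
}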

Plan semantics change

Before, pick() could return None only to signify that the plan is empty, so fallback() was not even called in that case.
Now, pick() returning None signifies that the first node cannot be computed cheaply. fallback() is called even then, for it is still possible to compute the first node (and the others) at a higher cost.

Approach taken

If the first replica computed by pick() is believed to be down, then pick() doesn't attempt to compute any more replicas, but returns None. fallback() is then called and computes all the remaining replicas at once.
Thanks to that, pick() remains cheap. A sketch of the resulting control flow follows.
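A hedged sketch of the new Plan control flow (simplified; the real pick() and fallback() also take routing and cluster arguments, and `try_node` here is a hypothetical consumer):

// None from pick() no longer means "empty plan"; it means "the cheap
// path failed", so the expensive fallback() is still consulted.
match policy.pick(query, cluster) {
    // Cheap happy path: the first node was computed without allocating.
    Some(node) => try_node(node),
    None => {
        // Expensive path: may allocate, but can still yield nodes.
        for node in policy.fallback(query, cluster) {
            try_node(node);
        }
    }
}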

Fixes: #672

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula requested review from havaker and cvybhu on April 27, 2023 15:27
@piodul (Collaborator) left a comment

I don't like that LWT queries become so expensive (at least on the driver side; server-side it might still be a win). AFAIK SimpleStrategy is being deprecated, so if someone uses LWT, then the LWT + NTS combo is going to be frequent for them. I don't agree with the assertion that "LWTs are rare" - maybe LWT users are rare, but LWT queries aren't rare for them.

It's fine for computing replicas to be expensive in the fallback case, but not when we want to pick the first replica. As I understand it, this was the big idea behind the new design of the load balancing module.

If implementing an efficient pick() is still difficult, then we can consider changing the behavior of the optimization: make sure that the rust driver always chooses coordinators in some deterministic order that may be specific to the driver itself but quicker to compute. Such an implementation should work well if a user only uses the rust driver and nothing else (at least for LWT queries on some particular table); however, I'd rather do that as a last resort and revisit this problem later.

@havaker (Contributor) commented Apr 28, 2023

> I don't like that LWT queries become so expensive (at least on the driver side; server-side it might still be a win). AFAIK SimpleStrategy is being deprecated, so if someone uses LWT, then the LWT + NTS combo is going to be frequent for them. I don't agree with the assertion that "LWTs are rare" - maybe LWT users are rare, but LWT queries aren't rare for them.
>
> It's fine for computing replicas to be expensive in the fallback case, but not when we want to pick the first replica. As I understand it, this was the big idea behind the new design of the load balancing module.
>
> If implementing an efficient pick() is still difficult, then we can consider changing the behavior of the optimization: make sure that the rust driver always chooses coordinators in some deterministic order that may be specific to the driver itself but quicker to compute. Such an implementation should work well if a user only uses the rust driver and nothing else (at least for LWT queries on some particular table); however, I'd rather do that as a last resort and revisit this problem later.

It might be possible to implement a ring-wise ReplicasOrdered more efficiently, but it would require bigger changes to the locator module. The problem I see now is that for the ChainedNTS variant of ReplicaSet, we construct ReplicasOrdered eagerly - the whole order is materialized into a Vec each time ReplicaSet::into_replicas_ordered is called. This is a performance problem in pick, because we usually want only the first replica in that order (if there are down nodes, we might need to filter some out).

I've got an idea: do as little as possible on ReplicasOrdered creation, and move the order computation to ReplicasOrderedIter::next. If we could precompute [(NodeRef, Token)] lists instead of [NodeRef] lists, merging a few precomputed lists in ring-wise order could be done lazily:

struct ReplicasOrderedIter<'a> {
    // Per-list cursors: indexes[i] is the current position in replica_lists[i].
    indexes: SmallVec<[usize; 8]>,

    replica_lists: SmallVec<[&'a [(NodeRef<'a>, Token)]; 8]>,
    // replica_lists could also be represented like in `ReplicaSet::ChainedNTS`,
    // but care must be taken when handling non-precomputed ones.
}

impl<'a> Iterator for ReplicasOrderedIter<'a> {
    type Item = NodeRef<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        // Using indexes[i], select (replica, token) from the i-th replica_list.
        // Return the replica with the lowest token among the selected ones.
        // Advance the index that pointed to the selected replica.
        todo!()
    }
}

This way, DefaultPolicy::pick could use ReplicasOrderedIter to quickly find the first (not down) replica in a ring-wise order. Changing [NodeRef] to [(NodeRef, Token)] in ReplicationInfo and PrecomputedReplicas should not be too difficult, but would probably introduce breaking changes to the locator API.

@piodul @wprzytula what do you think about this?

@piodul (Collaborator) commented May 4, 2023

> This way, DefaultPolicy::pick could use ReplicasOrderedIter to quickly find the first (not down) replica in a ring-wise order.
> Changing [NodeRef] to [(NodeRef, Token)] in ReplicationInfo and PrecomputedReplicas should not be too difficult, but would probably introduce breaking changes to the locator API.

If we keep [(NodeRef, Token)] instead of [NodeRef], we would double our metadata size, which is something I would like to avoid. Maybe, instead of iterating over slices of [(NodeRef, Token)], we could compute the replicas for the datacenter on demand? If, in the general case, we only need to choose the first replica, then perhaps we can compute it cheaply.

@wprzytula (Collaborator, Author) commented:

> @piodul @wprzytula what do you think about this?

Not answering your question yet:

do you, @havaker @piodul, consider the cost incurred by LWT queries in the current PR state acceptable for the fallback case?
I mean, do we only have to look for something better in the pick case, keeping the fallback case as-is?

@wprzytula (Collaborator, Author) commented:

After a private conversation with @piodul and @havaker, the conclusion is to:

  • settle on pick() always being a cheap (or, more precisely, non-allocating) function (if replica precomputation is turned on),
  • rework pick() in the case of an LWT request to only return Some(replica) if it can be computed cheaply, else None,
  • change Plan to call fallback() even if pick() returned None,
  • make fallback() do all the expensive computations needed to compute non-primary global replicas in ring order (see the sketch below).
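A hedged sketch of the agreed pick() behaviour for LWT statements (names simplified; `is_down` stands in for the policy's liveness check and is hypothetical):

fn pick_lwt<'a>(mut ordered: ReplicasOrderedIterator<'a>) -> Option<NodeRef<'a>> {
    // Cheap: finding the primary replica walks the ring without allocating.
    let primary = ordered.next()?;
    if is_down(primary) {
        // Any further replica would require materialising the whole ordered
        // set, so return None and let Plan call fallback() instead.
        return None;
    }
    Some(primary)
}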

@wprzytula force-pushed the lwt-optimisation branch 2 times, most recently from 443e19a to 86142a6 on May 12, 2023 16:57
@wprzytula requested a review from piodul on May 12, 2023 17:43
@wprzytula (Collaborator, Author) commented:

v2:

  • addressed @piodul's comments
  • took the approach described above:
    • settle on pick() always being a cheap (or, more precisely, non-allocating) function (if replica precomputation is turned on),
    • rework pick() in the case of an LWT request to only return Some(replica) if it can be computed cheaply, else None,
    • change Plan to call fallback() even if pick() returned None,
    • make fallback() do all the expensive computations needed to compute non-primary global replicas in ring order.

@wprzytula requested a review from havaker on May 18, 2023 08:50
@wprzytula force-pushed the lwt-optimisation branch 2 times, most recently from f88a4e8 to 395b7f4 on May 24, 2023 06:26
@wprzytula added the performance (Improves performance of existing features) label on May 24, 2023
@wprzytula added this to the 0.8.2 milestone on May 24, 2023
@wprzytula force-pushed the lwt-optimisation branch 2 times, most recently from 430a748 to 5dd9f74 on May 28, 2023 17:56
@piodul (Collaborator) left a comment

Looks good. I had hoped it would require less code, but it seems this is the best we can do with the current design.

It looks like you need to update the examples in the documentation with respect to the last commit in order for the CI to pass.

// (in Node only `down_marker` is such, being an AtomicBool).
// https://rust-lang.github.io/rust-clippy/master/index.html#mutable_key_type
#[allow(clippy::mutable_key_type)]
let mut all_replicas: HashSet<&'a Arc<Node>> = HashSet::new();
@piodul (Collaborator) commented:

I think that we should just drop the impl {Eq,Hash} for Node in the future (not in this PR). Those impls don't take the whole struct into consideration, as Eq impls usually do, so it's weird. We should explicitly use a HashMap and index by host id; a sketch of that direction follows.
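A minimal sketch of that suggested direction (an assumption about the shape, not part of this PR; `replica_iter` is a hypothetical source of nodes, and `host_id` is the node's Uuid identifier):

use std::collections::HashMap;
use std::sync::Arc;
use uuid::Uuid;

// Index nodes explicitly by host id instead of relying on Node's
// partial Eq/Hash impls.
let mut all_replicas: HashMap<Uuid, &Arc<Node>> = HashMap::new();
for node in replica_iter {
    all_replicas.insert(node.host_id, node);
}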

wprzytula added 9 commits May 31, 2023 11:14
The new name better defines the semantics of this type: we only provide the criteria that we want taken into account, without specifying the actual data (the replica's location) to be compared against.
This is a simple refactor to avoid code duplication.
It is clearer not to use early returns but to use the functional style instead.
This is another refactor that aims to reduce code duplication.
As the case of providing a preferred rack without specifying the preferred datacenter is invalid, in the spirit of the principle “Make illegal states unrepresentable”, an enum is introduced to restrict the possible cases. So far it is only used in DefaultPolicy, but in the next major release DefaultBuilder's API should be changed accordingly to disallow creating such a configuration.

It appears that there was a bug which involved incorrect behaviour when a preferred rack was provided without setting the preferred datacenter: instead of ignoring the preferred rack, replicas were filtered by it. Now the following is true: it is no longer possible to turn on rack-awareness without dc-awareness; attempts to do so will result in a fallback to non-rack-awareness.
It is cleaner when the Criteria itself carries the information required for filtering, and it is less error-prone (because one has to provide the required DC/rack name).
As a bonus, the semantics of the helper fn calls in `pick()` were made saner: `pick_replica()` is called with `ReplicaLocationCriteria::Datacenter` only if the preferred datacenter is actually specified. `fallback()` was modified analogously (with regard to calling `shuffled_replicas()`).
We want to keep regarding pick() as a cheap, happy-path function which does not allocate. In the LWT optimisation, however, only the primary replica can be computed without allocations (due to the specifics of Network Topology Strategy, replicas are not assigned greedily; instead, their distribution across racks is balanced). If a policy (in our case, the default policy) recognizes that a picked replica is down, it would try to pick another, and this computation would be expensive for ReplicasOrdered. Instead, having recognized in the LWT-optimised case that a picked replica is down, the policy returns None from pick() to hint that further computation will be expensive.
The Plan logic is hence altered to call fallback() even if pick() returns None.
In the non-LWT case, as subsequent calls to pick() are still cheap, pick() will still try to find the next replicas if one is recognized to be down.

A test is added. It asserts that `fallback()` is called if `pick()` returned None.
To make the test possible to write, a convenience constructor is added for `Node` under `cfg(test)`.
In order to be consistent with the LWT optimisation implementation in other drivers, it is mandatory to return a query plan consisting of replicas put in ring order. To accomplish that, the (abstractly unordered) ReplicaSet becomes convertible into a newly defined ReplicasOrdered struct. Iterating over ReplicasOrdered yields replicas in ring order, so policies can depend on it to perform the LWT optimisation correctly.

Note that the computation of ReplicasOrdered is lazy and performed in two steps:
- the first step is relatively cheap and free of allocations; it involves finding the primary replica by iterating over the global ring. It is triggered by the first call to ReplicasOrderedIterator::next().
- the second step is relatively expensive and involves allocations. It computes the whole ReplicaArray of all replicas, put in ring order.

Precomputation is not an option, as it would be very expensive to compute and store all precomputed arrays that would respect the global ring order (as opposed to only the local, per-DC ring order).
wprzytula added 7 commits May 31, 2023 11:14
For the LWT optimisation, we want to request replicas in ring order. For other cases (non-LWT requests) the order:
- either does not matter (it can be arbitrary),
- or is required to be random, but then the replicas are shuffled anyway, so before shuffling they can be given, again, in arbitrary order.

This leads to introducing the enum ReplicaOrder with Arbitrary and RingOrder variants.
For strong typing, an enum StatementType is introduced with Lwt and NonLwt variants. If the executed query is a confirmed LWT, then it will be routed to replicas in ring order (both enums are sketched below).
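A sketch of the two enums named in these commits (variant names taken from the commit messages; everything else is illustrative):

// Order in which replicas are requested from the locator.
enum ReplicaOrder {
    Arbitrary,
    RingOrder,
}

// Whether the routed statement is a confirmed LWT.
enum StatementType {
    Lwt,
    NonLwt,
}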
In the case of executing LWT statements, we expect the policy to return plans in an order exactly matching the ring order. To verify this, a new ExpectedGroup variant is added that is fed with a sequence of nodes put in the expected ring order.
(going) though->through (the ring)
These test that when an LWT statement is executed, the policy returns replicas in the fixed ring order.

The logger is no longer initialised in the `test_default_policy_with_token_aware_statements` test, so as not to spam the console with (expected) errors about returning an empty plan.
@wprzytula requested a review from piodul on May 31, 2023 10:02
@piodul merged commit 76acf4f into scylladb:main on May 31, 2023
@wprzytula deleted the lwt-optimisation branch on March 5, 2024 13:05