
refactor(batch): fine-grained serving vnode mapping rebalance #9718

Closed
wants to merge 8 commits into main from wangzheng/serving_worker_replica

Conversation

zwang28
Contributor

@zwang28 zwang28 commented May 10, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR makes two improvements related to the vnode mapping strategy for serving:

  1. rebalance: Inspired by Straw Buckets in Ceph's CRUSH, we use a similar idea to minimize data relocation after the addition or removal of serving worker nodes (see the sketch right after this list).
  2. availability: When a worker node is likely encountering a network issue, we temporarily mark it as failed and use a new temporary serving vnode mapping that excludes all failed worker nodes. This way we can achieve high availability for the serving cluster. Note that we don't implement auto retry in the frontend, so when a transient error happens, the first query will fail and the following queries will succeed.
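
Below is a minimal sketch of the straw-bucket / highest-random-weight idea behind point 1. The names (`map_vnode_sketch`, `straw`) are hypothetical and this is not the PR's actual `map_vnode_for_serving`; it only illustrates the technique: every candidate worker draws a pseudorandom "straw" from a hash of (fragment, vnode, worker), and the longest straw wins, so adding or removing a worker only moves the vnodes whose winner changes.

```rust
// Hypothetical sketch of straw-bucket-style assignment (illustrative only, not the PR's code).
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pseudorandom "straw length" for a (fragment, vnode, worker) triple.
fn straw(fragment_id: u32, vnode: u16, worker_id: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    (fragment_id, vnode, worker_id).hash(&mut hasher);
    hasher.finish()
}

/// Deterministically assigns each vnode to the worker with the longest straw.
/// The result depends only on the worker set, so every frontend computes the same mapping,
/// and removing one worker only reassigns the vnodes that worker used to win.
fn map_vnode_sketch(fragment_id: u32, vnode_count: u16, workers: &[u32]) -> Vec<u32> {
    (0..vnode_count)
        .map(|vnode| {
            *workers
                .iter()
                .max_by_key(|&&worker| straw(fragment_id, vnode, worker))
                .expect("at least one serving worker")
        })
        .collect()
}
```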

#8940

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

  • My PR DOES NOT contain user-facing changes.

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

@zwang28 zwang28 changed the title refactor(batch): maintain serving vnode mapping in meta node refactor(batch): fine-grained serving vnode mapping rebalance May 17, 2023
@zwang28 zwang28 force-pushed the wangzheng/serving_worker_replica branch from 8c5f20c to eb4a53d Compare May 17, 2023 06:18
Contributor

@github-actions github-actions bot left a comment


license-eye has totally checked 3426 files.

Valid   Invalid   Ignored   Fixed
1556    1         1869      0
Invalid file list:
  • src/tests/simulation/tests/it/batch/mod.rs

@zwang28 zwang28 force-pushed the wangzheng/serving_worker_replica branch from 3f5edf8 to b20a42a Compare May 18, 2023 06:22
@zwang28
Contributor Author

zwang28 commented May 19, 2023

Compared with maintaining the serving vnode mapping in the meta node, I prefer the current decentralized way (calculated by the frontend on demand):

  1. If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused problems.
  2. Unlike the streaming vnode mapping, which must be followed strictly, the serving vnode mapping is just a hint for better data locality, because we can actually access checkpoint data from any compute node.

@zwang28 zwang28 marked this pull request as ready for review May 19, 2023 10:02
@codecov

codecov bot commented May 19, 2023

Codecov Report

Merging #9718 (f3823ae) into main (6e22bec) will increase coverage by 0.00%.
The diff coverage is 79.51%.

@@           Coverage Diff            @@
##             main    #9718    +/-   ##
========================================
  Coverage   71.10%   71.11%            
========================================
  Files        1250     1250            
  Lines      209199   209405   +206     
========================================
+ Hits       148755   148920   +165     
- Misses      60444    60485    +41     
Flag Coverage Δ
rust 71.11% <79.51%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/frontend/src/lib.rs 22.72% <ø> (ø)
src/frontend/src/scheduler/distributed/stage.rs 15.32% <0.00%> (-0.39%) ⬇️
src/frontend/src/session.rs 27.22% <14.28%> (-0.09%) ⬇️
src/frontend/src/scheduler/worker_node_manager.rs 76.17% <88.34%> (+7.28%) ⬆️

... and 11 files with indirect coverage changes


@zwang28 zwang28 marked this pull request as draft May 22, 2023 04:24
@zwang28
Contributor Author

zwang28 commented May 22, 2023

The core function map_vnode_for_serving takes some parallel units and maps all vnodes to them deterministically. As map_vnode_for_serving only leads to a probabilistically balanced distribution, I ran the micro-benchmark bench_map_vnode_for_serving to illustrate how it performs in practice. Each run of the benchmark maps 256 vnodes to n parallel units for 1000 fragments, with random fragment ids and random parallel unit ids.

In short, the vnode distribution it produces is not as balanced as expected.

The ratios below show how many of the given parallel units have been assigned at least one vnode. Take the worst case under "256 parallel units": 57.8125% roughly means that half of the available parallel units are assigned 2 vnodes each on average, while the other half are not used at all. Ideally, each available parallel unit should be assigned exactly 1 vnode.

256 parallel units:

1000 runs
min 148/256 57.8125%
p10 155/256 60.546875%
p50 162/256 63.28125%
p90 168/256 65.625%
max 176/256 68.75%

128 parallel units:

1000 runs
min 98/128 76.5625%
p10 107/128 83.59375%
p50 111/128 86.71875%
p90 115/128 89.84375%
max 120/128 93.75%

64 parallel units:

1000 runs
min 59/64 92.1875%
p10 61/64 95.3125%
p50 63/64 98.4375%
p90 64/64 100%
max 64/64 100%
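
For reference, here is a rough sketch of how the coverage statistic above can be computed (a hypothetical helper, not the actual bench_map_vnode_for_serving code): given the 256-entry vnode-to-parallel-unit mapping produced for one fragment, it reports the fraction of candidate parallel units that received at least one vnode.

```rust
// Hypothetical helper (not the actual benchmark code) computing the coverage statistic:
// the fraction of candidate parallel units that received at least one of the 256 vnodes.
use std::collections::HashSet;

fn coverage_ratio(candidate_pus: &[u32], vnode_to_pu: &[u32]) -> f64 {
    let used: HashSet<u32> = vnode_to_pu.iter().copied().collect();
    let covered = candidate_pus.iter().filter(|pu| used.contains(*pu)).count();
    covered as f64 / candidate_pus.len() as f64
}
// e.g. a ratio of ~0.58 with 256 parallel units means roughly 42% of them got no vnode at all.
```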

@zwang28
Contributor Author

zwang28 commented May 22, 2023

Let me summarize the current situation. There are 3 design goals for the serving vnode mapping strategy:

  1. deterministic output vnode mapping given the same set of parallel units, across frontends.
  2. balanced distribution of vnodes among parallel units.
  3. minimal data movement due to the addition or removal of parallel units.

If we maintain the mapping in the meta node, all 3 goals can be achieved easily. However, I still prefer the current decentralized way (maintaining the mapping in the frontend), to avoid increasing complexity in meta:

  • If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused problems.
  • Unlike the streaming vnode mapping, which must be followed strictly, the serving vnode mapping is just a hint for better data locality, because we can actually access checkpoint data from any compute node.

With regard to the decentralized way (maintaining the mapping in the frontend), I find it hard to achieve all 3 goals at the same time:

  • The round-robin approach in the current main branch optimizes for goal#1 and goal#2 but totally ignores goal#3, i.e. the frontend re-calculates all vnode mappings whenever the parallel units change (a rough sketch of this approach follows the list below).
  • The pseudorandom approach in this PR optimizes for all 3 goals, but doesn't achieve a satisfying result for goal#2 in the benchmark.
  • If we adopt a rebalance approach in the frontend, similar to what streaming does whenever the parallel units change, it doesn't achieve goal#1, i.e. calc_mapping(pu1, pu2, pu4) doesn't necessarily equal calc_mapping(pu1, pu2, pu3) + rebalance_add(pu4) + rebalance_remove(pu3).
    • The difficulty of making it deterministic originates from the fact that there is another flexible variable, serving_parallelism, to be considered, compared with streaming's already-known total number of parallel units. Goal#1 is the most crucial, so this approach is rejected.
    • However, we can use this approach to generate a temporary mapping with minimal data movement when there is a transient compute node failure, or when we want a table scan parallelism different from the default one.
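
For comparison, a rough sketch of the round-robin approach mentioned in the first bullet (hypothetical code, not the actual main-branch implementation): it is deterministic and evenly balanced for a given worker set (goal#1 and goal#2), but changing the worker set shifts almost every assignment, so goal#3 is not addressed.

```rust
// Hypothetical sketch of the round-robin assignment (not the actual main-branch code).
fn round_robin_mapping(vnode_count: usize, workers: &[u32]) -> Vec<u32> {
    assert!(!workers.is_empty());
    let mut sorted = workers.to_vec();
    sorted.sort_unstable(); // deterministic across frontends for the same worker set (goal#1)
    (0..vnode_count)
        .map(|vnode| sorted[vnode % sorted.len()]) // evenly spread vnodes (goal#2)
        .collect()
    // Adding or removing a single worker shifts almost every assignment, so goal#3 is ignored.
}
```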

My opinion is that, as addition or removal of parallel units is rare, the round-robin approach in the current main branch should suffice. So I'd like to hold this PR and the related improvement. We can refactor to the centralized way (maintained in the meta node) later as more use cases occur.

@fuyufjh @liurenjie1024 @hzxa21 Any comments?

@BugenZhao BugenZhao self-requested a review May 24, 2023 08:08
@BugenZhao
Member

  • If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused problems.

As you've mentioned, the "vnode mapping is just a hint" here. So is it possible that we make it centralized to the meta service but not persisted?

  • Once there's some configuration change for the serving nodes in the cluster, the meta node can sense this and broadcast the new mapping.
  • If the meta service fails, we can rebuild the mapping from scratch.
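
A minimal sketch of this "centralized but not persisted" idea, with hypothetical names and structure (not the actual meta-service code): the meta node keeps the serving mappings only in memory, recomputes them whenever the set of serving workers changes, and returns the new snapshot for broadcasting; after a meta restart the mappings are simply rebuilt from the current worker list.

```rust
// Hypothetical sketch (not the actual meta-service code) of an in-memory, non-persisted
// serving vnode mapping kept by the meta node.
use std::collections::HashMap;
use std::sync::Mutex;

type FragmentId = u32;
type WorkerId = u32;
type VnodeMapping = Vec<WorkerId>; // one worker per vnode

#[derive(Default)]
struct ServingVnodeMappings {
    mappings: Mutex<HashMap<FragmentId, VnodeMapping>>,
}

impl ServingVnodeMappings {
    /// Recompute all mappings from the current serving workers and return the snapshot,
    /// so the caller can broadcast it to frontends. Nothing is written to the meta store,
    /// so after a meta restart the mappings are rebuilt from the current worker list.
    fn on_serving_workers_changed(
        &self,
        fragments: &[FragmentId],
        workers: &[WorkerId],
        compute_mapping: impl Fn(FragmentId, &[WorkerId]) -> VnodeMapping,
    ) -> HashMap<FragmentId, VnodeMapping> {
        let new: HashMap<_, _> = fragments
            .iter()
            .map(|&fragment| (fragment, compute_mapping(fragment, workers)))
            .collect();
        *self.mappings.lock().unwrap() = new.clone();
        new
    }
}
```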

@zwang28
Contributor Author

zwang28 commented May 24, 2023

make it centralized to the meta service but not persisted

Makes sense.
Although the mapping rebuilt from scratch may not exactly match the one before meta failed, the difference should be small.

@liurenjie1024
Contributor

I prefer the centralized approach in meta for several reasons:

  1. Easier to implement and debug. Please note that in production, when something unexpected (e.g. high latency) happens, we need an easy way to inspect the reason and explain it.
  2. Adding/removing parallel units is not the only reason for rebalancing. There are at least two other cases where we may need to rebalance:
    • The requests to vnodes/partitions may be unbalanced. For example, when a worker node hosts many hot vnodes, we may need to move some partitions to other worker nodes.
    • The data size of vnodes/partitions may be unbalanced. For example, when a worker node has several large vnodes, we may need to move some partitions to other worker nodes.
  3. As for "Unlike the streaming vnode mapping, which must be followed strictly, the serving vnode mapping is just a hint for better data locality, because we can actually access checkpoint data from any compute node": it can guarantee correctness, but it can't guarantee stable performance when serving user requests. Please note that our serving path is designed for online dashboards, so predictable latency is crucial for our customers.

If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused #7728.

For this problem, I think there are two approaches:

  1. As @BugenZhao said, we may not need to save the mapping, but instead wait for worker nodes to report it.
  2. I think the problem you mentioned is an implementation problem rather than a design problem; we can split the request into smaller chunks to solve it.

@zwang28
Contributor Author

zwang28 commented May 25, 2023

Replaced with #10004.

@zwang28 zwang28 closed this May 25, 2023
@zwang28 zwang28 deleted the wangzheng/serving_worker_replica branch June 19, 2023 08:17