
refactor(batch): fine-grained serving vnode mapping rebalance #9718

Closed
wants to merge 8 commits into main from wangzheng/serving_worker_replica

Conversation

zwang28
Contributor

@zwang28 zwang28 commented May 10, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR makes two improvements related to the vnode mapping strategy for serving:

  1. rebalance: Inspired by Straw Buckets in Ceph's CRUSH, we use a similar idea to minimize data relocation after the addition or removal of serving worker nodes (see the sketch right after this list).
  2. availability: When a worker node is likely encountering a network issue, we temporarily mark it as failed and use a new temporary serving vnode mapping that excludes all failed worker nodes. This way we can achieve high availability for the serving cluster. Note that we don't implement auto retry in the frontend, so when a transient error happens, the first query will fail and the following queries will succeed.
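
Below is a minimal sketch of the straw-bucket / highest-random-weight idea behind point 1. The names (`map_vnode_sketch`, `straw`) are hypothetical and this is not the PR's actual `map_vnode_for_serving`; it only illustrates the technique: every candidate worker draws a pseudorandom "straw" from a hash of (fragment, vnode, worker), and the longest straw wins, so adding or removing a worker only moves the vnodes whose winner changes.

```rust
// Hypothetical sketch of straw-bucket-style assignment (illustrative only, not the PR's code).
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pseudorandom "straw length" for a (fragment, vnode, worker) triple.
fn straw(fragment_id: u32, vnode: u16, worker_id: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    (fragment_id, vnode, worker_id).hash(&mut hasher);
    hasher.finish()
}

/// Deterministically assigns each vnode to the worker with the longest straw.
/// The result depends only on the worker set, so every frontend computes the same mapping,
/// and removing one worker only reassigns the vnodes that worker used to win.
fn map_vnode_sketch(fragment_id: u32, vnode_count: u16, workers: &[u32]) -> Vec<u32> {
    (0..vnode_count)
        .map(|vnode| {
            *workers
                .iter()
                .max_by_key(|&&worker| straw(fragment_id, vnode, worker))
                .expect("at least one serving worker")
        })
        .collect()
}
```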

#8940

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

  • My PR DOES NOT contain user-facing changes.

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

@zwang28 zwang28 changed the title refactor(batch): maintain serving vnode mapping in meta node refactor(batch): fine-grained serving vnode mapping rebalance May 17, 2023
@zwang28 zwang28 force-pushed the wangzheng/serving_worker_replica branch from 8c5f20c to eb4a53d Compare May 17, 2023 06:18
Contributor

@github-actions github-actions bot left a comment


license-eye has totally checked 3426 files.

Valid   Invalid   Ignored   Fixed
1556    1         1869      0
Invalid file list:
  • src/tests/simulation/tests/it/batch/mod.rs

@zwang28 zwang28 force-pushed the wangzheng/serving_worker_replica branch from 3f5edf8 to b20a42a Compare May 18, 2023 06:22
@zwang28
Contributor Author

zwang28 commented May 19, 2023

Compared with maintaining the serving vnode mapping in the meta node, I prefer the current decentralized way (calculated by the frontend on demand):

  1. If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused problems.
  2. Unlike the streaming vnode mapping, which must be followed strictly, the serving vnode mapping is just a hint for better data locality, because we can actually access checkpoint data from any compute node.

@zwang28 zwang28 marked this pull request as ready for review May 19, 2023 10:02
@codecov

codecov bot commented May 19, 2023

Codecov Report

Merging #9718 (f3823ae) into main (6e22bec) will increase coverage by 0.00%.
The diff coverage is 79.51%.

@@           Coverage Diff            @@
##             main    #9718    +/-   ##
========================================
  Coverage   71.10%   71.11%            
========================================
  Files        1250     1250            
  Lines      209199   209405   +206     
========================================
+ Hits       148755   148920   +165     
- Misses      60444    60485    +41     
Flag Coverage Δ
rust 71.11% <79.51%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/frontend/src/lib.rs 22.72% <ø> (ø)
src/frontend/src/scheduler/distributed/stage.rs 15.32% <0.00%> (-0.39%) ⬇️
src/frontend/src/session.rs 27.22% <14.28%> (-0.09%) ⬇️
src/frontend/src/scheduler/worker_node_manager.rs 76.17% <88.34%> (+7.28%) ⬆️

... and 11 files with indirect coverage changes


@zwang28 zwang28 marked this pull request as draft May 22, 2023 04:24
@zwang28
Contributor Author

zwang28 commented May 22, 2023

The core function map_vnode_for_serving takes some parallel units and maps all vnodes to them deterministically. As map_vnode_for_serving only leads to a probabilistically balanced distribution, I ran the micro-benchmark bench_map_vnode_for_serving to illustrate how it performs in practice. Each run of the benchmark maps 256 vnodes to n parallel units for 1000 fragments, with random fragment ids and random parallel unit ids.

In short, the vnode distribution it produces is not as balanced as expected.

The ratios below show how many of the given parallel units have been assigned at least one vnode. Take the worst case under "256 parallel units": 57.8125% roughly means that half of the available parallel units are assigned 2 vnodes each on average, while the other half are not used at all. Ideally, each available parallel unit should be assigned exactly 1 vnode.

256 parallel units:

1000 runs
min 148/256 57.8125%
p10 155/256 60.546875%
p50 162/256 63.28125%
p90 168/256 65.625%
max 176/256 68.75%

128 parallel units:

1000 runs
min 98/128 76.5625%
p10 107/128 83.59375%
p50 111/128 86.71875%
p90 115/128 89.84375%
max 120/128 93.75%

64 parallel units:

1000 runs
min 59/64 92.1875%
p10 61/64 95.3125%
p50 63/64 98.4375%
p90 64/64 100%
max 64/64 100%
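
For reference, here is a rough sketch of how the coverage statistic above can be computed (a hypothetical helper, not the actual bench_map_vnode_for_serving code): given the 256-entry vnode-to-parallel-unit mapping produced for one fragment, it reports the fraction of candidate parallel units that received at least one vnode.

```rust
// Hypothetical helper (not the actual benchmark code) computing the coverage statistic:
// the fraction of candidate parallel units that received at least one of the 256 vnodes.
use std::collections::HashSet;

fn coverage_ratio(candidate_pus: &[u32], vnode_to_pu: &[u32]) -> f64 {
    let used: HashSet<u32> = vnode_to_pu.iter().copied().collect();
    let covered = candidate_pus.iter().filter(|pu| used.contains(*pu)).count();
    covered as f64 / candidate_pus.len() as f64
}
// e.g. a ratio of ~0.58 with 256 parallel units means roughly 42% of them got no vnode at all.
```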

@zwang28
Contributor Author

zwang28 commented May 22, 2023

Let me summarize the current situation. There are 3 design goals for the serving vnode mapping strategy:

  1. deterministic output vnode mapping given the same set of parallel units, across frontends.
  2. balanced distribution of vnodes among parallel units.
  3. minimal data movement due to the addition or removal of parallel units.

If we maintain the mapping in the meta node, all 3 goals can be achieved easily. However, I still prefer the current decentralized way (maintaining the mapping in the frontend), to avoid increasing complexity in meta:

  • If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused problems.
  • Unlike the streaming vnode mapping, which must be followed strictly, the serving vnode mapping is just a hint for better data locality, because we can actually access checkpoint data from any compute node.

With regard to the decentralized way (maintaining the mapping in the frontend), I find it hard to achieve all 3 goals at the same time:

  • The round-robin approach in the current main branch optimizes for goal#1 and goal#2 but totally ignores goal#3, i.e. the frontend re-calculates all vnode mappings whenever the parallel units change (a rough sketch of this approach follows the list below).
  • The pseudorandom approach in this PR optimizes for all 3 goals, but doesn't achieve a satisfying result for goal#2 in the benchmark.
  • If we adopt a rebalance approach in the frontend, similar to what streaming does whenever the parallel units change, it doesn't achieve goal#1, i.e. calc_mapping(pu1, pu2, pu4) doesn't necessarily equal calc_mapping(pu1, pu2, pu3) + rebalance_add(pu4) + rebalance_remove(pu3).
    • The difficulty of making it deterministic originates from the fact that there is another flexible variable, serving_parallelism, to be considered, compared with streaming's already-known total number of parallel units. Goal#1 is the most crucial, so this approach is rejected.
    • However, we can use this approach to generate a temporary mapping with minimal data movement when there is a transient compute node failure, or when we want a table scan parallelism different from the default one.
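
For comparison, a rough sketch of the round-robin approach mentioned in the first bullet (hypothetical code, not the actual main-branch implementation): it is deterministic and evenly balanced for a given worker set (goal#1 and goal#2), but changing the worker set shifts almost every assignment, so goal#3 is not addressed.

```rust
// Hypothetical sketch of the round-robin assignment (not the actual main-branch code).
fn round_robin_mapping(vnode_count: usize, workers: &[u32]) -> Vec<u32> {
    assert!(!workers.is_empty());
    let mut sorted = workers.to_vec();
    sorted.sort_unstable(); // deterministic across frontends for the same worker set (goal#1)
    (0..vnode_count)
        .map(|vnode| sorted[vnode % sorted.len()]) // evenly spread vnodes (goal#2)
        .collect()
    // Adding or removing a single worker shifts almost every assignment, so goal#3 is ignored.
}
```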

My opinion is that, as addition or removal of parallel units is rare, the round-robin approach in the current main branch should suffice. So I'd like to hold this PR and the related improvement. We can refactor to the centralized way (maintained in the meta node) later as more use cases occur.

@fuyufjh @liurenjie1024 @hzxa21 Any comments?

@BugenZhao BugenZhao self-requested a review May 24, 2023 08:08
@BugenZhao
Member

  • If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused problems.

As you've mentioned, the "vnode mapping is just a hint" here. So is it possible that we make it centralized to the meta service but not persisted?

  • Once there's some configuration change for the serving nodes in the cluster, the meta node can sense this and broadcast the new mapping.
  • If the meta service fails, we can rebuild the mapping from scratch.
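
A minimal sketch of this "centralized but not persisted" idea, with hypothetical names and structure (not the actual meta-service code): the meta node keeps the serving mappings only in memory, recomputes them whenever the set of serving workers changes, and returns the new snapshot for broadcasting; after a meta restart the mappings are simply rebuilt from the current worker list.

```rust
// Hypothetical sketch (not the actual meta-service code) of an in-memory, non-persisted
// serving vnode mapping kept by the meta node.
use std::collections::HashMap;
use std::sync::Mutex;

type FragmentId = u32;
type WorkerId = u32;
type VnodeMapping = Vec<WorkerId>; // one worker per vnode

#[derive(Default)]
struct ServingVnodeMappings {
    mappings: Mutex<HashMap<FragmentId, VnodeMapping>>,
}

impl ServingVnodeMappings {
    /// Recompute all mappings from the current serving workers and return the snapshot,
    /// so the caller can broadcast it to frontends. Nothing is written to the meta store,
    /// so after a meta restart the mappings are rebuilt from the current worker list.
    fn on_serving_workers_changed(
        &self,
        fragments: &[FragmentId],
        workers: &[WorkerId],
        compute_mapping: impl Fn(FragmentId, &[WorkerId]) -> VnodeMapping,
    ) -> HashMap<FragmentId, VnodeMapping> {
        let new: HashMap<_, _> = fragments
            .iter()
            .map(|&fragment| (fragment, compute_mapping(fragment, workers)))
            .collect();
        *self.mappings.lock().unwrap() = new.clone();
        new
    }
}
```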

@zwang28
Contributor Author

zwang28 commented May 24, 2023

make it centralized to the meta service but not persisted

Makes sense.
Although the mapping rebuilt from scratch may not exactly match the one before meta failed, the difference should be small.

@liurenjie1024
Contributor

I prefer the centralized approach in meta for several reasons:

  1. Easier to implement and debug. Please note that in production, when something unexpected (e.g. high latency) happens, we need an easy way to inspect the reason and explain it.
  2. Adding/removing parallel units is not the only reason for rebalancing. There are at least two other cases where we may need to rebalance:
    • The requests to vnodes/partitions may be unbalanced. For example, when a worker node hosts many hot vnodes, we may need to move some partitions to other worker nodes.
    • The data size of vnodes/partitions may be unbalanced. For example, when a worker node has several large vnodes, we may need to move some partitions to other worker nodes.
  3. As for "Unlike the streaming vnode mapping, which must be followed strictly, the serving vnode mapping is just a hint for better data locality, because we can actually access checkpoint data from any compute node": it can guarantee correctness, but it can't guarantee stable performance when serving user requests. Please note that our serving path is designed for online dashboards, so predictable latency is crucial for our customers.

If we go the meta node way, whenever a rebalance is required, we need to do it for the mappings of all fragments and persist them. Persisting the streaming vnode mapping has already caused #7728.

For this problem, I think there are two approaches:

  1. As @BugenZhao said, we may not need to save the mapping, but instead wait for worker nodes to report it.
  2. I think the problem you mentioned is an implementation problem rather than a design problem; we can split the request into smaller chunks to solve it.

@zwang28
Contributor Author

zwang28 commented May 25, 2023

Replaced with #10004.

@zwang28 zwang28 closed this May 25, 2023
@zwang28 zwang28 deleted the wangzheng/serving_worker_replica branch June 19, 2023 08:17