-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
restore: split & scatter regions concurrently #27034
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
/sig migrate |
/cc Little-Wallace |
/run-integration-tests |
1 similar comment
/run-integration-tests |
/component br |
/run-integration-test |
/run-integration-tests |
/run-check_dev_2 So many unit tests >= 5s... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest LGTM
Co-authored-by: kennytm <[email protected]>
/merge |
This pull request has been accepted and is ready to merge. Commit hash: afe9bd3
|
@YuJuncen: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
Port of pingcap/br#1363
What problem does this PR solve?
Before, when restoring many small tables, the batcher would probably send small batch due to the so called
AutoCommit
feature of the batcher. By this, we can make the split & scatter & restore worker more active.But frequently send small batches isn't free. The split step is costly and I/O bounded for even small batches. For example, it costs about 3s to splitting 60 ranges, but restore those ranges typically costs only 1s. Then the restore worker get idle at most time. The restore hence has slowed down.
What is changed and how it works?
Instead of using a single split worker, this PR allow multi restore batches be split concurrently.
We added two hidden flags,
--batch-flush-interval
and--pd-concurrency
, the former for better tuning the behavior of batcher, the latter for tweaking the concurrent split.Also, more logs were added so the create table speed, download, ingest time cost can be observed via log.
Check List
Tests
A internal test shows, in a 190GB, 6000 tables workload, this PR can speed up the restoration: the original version takes over 2 hours for restoring, and this version takes about 30mins for restoring. The latter is nearly equal to the time cost of creating tables(see figure below).
Release note