Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: new slow store detecting and leader evicting #5808

Merged
merged 24 commits into from
Feb 9, 2023

Conversation

innerr
Copy link
Contributor

@innerr innerr commented Dec 28, 2022

What problem does this PR solve?

This PR is part of tikv/tikv#14000, the details would be in that PR to avoid redundancy.

Issue Number: ref #5916

What is changed and how does it work?

Same as above, this PR is part of tikv/tikv#14000, the details would be in that PR to avoid redundancy.

schedulers: new scheduler `evict-slow-trend-scheduler`, for new slow store detecting and leader evicting

Check List

Code changes

  • Has persistent data change

Side effects

  • None

Related changes

  • None (except the main PR at TiKV-repo and the related PR at KVProto-repo)

Release note

schedulers: add a new scheduler `evict-slow-trend-scheduler` that can automatically evict leaders from a slow store. It's a improved version of  `evict-slow-store-scheduler`.

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Dec 28, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • lhy1024
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 28, 2022
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Show resolved Hide resolved
@innerr
Copy link
Contributor Author

innerr commented Dec 29, 2022

@LykxSassinator All addressed, PTAL

Copy link
Contributor

@LykxSassinator LykxSassinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
server/statistics/store_collection.go Outdated Show resolved Hide resolved
server/statistics/store_collection.go Outdated Show resolved Hide resolved
server/statistics/store_collection.go Outdated Show resolved Hide resolved
server/statistics/store_collection.go Outdated Show resolved Hide resolved
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2023
@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 2, 2023
@@ -265,6 +280,8 @@ type StoreSetController interface {

SlowStoreEvicted(id uint64) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will we mark old evict-slow-store-scheduler related functions as deprecated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it's proven no longer in use.

pkg/core/store.go Outdated Show resolved Hide resolved
server/api/store.go Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Outdated Show resolved Hide resolved
// See the License for the specific language governing permissions and
// limitations under the License.

package schedulers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add some tests, similar to TestEvictSlowStoreTestSuite?

Copy link
Contributor Author

@innerr innerr Feb 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tough issue. We don't get enough tests.

For UT, tests could only make sure the routine procedure codes is working or not. (which also important and I got that missed)

But the real deal is the varity of inputs. Some might have chances to cause a false-alarm, and some should cause an event but might be wrongly tolerated.

The tests seem to be endless, I try to run as much as I can in the manual way. But even we could do it automatically, we still can't aford the needed resources and time.

This is an issue I haven't figue out how to solve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests

server/schedulers/evict_slow_trend.go Show resolved Hide resolved
server/schedulers/evict_slow_trend.go Show resolved Hide resolved
}

affectedStoreThreshold := (len(stores)+1)/3 + 1
if affectedStoreCount < affectedStoreThreshold {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why judge it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an event just affect minor amount of the cluster, we could tolerate that.
The better way to judge this event is toleratable is to check the cluster's QPS, for that we need to collect info from all TiDB instances so it's far complicated(for example, if one TiDB's QPS drop, but other TiDB's QPS are normal).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TiDB QPS?TiKV QPS?

Copy link
Contributor Author

@innerr innerr Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhy1024 here we check the TiKVs' QPS (gRPC QPS), but the better way is to check the TiDBs' QPS (too complicated for now)

}

slowStore := cluster.GetStore(slowStoreID)
if !candFreshCaptured && checkStoreFasterThanOthers(cluster, slowStore) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we replace chooseEvictCandidate and checkStoreFasterThanOthers with sorting slow score of store?

Copy link
Contributor Author

@innerr innerr Feb 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting (and picking the slowest one) is not good enough, we also need to make sure the slowest store is much way slower than the others (by rate mainly, so workload-tilting could be handled).

For that, SPOT is a good choose for a big cluster having many TiKVs, but not good for small-middle size clusters.
For small-middle size clusters, calculating and comparing the std-ev and setup a threshold (as in demo stage of this PR) will be not bad.
But for safty, not just both cause rate and result rate need to be checked, the real values of result(that means the TiKV's current actual gRPC-QPS) also need to be checked.

What we put in here now is a simplified judgement, it works well in tests, should be improved someday later.

if len(stores) <= 1 {
return false
}
expected := (len(stores) + 1) / 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why judge it?

Copy link
Contributor Author

@innerr innerr Feb 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as the above one (line 247 about chooseEvictCandidate and checkStoreFasterThanOthers).

Signed-off-by: Liu Cong <[email protected]>
@codecov
Copy link

codecov bot commented Feb 6, 2023

Codecov Report

Base: 75.47% // Head: 74.91% // Decreases project coverage by -0.57% ⚠️

Coverage data is based on head (1a6e948) compared to base (e14a985).
Patch coverage: 7.01% of modified lines in pull request are covered.

❗ Current head 1a6e948 differs from pull request most recent head 57e5773. Consider uploading reports for the commit 57e5773 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5808      +/-   ##
==========================================
- Coverage   75.47%   74.91%   -0.57%     
==========================================
  Files         346      347       +1     
  Lines       35184    35442     +258     
==========================================
- Hits        26555    26550       -5     
- Misses       6335     6592     +257     
- Partials     2294     2300       +6     
Flag Coverage Δ
unittests 74.91% <7.01%> (-0.57%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/core/basic_cluster.go 87.50% <0.00%> (-5.84%) ⬇️
pkg/core/store_option.go 95.40% <0.00%> (-4.60%) ⬇️
server/api/scheduler.go 40.48% <0.00%> (-0.81%) ⬇️
server/cluster/cluster.go 82.22% <0.00%> (+0.37%) ⬆️
server/handler.go 52.78% <0.00%> (-0.11%) ⬇️
server/schedule/filter/counter.go 87.50% <ø> (ø)
server/schedulers/evict_slow_trend.go 0.00% <0.00%> (ø)
server/statistics/store_collection.go 90.96% <16.66%> (-3.42%) ⬇️
pkg/core/store.go 77.85% <27.77%> (-3.52%) ⬇️
server/schedule/filter/filters.go 87.29% <50.00%> (-0.80%) ⬇️
... and 45 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

server/api/store.go Show resolved Hide resolved
@@ -27,7 +27,7 @@ func TestString(t *testing.T) {
expected string
}{
{int(storeStateTombstone), "store-state-tombstone-filter"},
{int(filtersLen - 1), "store-state-reject-leader-filter"},
{int(filtersLen - 1), "store-state-slow-trend-filter"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why modify it rather than add a new one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the last one in the const filter list, store-state-slow-trend-filter is append to the list, that changed the last one item from reject... to slow...

EvictSlowTrendType = "evict-slow-trend"
)

func init() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to move it to init.go after #5934

@@ -780,6 +780,10 @@ type ScheduleConfig struct {

// EnableWitness is the option to enable using witness
EnableWitness bool `toml:"enable-witness" json:"enable-witness,string"`

// SlowStoreEvictingAffectedStoreRatioThreshold is the affected ratio threshold when judging a store is slow
// A store's slowness must affected more than `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` to trigger evicting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A store's slowness must affected more than `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` to trigger evicting.
// A store's slowness must exceed `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` stores in the cluster to trigger evicting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reversed, I think the original one maybe more accurate, for the reason as the below one comment.

@@ -832,6 +836,8 @@ const (
defaultHotRegionsReservedDays = 7
// It means we skip the preparing stage after the 48 hours no matter if the store has finished preparing stage.
defaultMaxStorePreparingTime = 48 * time.Hour
// When a slow store affected more than 30% of total stores, it will trigger evicting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// When a slow store affected more than 30% of total stores, it will trigger evicting.
// When a store's slowness exceeds 30% of total stores, it will trigger evicting.

Copy link
Contributor Author

@innerr innerr Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use affected would be more accurate because:
If a store become slow (the latency still faster than 30% stores), but the become slow event affected more than 30% (30%+ stores' QPS dropped), that we also count the store as slow

Copy link
Contributor

@LykxSassinator LykxSassinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several minor comments. Rest LGTM

@lhy1024
Copy link
Contributor

lhy1024 commented Feb 8, 2023

ci still failed

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Feb 8, 2023
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
@innerr
Copy link
Contributor Author

innerr commented Feb 8, 2023

/merge

@ti-chi-bot
Copy link
Member

@innerr: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

@innerr: /merge is only allowed for the committers, you can assign this pull request to the committer in list by filling /assign @committer in the comment to help merge this pull request.

In response to this:

/merge

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@lhy1024
Copy link
Contributor

lhy1024 commented Feb 8, 2023

/merge

@ti-chi-bot
Copy link
Member

@lhy1024: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

@lhy1024: /merge in this pull request requires 2 approval(s).

In response to this:

/merge

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Feb 9, 2023
@nolouch
Copy link
Contributor

nolouch commented Feb 9, 2023

/merge

@ti-chi-bot
Copy link
Member

@nolouch: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 57e5773

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Feb 9, 2023
@ti-chi-bot ti-chi-bot merged commit 756e32c into tikv:master Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants