scheduler: new slow store detecting and leader evicting #5808
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
@LykxSassinator All addressed, PTAL
Rest LGTM
Signed-off-by: Liu Cong <[email protected]>
Force-pushed from a4935ef to 4e378bf
@@ -265,6 +280,8 @@ type StoreSetController interface {
	SlowStoreEvicted(id uint64) error
When will we mark the old evict-slow-store-scheduler related functions as deprecated?
When it's proven no longer in use.
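If and when that happens, the standard Go convention would apply. A hypothetical sketch of how the old method could be flagged (the Deprecated note below is not part of this PR):

```go
package core

// StoreSetController is trimmed here to the method under discussion.
type StoreSetController interface {
	// Deprecated: superseded by the evict-slow-trend scheduler; to be removed
	// once the old evict-slow-store scheduler is proven no longer in use.
	// (Hypothetical annotation, not present in this PR.)
	SlowStoreEvicted(id uint64) error
}
```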
// See the License for the specific language governing permissions and
// limitations under the License.

package schedulers
Do we need to add some tests, similar to TestEvictSlowStoreTestSuite?
It's a tough issue. We don't have enough tests.
For UT, tests can only make sure the routine procedure code works or not (which is also important, and I had missed that).
But the real deal is the variety of inputs. Some might have a chance to cause a false alarm, and some should cause an event but might be wrongly tolerated.
The tests seem endless; I try to run as many as I can manually. But even if we could do it automatically, we still couldn't afford the needed resources and time.
This is an issue I haven't figured out how to solve.
Added tests
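For reference, a minimal sketch of how such a suite might be shaped, loosely modeled on TestEvictSlowStoreTestSuite; the suite name and the outlined helpers are hypothetical stand-ins, not PD's actual test utilities:

```go
package schedulers

import (
	"testing"

	"github.com/stretchr/testify/suite"
)

// evictSlowTrendTestSuite is a sketch only; a real suite would wire up a mock
// cluster and the evict-slow-trend scheduler in SetupTest.
type evictSlowTrendTestSuite struct {
	suite.Suite
}

func TestEvictSlowTrendTestSuite(t *testing.T) {
	suite.Run(t, new(evictSlowTrendTestSuite))
}

func (s *evictSlowTrendTestSuite) TestEvictOnSlowTrend() {
	// Outline (the helpers below are illustrative, not PD's real API):
	//   cluster := newMockClusterWithStores(3)  // three healthy stores
	//   setStoreSlowTrend(cluster, 1)           // make store 1 trend slow
	//   ops := scheduleOnce(cluster)            // run the scheduler once
	//   s.NotEmpty(ops)                         // expect evict-leader ops
	s.T().Skip("sketch only: replace the outline with real mock-cluster wiring")
}
```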
}

affectedStoreThreshold := (len(stores)+1)/3 + 1
if affectedStoreCount < affectedStoreThreshold {
Why judge it?
If an event affects only a minor part of the cluster, we can tolerate it.
A better way to judge whether the event is tolerable is to check the cluster's QPS, but for that we would need to collect info from all TiDB instances, which is far more complicated (for example, one TiDB's QPS may drop while the other TiDBs' QPS stay normal).
TiDB QPS? TiKV QPS?
@lhy1024 Here we check the TiKVs' QPS (gRPC QPS), but the better way would be to check the TiDBs' QPS (too complicated for now).
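To make the tolerance concrete, here is the arithmetic of the hunk above for a few cluster sizes (a runnable restatement of the quoted code, nothing beyond it):

```go
package main

import "fmt"

func main() {
	// Restating the threshold from the hunk above: an event must affect
	// roughly a third of the cluster (plus one store) before the scheduler
	// treats it as evict-worthy rather than tolerable noise.
	for _, storeCount := range []int{3, 5, 9, 15} {
		threshold := (storeCount+1)/3 + 1 // integer division, as in the hunk
		fmt.Printf("%2d stores -> eviction considered only when >= %d stores are affected\n",
			storeCount, threshold)
	}
}
```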
}

slowStore := cluster.GetStore(slowStoreID)
if !candFreshCaptured && checkStoreFasterThanOthers(cluster, slowStore) {
Can we replace chooseEvictCandidate and checkStoreFasterThanOthers with sorting stores by slow score?
Sorting (and picking the slowest one) is not good enough; we also need to make sure the slowest store is way slower than the others (mainly by rate, so workload tilting can be handled).
For that, SPOT is a good choice for a big cluster with many TiKVs, but not good for small-to-middle-size clusters.
For small-to-middle-size clusters, calculating and comparing the standard deviation and setting up a threshold (as in the demo stage of this PR) would not be bad.
But for safety, not just the cause rate and result rate need to be checked; the real value of result (that is, the TiKV's current actual gRPC QPS) also needs to be checked.
What we put in here now is a simplified judgement. It works well in tests and should be improved someday later.
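A minimal sketch of the standard-deviation idea described above, assuming a higher slow score means a slower store; the function name, the threshold parameter k, and the score inputs are all illustrative, not the PR's actual implementation:

```go
package schedulers

import "math"

// isOutlierSlow reports whether candidate stands out from the other stores'
// slow scores by more than k standard deviations. This sketches the
// "slowest must be much slower than the rest" check discussed above.
func isOutlierSlow(otherScores []float64, candidate, k float64) bool {
	if len(otherScores) < 2 {
		return false // too few peers to estimate a spread
	}
	var sum float64
	for _, s := range otherScores {
		sum += s
	}
	mean := sum / float64(len(otherScores))
	var variance float64
	for _, s := range otherScores {
		variance += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(variance / float64(len(otherScores)))
	// Only evict when the candidate is clearly beyond the peers' spread,
	// so a mere workload tilt does not trigger a false alarm.
	return candidate > mean+k*stddev
}
```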
if len(stores) <= 1 {
	return false
}
expected := (len(stores) + 1) / 2
Why judge it?
Same reason as the above one (line 247, about chooseEvictCandidate and checkStoreFasterThanOthers).
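A sketch of the majority rule the quoted hunk encodes; fasterThanMajority and the score inputs are illustrative, and the assumption that a lower slow score means a faster store is mine, not the PR's:

```go
package schedulers

// fasterThanMajority reports whether a target store beats a simple majority
// of its peers. expected mirrors the quoted hunk: ceil(n/2) via integer math.
func fasterThanMajority(targetScore float64, peerScores []float64) bool {
	if len(peerScores) <= 1 {
		return false // as in the hunk: too few peers to compare against
	}
	expected := (len(peerScores) + 1) / 2
	faster := 0
	for _, peer := range peerScores {
		if targetScore < peer { // assumption: lower slow score = faster
			faster++
		}
	}
	return faster >= expected
}
```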
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Force-pushed from 7cb7d69 to 5054b39
Signed-off-by: Liu Cong <[email protected]>
Codecov Report: Base: 75.47% // Head: 74.91% // Decreases project coverage by -0.57%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #5808 +/- ##
==========================================
- Coverage 75.47% 74.91% -0.57%
==========================================
Files 346 347 +1
Lines 35184 35442 +258
==========================================
- Hits 26555 26550 -5
- Misses 6335 6592 +257
- Partials 2294 2300 +6
☔ View full report at Codecov.
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
@@ -27,7 +27,7 @@ func TestString(t *testing.T) {
	expected string
}{
	{int(storeStateTombstone), "store-state-tombstone-filter"},
-	{int(filtersLen - 1), "store-state-reject-leader-filter"},
+	{int(filtersLen - 1), "store-state-slow-trend-filter"},
Why modify it rather than add a new one?
This is about the last one in the const filter list: store-state-slow-trend-filter is appended to the list, which changed the last item from store-state-reject-leader-filter to store-state-slow-trend-filter.
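A self-contained miniature of that explanation (the const block below is illustrative, not the real filter list):

```go
package main

import "fmt"

// Appending a new filter type just before filtersLen shifts which name sits
// at index filtersLen-1, which is exactly what the test asserts on.
const (
	storeStateTombstone = iota
	storeStateRejectLeader
	storeStateSlowTrend // newly appended by this PR
	filtersLen
)

var filterNames = [filtersLen]string{
	storeStateTombstone:    "store-state-tombstone-filter",
	storeStateRejectLeader: "store-state-reject-leader-filter",
	storeStateSlowTrend:    "store-state-slow-trend-filter",
}

func main() {
	// Before this PR the last slot held the reject-leader filter; now it is
	// the slow-trend filter, hence the one-line change in TestString.
	fmt.Println(filterNames[filtersLen-1]) // store-state-slow-trend-filter
}
```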
EvictSlowTrendType = "evict-slow-trend"
)

func init() {
Need to move it to init.go after #5934.
@@ -780,6 +780,10 @@ type ScheduleConfig struct {
	// EnableWitness is the option to enable using witness
	EnableWitness bool `toml:"enable-witness" json:"enable-witness,string"`

	// SlowStoreEvictingAffectedStoreRatioThreshold is the affected ratio threshold when judging a store is slow
	// A store's slowness must affected more than `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` to trigger evicting.
Suggested change:
- // A store's slowness must affected more than `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` to trigger evicting.
+ // A store's slowness must exceed `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` stores in the cluster to trigger evicting.
Addressed
Reverted; I think the original one may be more accurate, for the reason given in the comment below.
@@ -832,6 +836,8 @@ const (
	defaultHotRegionsReservedDays = 7
	// It means we skip the preparing stage after the 48 hours no matter if the store has finished preparing stage.
	defaultMaxStorePreparingTime = 48 * time.Hour
	// When a slow store affected more than 30% of total stores, it will trigger evicting.
Suggested change:
- // When a slow store affected more than 30% of total stores, it will trigger evicting.
+ // When a store's slowness exceeds 30% of total stores, it will trigger evicting.
Using affected would be more accurate because: if a store becomes slow (its latency still faster than 30% of the stores), but the become-slow event affected more than 30% of the cluster (30%+ of stores' QPS dropped), then we also count that store as slow.
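For concreteness, a sketch of how the 30% default might translate into a store count; the constant's name mirrors the diff, while the helper is hypothetical:

```go
package main

import "fmt"

// Default ratio mirrored from the diff above.
const defaultSlowStoreEvictingAffectedStoreRatioThreshold = 0.3

// affectedThreshold is a hypothetical helper: eviction triggers only when a
// slow store has affected more than storeCount * ratio stores in the cluster.
func affectedThreshold(storeCount int, ratio float64) float64 {
	return float64(storeCount) * ratio
}

func main() {
	for _, n := range []int{5, 10, 20} {
		fmt.Printf("%2d stores -> need more than %.1f affected stores to evict\n",
			n, affectedThreshold(n, defaultSlowStoreEvictingAffectedStoreRatioThreshold))
	}
}
```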
Several minor comments. Rest LGTM
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
CI still failed.
Signed-off-by: Liu Cong <[email protected]>
Force-pushed from 18a82be to 3fd765c
Signed-off-by: Liu Cong <[email protected]>
/merge
@innerr: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/merge
@lhy1024: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
/merge
@nolouch: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
This pull request has been accepted and is ready to merge. Commit hash: 57e5773
What problem does this PR solve?
This PR is part of tikv/tikv#14000; the details are in that PR to avoid redundancy.
Issue Number: ref #5916
What is changed and how does it work?
Same as above: this PR is part of tikv/tikv#14000; the details are in that PR to avoid redundancy.
Check List
Code changes
Side effects
Related changes
Release note