scheduler: new slow store detecting and leader evicting #5808
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
@LykxSassinator All addressed, PTAL
Rest LGTM
Signed-off-by: Liu Cong <[email protected]>
Force-pushed from a4935ef to 4e378bf
@@ -265,6 +280,8 @@ type StoreSetController interface {
	SlowStoreEvicted(id uint64) error
When will we mark the old evict-slow-store-scheduler related functions as deprecated?
When it's proven no longer in use.
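If and when that happens, the standard Go convention would apply. A hypothetical sketch of how the old method could be flagged (the Deprecated note below is not part of this PR):

```go
package core

// StoreSetController is trimmed here to the method under discussion.
type StoreSetController interface {
	// Deprecated: superseded by the evict-slow-trend scheduler; to be removed
	// once the old evict-slow-store scheduler is proven no longer in use.
	// (Hypothetical annotation, not present in this PR.)
	SlowStoreEvicted(id uint64) error
}
```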
// See the License for the specific language governing permissions and
// limitations under the License.

package schedulers
Do we need to add some tests, similar to TestEvictSlowStoreTestSuite?
It's a tough issue. We don't have enough tests.
For UT, tests can only make sure the routine procedure code works or not (which is also important, and I had missed that).
But the real deal is the variety of inputs. Some might have a chance to cause a false alarm, and some should cause an event but might be wrongly tolerated.
The tests seem endless; I try to run as many as I can manually. But even if we could do it automatically, we still couldn't afford the needed resources and time.
This is an issue I haven't figured out how to solve.
Added tests
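For reference, a minimal sketch of how such a suite might be shaped, loosely modeled on TestEvictSlowStoreTestSuite; the suite name and the outlined helpers are hypothetical stand-ins, not PD's actual test utilities:

```go
package schedulers

import (
	"testing"

	"github.com/stretchr/testify/suite"
)

// evictSlowTrendTestSuite is a sketch only; a real suite would wire up a mock
// cluster and the evict-slow-trend scheduler in SetupTest.
type evictSlowTrendTestSuite struct {
	suite.Suite
}

func TestEvictSlowTrendTestSuite(t *testing.T) {
	suite.Run(t, new(evictSlowTrendTestSuite))
}

func (s *evictSlowTrendTestSuite) TestEvictOnSlowTrend() {
	// Outline (the helpers below are illustrative, not PD's real API):
	//   cluster := newMockClusterWithStores(3)  // three healthy stores
	//   setStoreSlowTrend(cluster, 1)           // make store 1 trend slow
	//   ops := scheduleOnce(cluster)            // run the scheduler once
	//   s.NotEmpty(ops)                         // expect evict-leader ops
	s.T().Skip("sketch only: replace the outline with real mock-cluster wiring")
}
```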
}

affectedStoreThreshold := (len(stores)+1)/3 + 1
if affectedStoreCount < affectedStoreThreshold {
Why judge it?
If an event affects only a minor part of the cluster, we can tolerate it.
A better way to judge whether the event is tolerable is to check the cluster's QPS, but for that we would need to collect info from all TiDB instances, which is far more complicated (for example, one TiDB's QPS may drop while the other TiDBs' QPS stay normal).
TiDB QPS? TiKV QPS?
@lhy1024 Here we check the TiKVs' QPS (gRPC QPS), but the better way would be to check the TiDBs' QPS (too complicated for now).
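To make the tolerance concrete, here is the arithmetic of the hunk above for a few cluster sizes (a runnable restatement of the quoted code, nothing beyond it):

```go
package main

import "fmt"

func main() {
	// Restating the threshold from the hunk above: an event must affect
	// roughly a third of the cluster (plus one store) before the scheduler
	// treats it as evict-worthy rather than tolerable noise.
	for _, storeCount := range []int{3, 5, 9, 15} {
		threshold := (storeCount+1)/3 + 1 // integer division, as in the hunk
		fmt.Printf("%2d stores -> eviction considered only when >= %d stores are affected\n",
			storeCount, threshold)
	}
}
```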
}

slowStore := cluster.GetStore(slowStoreID)
if !candFreshCaptured && checkStoreFasterThanOthers(cluster, slowStore) {
Can we replace chooseEvictCandidate and checkStoreFasterThanOthers with sorting stores by slow score?
Sorting (and picking the slowest one) is not good enough; we also need to make sure the slowest store is way slower than the others (mainly by rate, so workload tilting can be handled).
For that, SPOT is a good choice for a big cluster with many TiKVs, but not good for small-to-middle-size clusters.
For small-to-middle-size clusters, calculating and comparing the standard deviation and setting up a threshold (as in the demo stage of this PR) would not be bad.
But for safety, not just the cause rate and result rate need to be checked; the real value of result (that is, the TiKV's current actual gRPC QPS) also needs to be checked.
What we put in here now is a simplified judgement. It works well in tests and should be improved someday later.
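A minimal sketch of the standard-deviation idea described above, assuming a higher slow score means a slower store; the function name, the threshold parameter k, and the score inputs are all illustrative, not the PR's actual implementation:

```go
package schedulers

import "math"

// isOutlierSlow reports whether candidate stands out from the other stores'
// slow scores by more than k standard deviations. This sketches the
// "slowest must be much slower than the rest" check discussed above.
func isOutlierSlow(otherScores []float64, candidate, k float64) bool {
	if len(otherScores) < 2 {
		return false // too few peers to estimate a spread
	}
	var sum float64
	for _, s := range otherScores {
		sum += s
	}
	mean := sum / float64(len(otherScores))
	var variance float64
	for _, s := range otherScores {
		variance += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(variance / float64(len(otherScores)))
	// Only evict when the candidate is clearly beyond the peers' spread,
	// so a mere workload tilt does not trigger a false alarm.
	return candidate > mean+k*stddev
}
```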
if len(stores) <= 1 {
	return false
}
expected := (len(stores) + 1) / 2
Why judge it?
Same reason as the above one (line 247, about chooseEvictCandidate and checkStoreFasterThanOthers).
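A sketch of the majority rule the quoted hunk encodes; fasterThanMajority and the score inputs are illustrative, and the assumption that a lower slow score means a faster store is mine, not the PR's:

```go
package schedulers

// fasterThanMajority reports whether a target store beats a simple majority
// of its peers. expected mirrors the quoted hunk: ceil(n/2) via integer math.
func fasterThanMajority(targetScore float64, peerScores []float64) bool {
	if len(peerScores) <= 1 {
		return false // as in the hunk: too few peers to compare against
	}
	expected := (len(peerScores) + 1) / 2
	faster := 0
	for _, peer := range peerScores {
		if targetScore < peer { // assumption: lower slow score = faster
			faster++
		}
	}
	return faster >= expected
}
```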
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
Force-pushed from 7cb7d69 to 5054b39
Signed-off-by: Liu Cong <[email protected]>
Codecov Report: Base: 75.47% // Head: 74.91% // Decreases project coverage by -0.57%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #5808 +/- ##
==========================================
- Coverage 75.47% 74.91% -0.57%
==========================================
Files 346 347 +1
Lines 35184 35442 +258
==========================================
- Hits 26555 26550 -5
- Misses 6335 6592 +257
- Partials 2294 2300 +6
☔ View full report at Codecov.
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
@@ -27,7 +27,7 @@ func TestString(t *testing.T) {
	expected string
}{
	{int(storeStateTombstone), "store-state-tombstone-filter"},
-	{int(filtersLen - 1), "store-state-reject-leader-filter"},
+	{int(filtersLen - 1), "store-state-slow-trend-filter"},
Why modify it rather than add a new one?
This is about the last one in the const filter list: store-state-slow-trend-filter is appended to the list, which changed the last item from store-state-reject-leader-filter to store-state-slow-trend-filter.
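A self-contained miniature of that explanation (the const block below is illustrative, not the real filter list):

```go
package main

import "fmt"

// Appending a new filter type just before filtersLen shifts which name sits
// at index filtersLen-1, which is exactly what the test asserts on.
const (
	storeStateTombstone = iota
	storeStateRejectLeader
	storeStateSlowTrend // newly appended by this PR
	filtersLen
)

var filterNames = [filtersLen]string{
	storeStateTombstone:    "store-state-tombstone-filter",
	storeStateRejectLeader: "store-state-reject-leader-filter",
	storeStateSlowTrend:    "store-state-slow-trend-filter",
}

func main() {
	// Before this PR the last slot held the reject-leader filter; now it is
	// the slow-trend filter, hence the one-line change in TestString.
	fmt.Println(filterNames[filtersLen-1]) // store-state-slow-trend-filter
}
```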
EvictSlowTrendType = "evict-slow-trend"
)

func init() {
Need to move it to init.go after #5934.
@@ -780,6 +780,10 @@ type ScheduleConfig struct {
	// EnableWitness is the option to enable using witness
	EnableWitness bool `toml:"enable-witness" json:"enable-witness,string"`

	// SlowStoreEvictingAffectedStoreRatioThreshold is the affected ratio threshold when judging a store is slow
	// A store's slowness must affected more than `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` to trigger evicting.
Suggested change:
- // A store's slowness must affected more than `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` to trigger evicting.
+ // A store's slowness must exceed `store-count * SlowStoreEvictingAffectedStoreRatioThreshold` stores in the cluster to trigger evicting.
Addressed
Reverted; I think the original one may be more accurate, for the reason given in the comment below.
@@ -832,6 +836,8 @@ const (
	defaultHotRegionsReservedDays = 7
	// It means we skip the preparing stage after the 48 hours no matter if the store has finished preparing stage.
	defaultMaxStorePreparingTime = 48 * time.Hour
	// When a slow store affected more than 30% of total stores, it will trigger evicting.
Suggested change:
- // When a slow store affected more than 30% of total stores, it will trigger evicting.
+ // When a store's slowness exceeds 30% of total stores, it will trigger evicting.
Using affected would be more accurate because: if a store becomes slow (its latency still faster than 30% of the stores), but the become-slow event affected more than 30% of the cluster (30%+ of stores' QPS dropped), then we also count that store as slow.
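For concreteness, a sketch of how the 30% default might translate into a store count; the constant's name mirrors the diff, while the helper is hypothetical:

```go
package main

import "fmt"

// Default ratio mirrored from the diff above.
const defaultSlowStoreEvictingAffectedStoreRatioThreshold = 0.3

// affectedThreshold is a hypothetical helper: eviction triggers only when a
// slow store has affected more than storeCount * ratio stores in the cluster.
func affectedThreshold(storeCount int, ratio float64) float64 {
	return float64(storeCount) * ratio
}

func main() {
	for _, n := range []int{5, 10, 20} {
		fmt.Printf("%2d stores -> need more than %.1f affected stores to evict\n",
			n, affectedThreshold(n, defaultSlowStoreEvictingAffectedStoreRatioThreshold))
	}
}
```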
Several minor comments. Rest LGTM
Signed-off-by: Liu Cong <[email protected]>
Signed-off-by: Liu Cong <[email protected]>
CI still failed.
Signed-off-by: Liu Cong <[email protected]>
Force-pushed from 18a82be to 3fd765c
Signed-off-by: Liu Cong <[email protected]>
/merge
@innerr: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/merge
@lhy1024: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
/merge
@nolouch: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests
This pull request has been accepted and is ready to merge. Commit hash: 57e5773
What problem does this PR solve?
This PR is part of tikv/tikv#14000; the details are in that PR to avoid redundancy.
Issue Number: ref #5916
What is changed and how does it work?
Same as above: this PR is part of tikv/tikv#14000; the details are in that PR to avoid redundancy.
Check List
Code changes
Side effects
Related changes
Release note