add cooldown protection plugin #2149
Welcome @flyhighzy!
127c44a to 170234b
/assign @shinytang6
Added volcano.sh/api PR #68; please review.
Please update the vendor locally.
Thanks for the advice; please review again :)
dad0c8e to 0f78a7b
0f78a7b to 44ad4b3
Generally LGTM. Can we have a simple doc explaining how to use this plugin?
BTW, it seems this PR contains some API changes that cause CI to fail.
Please help resolve the conflicts.
Updated.
#### Edit yaml of vcjob

1. add annotations in volcano job in format below.
    1. `preemptable` annotation(or label) indicates that job or task is preemptable
IMO, an annotation is better, since the preemption annotation `volcano.sh/preemptable: "true"` already exists.
Thanks for your contribution! It's generally LGTM for me now.
/lgtm
- name: priority
- name: gang
- name: conformance
- name: stablepreempt
I strongly suggest choosing another plugin name instead of `stablepreempt`.
It suggests that the current preempt is not stable without this plugin :)
How about "preempt-protected"?
or any other suggestions for the new plugin name?
Let's name it the `cdt` plugin, which means cool down time, or `cdp`, which means cool down period. It can be used not only in preemption but also in reclaim.
@william-wang @flyhighzy I don't recommend using this name. There may be multiple ways to cool down victim tasks, not only by time interval. The name limits the functionality of the plugin and makes it look slightly thin. How about `vcp`, which means victim cooldown protection, or `ecp`, which means evicted cooldown protection?
@Jason-Liu-Dream I noticed your issue, but you may need to file it in the right repo :). I think it sounds reasonable, but we may need to discuss more details.
So what are your opinions? @william-wang @Thor-wl
> @Jason-Liu-Dream I noticed your issue, but maybe you need to put issue in the right repo :). I think it sounds reasonable, but may need to discuss more details.

The issue was filed here because our internal version implements a demo of this function. We want to incorporate your design and rework our implementation.
@Jason-Liu-Dream Would you clarify some other valid cool down approaches besides the time/period?
@william-wang We think the cool down criteria may need to consider time periods, eviction counts, pod high load pressure, blocked messages, message events and so on, and different business scenarios have different requirements.
@Jason-Liu-Dream Thanks for explaining the different cases for the cool down operation. If possible, could you elaborate on pod high load pressure, blocked messages, and message events?
@flyhighzy I think we can refine the name to `clp`, which means cool down protection. It can cover most of the use cases so far :)
sure, I think so :)
> @Jason-Liu-Dream Thanks for the explaining the different cases for the cool down operation. If possible you can elaborate on pod height load pressure, block messages, message events.

@william-wang
pod high load pressure: Sometimes we need to preempt some pods, but we don't want to preempt a pod while it is under high load pressure. High load pressure means that stopping the service in this pod would have a wide impact. We don't know how long this will last, so a cooldown time does not work.
blocked messages: Similar to the above. Messages blocked in a queue mean the service is under heavy load, so preempting it is not a good idea.
message events: In batch tasks, we can record progress. Assuming that a task has N stages and each stage supports resuming from a checkpoint, the best plan is to complete the goal of the previous stage, send a message event, and then let the task be preempted, so that resources are not wasted excessively. This is very common in pipeline jobs such as video transcoding.

```yaml
volcano.sh/preemptable: "true"
volcano.sh/preempt-stable-time: "600s"
```
As you described in the background, I think `volcano.sh/cooldown-time` is more straightforward.
The job label changes were already discussed in PR #68, so do we still need to change this label name again?
@flyhighzy No worry, just change it and let's do things right the first time :)
// PluginName is name of plugin
PluginName = "stablepreempt"
)
I suggest adding a comment here to describe this new plugin.
I'm generally OK about the modifications. Please rebase all the commits into one and push again.
5b713ea to baaf855
I've finished rebasing all the commits into one and renamed the plugin to "cdp". Please review again~
Signed-off-by: Zhao Ying <[email protected]>
update preempt_stable_time label value to time duration
Signed-off-by: Zhao Ying <[email protected]>
update local vendor and api dependency
Signed-off-by: Zhao Ying <[email protected]>
re-generate configs and add user guide
Signed-off-by: Zhao Ying <[email protected]>
rename stablepreempt plugin to cdt
Signed-off-by: Zhao Ying <[email protected]>
baaf855 to 0d0c061
Overall, looks good to me.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: william-wang
Resolves issue #2075.
Add cooldown time support for the preempt action: pods whose start time is later than `now - preempt-stable-time` will not be in the resulting victims list.