Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add cooldown protection plugin #2149

Merged
merged 1 commit into from
Jul 15, 2022

Conversation

flyhighzy
Copy link

resolve issue #2075

Add cooldown time support for preempt action:

  1. provide a new label/annotation named "volcano.sh/preempt-stable-time", whose value means the cool down time for preempt with unit second. This label/annotation can be set for entire vcjob or some dedicated tasks, if set to job, we'll transfer to all tasks' pods.
  2. add a plugin to participate in preempt action, ensure pods whose scheduled time after now - preempt-stable-time will be not in the result victims list

@volcano-sh-bot
Copy link
Contributor

Welcome @flyhighzy!

It looks like this is your first PR to volcano-sh/volcano 馃帀.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 6, 2022
@flyhighzy flyhighzy force-pushed the preempt-stable-time branch from 127c44a to 170234b Compare April 6, 2022 12:18
@flyhighzy
Copy link
Author

/assign @shinytang6

@flyhighzy
Copy link
Author

add volcano.sh/api pr #68, please have a review

@Thor-wl
Copy link
Contributor

Thor-wl commented Apr 15, 2022

Please update the vendor locally.

@flyhighzy
Copy link
Author

flyhighzy commented Apr 25, 2022

thanks for the advice and please have a review :)
since this pr depends on pr of. volcano.sh/api, please also review api's pull request as well
@hudson741 @hzxuzhonghu @wpeng102 @shinytang6

@Thor-wl Thor-wl requested review from william-wang, shinytang6 and k82cn and removed request for hudson741 and hzxuzhonghu April 25, 2022 08:18
@flyhighzy flyhighzy force-pushed the preempt-stable-time branch from dad0c8e to 0f78a7b Compare April 25, 2022 09:28
@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 25, 2022
@flyhighzy flyhighzy force-pushed the preempt-stable-time branch from 0f78a7b to 44ad4b3 Compare April 25, 2022 09:35
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 25, 2022
Copy link
Member

@shinytang6 shinytang6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally lgtm, can we have a simple doc to claim how to use this plugin?
btw, it seems that this pr contains some api changes that causing ci failed )

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 26, 2022
@shinytang6
Copy link
Member

pls help resolve the conflicts.

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 28, 2022
@flyhighzy
Copy link
Author

pls help resolve the conflicts.

updated

@Thor-wl Thor-wl self-requested a review July 8, 2022 09:12
#### Edit yaml of vcjob

1. add annotations in volcano job in format below.
1. `preemptable` annotation(or label) indicates that job or task is preemptable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, annotation is better for there exists preemption annotaion volcano.sh/preemptable: "true".

Copy link
Contributor

@Thor-wl Thor-wl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! It's generally LGTM for me now.
/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 8, 2022
- name: priority
- name: gang
- name: conformance
- name: stablepreempt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly suggest change another plugin name instead of stablepreempt.
It seems current preempt is not stable without this plugin:)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "preempt-protected"?
or any other suggestions for the new plugin name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's name it cdt plugin which means cool down time or cdp which means cool down period. It can be used not only in preemption but also in reclaim as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@william-wang @flyhighzy I don't recommend using this name. There may be multiple ways to cool down victim tasks, not only by time interval. The name limits the functionality of the plugin and makes the functionality of this plugin slightly thin. How about vcp which means victim cooldown protection or ecp which means evicted cooldown protection.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jason-Liu-Dream I noticed your issue, but maybe you need to put issue in the right repo :). I think it sounds reasonable, but may need to discuss more details.

So what's your opinions? @william-wang @Thor-wl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jason-Liu-Dream I noticed your issue, but maybe you need to put issue in the right repo :). I think it sounds reasonable, but may need to discuss more details.

this issue is put here because our inner version implements a demo of this function, we want to combine your design, and retry to design our implements.

Copy link

@Jason-Liu-Dream Jason-Liu-Dream Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jason-Liu-Dream Would you clariy some other vailid cool down ways besides the time/period?

@william-wang We think cool down reason maybe need to consider time periods, eviction times, pod height load pressure, block messages, message events and so on and different business scenarios have different requirements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jason-Liu-Dream Thanks for the explaining the different cases for the cool down operation. If possible you can elaborate on pod height load pressure, block messages, message events.

@flyhighzy I think we can refine the name to clp which means cool down protection. It can cover most of use cases so far :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I think so :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jason-Liu-Dream Thanks for the explaining the different cases for the cool down operation. If possible you can elaborate on pod height load pressure, block messages, message events.
@william-wang
pod height load pressure: Sometimes we need to preempt some pods, but we don't want to preempt it when this pod is at height load pressure, height load pressure means a large scope of influence after stopping service in this pod. And we don't know how long this will last. so cooldown time does not work.
block messages: similar to above, sometimes message block in queue means there is a lot of stress to deal with in this service. Preempting is not a good idea.
message events: In batch tasks, we can record progress. Assuming that a task has N stages, and each stage supports breakpoints to continue working, then the best plan is to complete the goal of the previous stage, send a message event, and then let this task be preempted so that not cause excessive waste of resources. This is very common in pipeline jobs such as video transcoding.


```yaml
volcano.sh/preemptable: "true"
volcano.sh/preempt-stable-time: "600s"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you described in the background, i think volcano.sh/cooldown-time is more strightforward.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job label changes already discussed in pr#68, so we still need to change this label name again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flyhighzy No worry, just change it and let's do things right the first time :)

// PluginName is name of plugin
PluginName = "stablepreempt"
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest add a comment here to describe this new plugin.

@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 12, 2022
Copy link
Contributor

@Thor-wl Thor-wl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally OK about the modifications. Please rebase all the commits into one and push again.

@flyhighzy flyhighzy force-pushed the preempt-stable-time branch 2 times, most recently from 5b713ea to baaf855 Compare July 14, 2022 02:44
@flyhighzy
Copy link
Author

I've finished rebase all the commits into one and rename plugin name to "cdp", pls take a review again~
@Thor-wl @william-wang

Signed-off-by: Zhao Ying <[email protected]>

update preempt_stable_time label value to time duration

Signed-off-by: Zhao Ying <[email protected]>

update local vendor and api dependency

Signed-off-by: Zhao Ying <[email protected]>

re-generate configs and add user guid

Signed-off-by: Zhao Ying <[email protected]>

rename stablepreempt plugin to cdt

Signed-off-by: Zhao Ying <[email protected]>
@flyhighzy flyhighzy force-pushed the preempt-stable-time branch from baaf855 to 0d0c061 Compare July 14, 2022 02:55
@william-wang william-wang changed the title add stable preempt plugin add cooldown protection plugin Jul 15, 2022
Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me.
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2022
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 15, 2022
@volcano-sh-bot volcano-sh-bot merged commit 9def57e into volcano-sh:master Jul 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants