-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP: Preemption Toleration Plugin #205
KEP: Preemption Toleration Plugin #205
Conversation
Welcome @everpeace! |
Hi @everpeace. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
599614e
to
63b20e7
Compare
system-critical: 10000 | ||
high: 9000 | ||
low: 8000 | ||
low-non-preempted: 8000 # with minimumPreemptablePriority=10000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If comparing this case with "low-non-preempted=10000 and PreemptionPolicy=Never", is the only difference that you don't want "low-non-preempted" pods to preempt pods with priority value ranging from 8000 to 10000?
If so, is that an alternative is to add another option to PreemptionPolicy (like "only preempt pods with priority value that's less than X")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If comparing this case with "low-non-preempted=10000 and PreemptionPolicy=Never", is the only difference that you don't want "low-non-preempted" pods to preempt pods with priority value ranging from 8000 to 10000?
Yes for preemption behavior. But the scheduling behavior will be different because the priority value affects the rank in the scheduler queue.
If so, is that an alternative is to add another option to PreemptionPolicy (like "only preempt pods with priority value that's less than X")?
Focusing on preemption behavior, yes, it could be. However, we also proposed tolerationSeconds
, which will be essentially the preemptee-side configuration described as "Guarantee to running at least N minutes even in lower priority" section. Thus, I think making them be preemptee-side configuration would be simpler here. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the tolerationSeconds fits more on the preemptee side.
scheduling.sigs.k8s.io/preemption-toleration: | | ||
# the api is versioned for future enhancement | ||
apiVersion: scheduling.sigs.k8s.io/v1beta1 | ||
kind: PreemptionToleration | ||
|
||
# This priority class can tolerate preemption by priority with 8000 <= p < 10000. | ||
minimumPreemptablePriority: 10000 | ||
|
||
# And it can tolerate preemption in 1 hour by the pod with priority (8000 <= p < 10000). | ||
tolerationSeconds: 3600 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to exact all the settings to a CustomResource object; otherwise, I don't think it's practical to versioning/defaulting/converting on the annotated string.
Or, you give up support multi-version toleration policies - i.e., define the settings as scheduling.sigs.k8s.io/preemption-toleration-XYZ, individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. If we choose CRD case, PriorityClass
would need to refer PreemptionToleration
CR by annotation or a similar way. Then, I would like to choose the annotation case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#### PostFilter | ||
|
||
This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'd need to add a stanza to describe how to honor the preemption tolerationSeconds setting, e.g., implement the controller logic to monitor/manage the pods with preemption tolerationSeconds settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will document where to how to modify the default preemption plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 8484681
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the details of PostFilter is awesome. But what I meant in the original comment was the "controller" logic - like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).
(A reference is that Kubernetes uses a nodelifecycle manager to guarantee the tolerationSeconds' semantics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).
preemptionTolerationSeconds
does not require any controller logic because the duration is defined as the duration from the time being scheduled of the victim candidate pod
. So, the scheduler plugin can determine the victim candidate can tolerate the preemption or not by itself like:
// in the PreemptionToleration plugin:
// check it can tolerate the preemption in terms of toleration seconds
_, scheduledCondition := podutil.GetPodCondition(&victimCandidate.Status, v1.PodScheduled)
if scheduledCondition == nil {
return canTolerateOnPriorityValue, nil
}
canTolerateOnTolerationSeconds := !now.After(
scheduledCondition.LastTransitionTime.Time.Add(time.Duration(*toleration.TolerationSeconds)*time.Second),
)
Is it better to explain this (we no need to implement a controller) in this section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved in another thread #205 (comment)
// MinimumPreemptablePriority specifies the minimum priority value that can preempt this priority class. | ||
MinimumPreemptablePriority *int32 `json:"minimumPreemptablePriority,omitempty"` | ||
|
||
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope my understanding is correct:
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes). | |
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. | |
// Nil or zero value means it's preemptable immediately (default). | |
// Non-nil value means the duration from the pod being scheduled to some node. | |
// This value affects scheduled pods only (no effect on nominated pods). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended to the same semantics with tolerationSeconds
in Pod:
By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.
Thus, I would like to fix it to be a more clear explanation like below. What do you think?
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes). | |
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. | |
// Nil value means forever, which is not preempted forever (default) | |
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all. | |
// This value affects scheduled pods only (no effect on nominated pods). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 316a022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Huang-Wei Thanks for your comment! I replied to your comments. PTAL 🙇
system-critical: 10000 | ||
high: 9000 | ||
low: 8000 | ||
low-non-preempted: 8000 # with minimumPreemptablePriority=10000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If comparing this case with "low-non-preempted=10000 and PreemptionPolicy=Never", is the only difference that you don't want "low-non-preempted" pods to preempt pods with priority value ranging from 8000 to 10000?
Yes for preemption behavior. But the scheduling behavior will be different because the priority value affects the rank in the scheduler queue.
If so, is that an alternative is to add another option to PreemptionPolicy (like "only preempt pods with priority value that's less than X")?
Focusing on preemption behavior, yes, it could be. However, we also proposed tolerationSeconds
, which will be essentially the preemptee-side configuration described as "Guarantee to running at least N minutes even in lower priority" section. Thus, I think making them be preemptee-side configuration would be simpler here. What do you think?
scheduling.sigs.k8s.io/preemption-toleration: | | ||
# the api is versioned for future enhancement | ||
apiVersion: scheduling.sigs.k8s.io/v1beta1 | ||
kind: PreemptionToleration | ||
|
||
# This priority class can tolerate preemption by priority with 8000 <= p < 10000. | ||
minimumPreemptablePriority: 10000 | ||
|
||
# And it can tolerate preemption in 1 hour by the pod with priority (8000 <= p < 10000). | ||
tolerationSeconds: 3600 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. If we choose CRD case, PriorityClass
would need to refer PreemptionToleration
CR by annotation or a similar way. Then, I would like to choose the annotation case.
#### PostFilter | ||
|
||
This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will document where to how to modify the default preemption plugin.
// MinimumPreemptablePriority specifies the minimum priority value that can preempt this priority class. | ||
MinimumPreemptablePriority *int32 `json:"minimumPreemptablePriority,omitempty"` | ||
|
||
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended to the same semantics with tolerationSeconds
in Pod:
By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.
Thus, I would like to fix it to be a more clear explanation like below. What do you think?
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes). | |
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. | |
// Nil value means forever, which is not preempted forever (default) | |
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all. | |
// This value affects scheduled pods only (no effect on nominated pods). |
b162e8d
to
316a022
Compare
|
||
### Lower priority value but not-being-preempted priority | ||
|
||
Generally speaking, making a job be resumable after preemption, the job would need to preserve its state to outside of the pod, e.g. persistent volume, object storage, etc., at preemption and restore it at next restart (a.k.a check-pointing). However, when a job uses some proprietary tools that are not modifiable or do not support suspend/resume feature, users would like to avoid being preempted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to outside of the pod
-> externally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in b8c0b87
system-critical: 10000 | ||
high: 9000 | ||
low: 8000 | ||
low-non-preempted: 8000 # with minimumPreemptablePriority=10000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the tolerationSeconds fits more on the preemptee side.
name: toleration-policy-sample | ||
annotation: | ||
# this key is needed to enable preemption toleration policy for distinguishing | ||
# between no toleration policy and empty toleration policy (all fields will be default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: replace tab with empty characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 8be0fa2
metadata: | ||
name: toleration-policy-sample | ||
annotation: | ||
# this key is needed to enable preemption toleration policy for distinguishing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... this is another shortage of annotations (vs. CRD). It's ok to start with annotations as an alpha feature, but once we graduate it to beta, do you have any plan to turn to CRD? - which may better manage the specs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok to start with annotations as an alpha feature, but once we graduate it to beta, do you have any plan to turn to CRD? - which may better manage the specs.
I would like to start with annotations because the policy spec is simple enough, and no controller logic is required. When graduating it to beta, we plan to transform it to CRD.
|
||
```go | ||
type PreemptionToleration struct { | ||
// MinimumPreemptablePriority specifies the minimum priority value that can preempt this priority class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe the default value/behavior if it's not set. (like, if not set, pods that have a higher priority value can preempt it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 8d7f4da
#### PostFilter | ||
|
||
This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the details of PostFilter is awesome. But what I meant in the original comment was the "controller" logic - like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).
(A reference is that Kubernetes uses a nodelifecycle manager to guarantee the tolerationSeconds' semantics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Huang-Wei Thanks for the review. I pushed some fixes and replied to your comments. PTAL 🙇
metadata: | ||
name: toleration-policy-sample | ||
annotation: | ||
# this key is needed to enable preemption toleration policy for distinguishing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok to start with annotations as an alpha feature, but once we graduate it to beta, do you have any plan to turn to CRD? - which may better manage the specs.
I would like to start with annotations because the policy spec is simple enough, and no controller logic is required. When graduating it to beta, we plan to transform it to CRD.
#### PostFilter | ||
|
||
This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).
preemptionTolerationSeconds
does not require any controller logic because the duration is defined as the duration from the time being scheduled of the victim candidate pod
. So, the scheduler plugin can determine the victim candidate can tolerate the preemption or not by itself like:
// in the PreemptionToleration plugin:
// check it can tolerate the preemption in terms of toleration seconds
_, scheduledCondition := podutil.GetPodCondition(&victimCandidate.Status, v1.PodScheduled)
if scheduledCondition == nil {
return canTolerateOnPriorityValue, nil
}
canTolerateOnTolerationSeconds := !now.After(
scheduledCondition.LastTransitionTime.Time.Add(time.Duration(*toleration.TolerationSeconds)*time.Second),
)
Is it better to explain this (we no need to implement a controller) in this section?
Re #205 (comment): (Github doesn't allow me to comment there) Oh, I misunderstood it. So it means upon a preemption attempt, the Pods that can be tolerated won't be preempted, also they won't be deleted if the toleration seconds have passed - which is different with Pod's tolerationSeconds where the Pod will be deleted forcibly when tolerationSeconds is used up. In your case, it's more a concept of "guaranteed minimum duration to run". These Pods can be only preempted when the time passed the minimum duration. Is my understanding correct? |
Yes, that's right!! |
Once I got approved, I would like to squash the commits so far. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final comments on defaulting. LGTM otherwise.
// Nil value means forever, which is not preempted forever (default) | ||
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should distinguish nil from zero. How about:
// It defaults to zero if not set. Zero value means the pod will be preempted immediately. i.e., no toleration at all.
And negative value will be caught during validation phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated #205 (comment):
TolerationSeconds: nil
: it can tolerate forever (perhaps this is a common use case, I think)TolerationSeconds: 0
: it can't tolerate at all (tolerate in 0 second).
So, how about this??
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// It defaults to Nil if not set. Nil value means the pod can tolerate the preemption forever. Zero value means the pod will be preempted immediately. i.e., no toleration at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in 671bf6f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@everpeace in terms of API design practices, it's a good practice to use a pointer for the internal-version struct (in pkg/apis/config/types.go); we do that for external versions (in pkg/apis/config/v1XYZ/types.go).
So when the versioned obj is wired in, we can tell if it's set to 0 or left as nil, and then default the value if it's nil (not set). In the internal processing that uses internal version objs, we don't need to worry the fields are set or not - default or user-specified value should have been set.
So the comment "It defaults to Nil if not set" is inaccurate - in internal versions, it should be a concrete value instead of pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Huang-Wei I see. Then, how about negative value means forever?? I would like to implement a policy that can tolerate forever in some way. How about below??
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// It defaults to zero if not set. Zero value means the pod will be preempted immediately. i.e., no toleration at all.
// Negative value means the pod can tolerate forever.
updated in b15a950
Actually, I would like to make tolerate forever be default thing. Would you have any idea about this?? I appreciate it if you give me some.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OR, could we make it just leaveNil
if not set (perhaps we can say no default value)?? how about this??
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// If not set, the pod can tolerate the preemption forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just leave
Nil
if not set
As I mentioned before, internal API object should be a concrect value (so nil value doesn't work, generally).
Your previous comment makes sense to me except for the negative value:
- if it's not set (nil), internally default it to 0 - which means it can be preempted immediately
- if it's set to a positive value - the duration will be honored
- if it's set to a negative value, you can make it semantically mean "don't preempt forever". But I'm not sure this is really what you want? What if some pods set this to a negative value and then the high-priority can never preempt them and pending there forever? It seems to break the semantics of Priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned before, internal API object should be a concrect value (so nil value doesn't work, generally).
oh... sorry... I was confused.
But I'm not sure this is really what you want? What if some pods set this to a negative value and then the high-priority can never preempt them and pending there forever? It seems to break the semantics of Priority.
minimumPreemptablePriority
and tolerationSeconds
will be evaluated in AND-ed. tolerationSeconds
is expected to be used with minimumPreemptablePriority
. It means tolerationSeconds
specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority as described in the comment of this property. For example,
Assume low-non-preempted-10min
declared with minimumPreemptablePriority=10000, tolerationSeconds=600
. Then, system-critical, high, medium
priorites' preemption behavior will be like below:
# system-critical can preempt low-10min pods immediately
# because low-non-preempted-10min can't tolerate the preemption from this priority class
# (because of minimumPreemptablePriority=10000)
system-critical: 10000
# high and medium pods can preempt low-non-preempted-10min pods
# which elapsed at least 10min from being scheduled because toleration expired in 10 min
# (because of tolerationSeconds=600)
high: 9000
medium: 8500
# assume low-non-preempted-10min has a preemption toleration policy
# with minimumPreemptablePriority=10000, tolerationSeconds=600
low-non-preempted-10min: 8000
updated in 5ab5cad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought minimumPreemptablePriority
and tolerationSeconds
work independently. The current semantics looks good - it can only tolerate the pods with priority lower than minimumPreemptablePriority
, not always. Can you somehow emphasize it in the example? You did mention it in the latter section (API/implementations), but it'd be better to also emphasize it in the earlier section (the examples).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you somehow emphasize it in the example? You did mention it in the latter section (API/implementations), but it'd be better to also emphasize it in the earlier section (the examples).
Thanks for your suggestion. I agree. I think the "Use Cases" section would be better not to require preemption toleration API knowledge to readers. So, I added the section "Implementing Typical Use Cases by Preemption Toleration APIs" under the "Preemption Toleration API" section so that it can describe how the API works in detail by examples. And I added the link to the section in "Use Cases" section.
updated in c50b6e2 .
PTAL 🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Huang-Wei sorry for the late reply because it was a holiday week in Japan last week. I updated the branch and commented back for some. PTAL 🙇 I hope this is the last turn.
// Nil value means forever, which is not preempted forever (default) | ||
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated #205 (comment):
TolerationSeconds: nil
: it can tolerate forever (perhaps this is a common use case, I think)TolerationSeconds: 0
: it can't tolerate at all (tolerate in 0 second).
So, how about this??
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// It defaults to Nil if not set. Nil value means the pod can tolerate the preemption forever. Zero value means the pod will be preempted immediately. i.e., no toleration at all.
a25320f
to
b15a950
Compare
high: 8000 # high can preempt pods in low-non-preempted-{10,30}min priority class which elapsed at least {10,30} minutes. | ||
low: 8000 # no minimum running time guarantee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace the tabs with spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 697ab81
|
||
```yaml | ||
system-critical: 10000 # system-critical can preempt pods in low-non-preempted-{10,30}min priority immediately | ||
high: 8000 # high can preempt pods in low-non-preempted-{10,30}min priority class which elapsed at least {10,30} minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 640fca7
cd1f519
to
5a86c4d
Compare
5a86c4d
to
7d1a6a9
Compare
@everpeace LGTM right now. Please squash the commits, and I will approve later. |
7d1a6a9
to
4fae905
Compare
4fae905
to
aa22f0c
Compare
@Huang-Wei Thank you very much for the review! I squashed my branch and rebased it to the latest master. PTAL 🙇 |
/ok-to-test /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: everpeace, Huang-Wei The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR adds a KEP that proposes "Preemption Toleration" plugin which provides a more flexible preemption behavior by introducing a preemption toleration policy in
PriorityClass
resources (defining in the annotation).The concept of preemption toleration policy is similar to pod toleration. The preemption toleration policy defines the condition in which the target priority class can tolerate (prevent from) the preemption by other priority classes or not. That is a preemptee (a.k.a. victim) side policy that is not covered by Kubernetes official priority and preemption API.
I would be happy if I have community feedbacks. Once the PR is approved, I will open the implementation and tests PR based on my PoC(everpeace#2).
Note: I would like to use this plugin in our on-premise Deep Learning focused GPU clusters (N thousands of GPUs and M hundreds of users).
TODO