Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP: Preemption Toleration Plugin #205

Merged

Conversation

everpeace
Copy link
Contributor

@everpeace everpeace commented Jun 24, 2021

This PR adds a KEP that proposes "Preemption Toleration" plugin which provides a more flexible preemption behavior by introducing a preemption toleration policy in PriorityClass resources (defining in the annotation).

The concept of preemption toleration policy is similar to pod toleration. The preemption toleration policy defines the condition in which the target priority class can tolerate (prevent from) the preemption by other priority classes or not. That is a preemptee (a.k.a. victim) side policy that is not covered by Kubernetes official priority and preemption API.

I would be happy if I have community feedbacks. Once the PR is approved, I will open the implementation and tests PR based on my PoC(everpeace#2).

Note: I would like to use this plugin in our on-premise Deep Learning focused GPU clusters (N thousands of GPUs and M hundreds of users).

TODO

  • squash the branch once it was approved

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 24, 2021
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 24, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @everpeace!

It looks like this is your first PR to kubernetes-sigs/scheduler-plugins 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/scheduler-plugins has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @everpeace. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 24, 2021
@everpeace everpeace force-pushed the preemption-toleration-kep branch 3 times, most recently from 599614e to 63b20e7 Compare June 24, 2021 11:51
system-critical: 10000
high: 9000
low: 8000
low-non-preempted: 8000 # with minimumPreemptablePriority=10000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If comparing this case with "low-non-preempted=10000 and PreemptionPolicy=Never", is the only difference that you don't want "low-non-preempted" pods to preempt pods with priority value ranging from 8000 to 10000?

If so, is that an alternative is to add another option to PreemptionPolicy (like "only preempt pods with priority value that's less than X")?

Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If comparing this case with "low-non-preempted=10000 and PreemptionPolicy=Never", is the only difference that you don't want "low-non-preempted" pods to preempt pods with priority value ranging from 8000 to 10000?

Yes for preemption behavior. But the scheduling behavior will be different because the priority value affects the rank in the scheduler queue.

If so, is that an alternative is to add another option to PreemptionPolicy (like "only preempt pods with priority value that's less than X")?

Focusing on preemption behavior, yes, it could be. However, we also proposed tolerationSeconds, which will be essentially the preemptee-side configuration described as "Guarantee to running at least N minutes even in lower priority" section. Thus, I think making them be preemptee-side configuration would be simpler here. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the tolerationSeconds fits more on the preemptee side.

Comment on lines 104 to 113
scheduling.sigs.k8s.io/preemption-toleration: |
# the api is versioned for future enhancement
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: PreemptionToleration

# This priority class can tolerate preemption by priority with 8000 <= p < 10000.
minimumPreemptablePriority: 10000

# And it can tolerate preemption in 1 hour by the pod with priority (8000 <= p < 10000).
tolerationSeconds: 3600
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to exact all the settings to a CustomResource object; otherwise, I don't think it's practical to versioning/defaulting/converting on the annotated string.

Or, you give up support multi-version toleration policies - i.e., define the settings as scheduling.sigs.k8s.io/preemption-toleration-XYZ, individually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. If we choose CRD case, PriorityClass would need to refer PreemptionToleration CR by annotation or a similar way. Then, I would like to choose the annotation case.

Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 266f863 and ab2d0ff

#### PostFilter

This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'd need to add a stanza to describe how to honor the preemption tolerationSeconds setting, e.g., implement the controller logic to monitor/manage the pods with preemption tolerationSeconds settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I will document where to how to modify the default preemption plugin.

Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 8484681

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the details of PostFilter is awesome. But what I meant in the original comment was the "controller" logic - like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).

(A reference is that Kubernetes uses a nodelifecycle manager to guarantee the tolerationSeconds' semantics)

Copy link
Contributor Author

@everpeace everpeace Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).

preemptionTolerationSeconds does not require any controller logic because the duration is defined as the duration from the time being scheduled of the victim candidate pod. So, the scheduler plugin can determine the victim candidate can tolerate the preemption or not by itself like:

	// in the PreemptionToleration plugin:

	// check it can tolerate the preemption in terms of toleration seconds
	_, scheduledCondition := podutil.GetPodCondition(&victimCandidate.Status, v1.PodScheduled)
	if scheduledCondition == nil {
		return canTolerateOnPriorityValue, nil
	}
	canTolerateOnTolerationSeconds := !now.After(
		scheduledCondition.LastTransitionTime.Time.Add(time.Duration(*toleration.TolerationSeconds)*time.Second),
	)

Is it better to explain this (we no need to implement a controller) in this section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in another thread #205 (comment)

// MinimumPreemptablePriority specifies the minimum priority value that can preempt this priority class.
MinimumPreemptablePriority *int32 `json:"minimumPreemptablePriority,omitempty"`

// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope my understanding is correct:

Suggested change
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes).
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// Nil or zero value means it's preemptable immediately (default).
// Non-nil value means the duration from the pod being scheduled to some node.
// This value affects scheduled pods only (no effect on nominated pods).

Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to the same semantics with tolerationSeconds in Pod:

By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.

Thus, I would like to fix it to be a more clear explanation like below. What do you think?

Suggested change
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes).
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// Nil value means forever, which is not preempted forever (default)
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all.
// This value affects scheduled pods only (no effect on nominated pods).

Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 316a022

Copy link
Contributor Author

@everpeace everpeace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei Thanks for your comment! I replied to your comments. PTAL 🙇

system-critical: 10000
high: 9000
low: 8000
low-non-preempted: 8000 # with minimumPreemptablePriority=10000
Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If comparing this case with "low-non-preempted=10000 and PreemptionPolicy=Never", is the only difference that you don't want "low-non-preempted" pods to preempt pods with priority value ranging from 8000 to 10000?

Yes for preemption behavior. But the scheduling behavior will be different because the priority value affects the rank in the scheduler queue.

If so, is that an alternative is to add another option to PreemptionPolicy (like "only preempt pods with priority value that's less than X")?

Focusing on preemption behavior, yes, it could be. However, we also proposed tolerationSeconds, which will be essentially the preemptee-side configuration described as "Guarantee to running at least N minutes even in lower priority" section. Thus, I think making them be preemptee-side configuration would be simpler here. What do you think?

Comment on lines 104 to 113
scheduling.sigs.k8s.io/preemption-toleration: |
# the api is versioned for future enhancement
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: PreemptionToleration

# This priority class can tolerate preemption by priority with 8000 <= p < 10000.
minimumPreemptablePriority: 10000

# And it can tolerate preemption in 1 hour by the pod with priority (8000 <= p < 10000).
tolerationSeconds: 3600
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. If we choose CRD case, PriorityClass would need to refer PreemptionToleration CR by annotation or a similar way. Then, I would like to choose the annotation case.

#### PostFilter

This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I will document where to how to modify the default preemption plugin.

// MinimumPreemptablePriority specifies the minimum priority value that can preempt this priority class.
MinimumPreemptablePriority *int32 `json:"minimumPreemptablePriority,omitempty"`

// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes).
Copy link
Contributor Author

@everpeace everpeace Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to the same semantics with tolerationSeconds in Pod:

By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.

Thus, I would like to fix it to be a more clear explanation like below. What do you think?

Suggested change
// TolerationSeconds specified how many seconds this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority. Null value specifies forever (default). Duration means the duration from the pod being scheduled to some node. This value affects only to scheduled pods (no effect on nominated nodes).
// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.
// Nil value means forever, which is not preempted forever (default)
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all.
// This value affects scheduled pods only (no effect on nominated pods).

@everpeace everpeace force-pushed the preemption-toleration-kep branch from b162e8d to 316a022 Compare July 1, 2021 09:57

### Lower priority value but not-being-preempted priority

Generally speaking, making a job be resumable after preemption, the job would need to preserve its state to outside of the pod, e.g. persistent volume, object storage, etc., at preemption and restore it at next restart (a.k.a check-pointing). However, when a job uses some proprietary tools that are not modifiable or do not support suspend/resume feature, users would like to avoid being preempted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to outside of the pod -> externally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in b8c0b87

system-critical: 10000
high: 9000
low: 8000
low-non-preempted: 8000 # with minimumPreemptablePriority=10000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the tolerationSeconds fits more on the preemptee side.

name: toleration-policy-sample
annotation:
# this key is needed to enable preemption toleration policy for distinguishing
# between no toleration policy and empty toleration policy (all fields will be default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: replace tab with empty characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 8be0fa2

metadata:
name: toleration-policy-sample
annotation:
# this key is needed to enable preemption toleration policy for distinguishing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this is another shortage of annotations (vs. CRD). It's ok to start with annotations as an alpha feature, but once we graduate it to beta, do you have any plan to turn to CRD? - which may better manage the specs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to start with annotations as an alpha feature, but once we graduate it to beta, do you have any plan to turn to CRD? - which may better manage the specs.

I would like to start with annotations because the policy spec is simple enough, and no controller logic is required. When graduating it to beta, we plan to transform it to CRD.


```go
type PreemptionToleration struct {
// MinimumPreemptablePriority specifies the minimum priority value that can preempt this priority class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe the default value/behavior if it's not set. (like, if not set, pods that have a higher priority value can preempt it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 8d7f4da

#### PostFilter

This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the details of PostFilter is awesome. But what I meant in the original comment was the "controller" logic - like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).

(A reference is that Kubernetes uses a nodelifecycle manager to guarantee the tolerationSeconds' semantics)

Copy link
Contributor Author

@everpeace everpeace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei Thanks for the review. I pushed some fixes and replied to your comments. PTAL 🙇

metadata:
name: toleration-policy-sample
annotation:
# this key is needed to enable preemption toleration policy for distinguishing
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to start with annotations as an alpha feature, but once we graduate it to beta, do you have any plan to turn to CRD? - which may better manage the specs.

I would like to start with annotations because the policy spec is simple enough, and no controller logic is required. When graduating it to beta, we plan to transform it to CRD.

#### PostFilter

This plugin implements the PostFilter extension point because this plugin focuses to extend the scheduler's preemption behavior. Moreover, this plugin just extends victim candidate selection logics in the default preemption behavior.

Copy link
Contributor Author

@everpeace everpeace Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like how you want the controller to be implemented, launched and packaged (into scheduler or controller binary?).

preemptionTolerationSeconds does not require any controller logic because the duration is defined as the duration from the time being scheduled of the victim candidate pod. So, the scheduler plugin can determine the victim candidate can tolerate the preemption or not by itself like:

	// in the PreemptionToleration plugin:

	// check it can tolerate the preemption in terms of toleration seconds
	_, scheduledCondition := podutil.GetPodCondition(&victimCandidate.Status, v1.PodScheduled)
	if scheduledCondition == nil {
		return canTolerateOnPriorityValue, nil
	}
	canTolerateOnTolerationSeconds := !now.After(
		scheduledCondition.LastTransitionTime.Time.Add(time.Duration(*toleration.TolerationSeconds)*time.Second),
	)

Is it better to explain this (we no need to implement a controller) in this section?

@Huang-Wei
Copy link
Contributor

Re #205 (comment):

(Github doesn't allow me to comment there)

Oh, I misunderstood it. So it means upon a preemption attempt, the Pods that can be tolerated won't be preempted, also they won't be deleted if the toleration seconds have passed - which is different with Pod's tolerationSeconds where the Pod will be deleted forcibly when tolerationSeconds is used up.

In your case, it's more a concept of "guaranteed minimum duration to run". These Pods can be only preempted when the time passed the minimum duration. Is my understanding correct?

@everpeace
Copy link
Contributor Author

@Huang-Wei

In your case, it's more a concept of "guaranteed minimum duration to run". These Pods can be only preempted when the time passed the minimum duration. Is my understanding correct?

Yes, that's right!!

@everpeace
Copy link
Contributor Author

Once I got approved, I would like to squash the commits so far.

Copy link
Contributor

@Huang-Wei Huang-Wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final comments on defaulting. LGTM otherwise.

Comment on lines 87 to 88
// Nil value means forever, which is not preempted forever (default)
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should distinguish nil from zero. How about:

// It defaults to zero if not set. Zero value means the pod will be preempted immediately. i.e., no toleration at all.

And negative value will be caught during validation phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated #205 (comment):

  • TolerationSeconds: nil : it can tolerate forever (perhaps this is a common use case, I think)
  • TolerationSeconds: 0: it can't tolerate at all (tolerate in 0 second).

So, how about this??

  // TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.  
  // It defaults to Nil if not set.  Nil value means the pod can tolerate the preemption forever.  Zero value means the pod will be preempted immediately. i.e., no toleration at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 671bf6f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@everpeace in terms of API design practices, it's a good practice to use a pointer for the internal-version struct (in pkg/apis/config/types.go); we do that for external versions (in pkg/apis/config/v1XYZ/types.go).

So when the versioned obj is wired in, we can tell if it's set to 0 or left as nil, and then default the value if it's nil (not set). In the internal processing that uses internal version objs, we don't need to worry the fields are set or not - default or user-specified value should have been set.

So the comment "It defaults to Nil if not set" is inaccurate - in internal versions, it should be a concrete value instead of pointer.

Copy link
Contributor Author

@everpeace everpeace Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei I see. Then, how about negative value means forever?? I would like to implement a policy that can tolerate forever in some way. How about below??

// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.  
// It defaults to zero if not set. Zero value means the pod will be preempted immediately. i.e., no toleration at all.
// Negative value means the pod can tolerate forever.

updated in b15a950

Actually, I would like to make tolerate forever be default thing. Would you have any idea about this?? I appreciate it if you give me some.

Copy link
Contributor Author

@everpeace everpeace Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OR, could we make it just leaveNil if not set (perhaps we can say no default value)?? how about this??

// TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.  
// If not set, the pod can tolerate the preemption forever.

Copy link
Contributor

@Huang-Wei Huang-Wei Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just leaveNil if not set

As I mentioned before, internal API object should be a concrect value (so nil value doesn't work, generally).

Your previous comment makes sense to me except for the negative value:

  • if it's not set (nil), internally default it to 0 - which means it can be preempted immediately
  • if it's set to a positive value - the duration will be honored
  • if it's set to a negative value, you can make it semantically mean "don't preempt forever". But I'm not sure this is really what you want? What if some pods set this to a negative value and then the high-priority can never preempt them and pending there forever? It seems to break the semantics of Priority.

Copy link
Contributor Author

@everpeace everpeace Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei

As I mentioned before, internal API object should be a concrect value (so nil value doesn't work, generally).

oh... sorry... I was confused.

But I'm not sure this is really what you want? What if some pods set this to a negative value and then the high-priority can never preempt them and pending there forever? It seems to break the semantics of Priority.

minimumPreemptablePriority and tolerationSeconds will be evaluated in AND-ed. tolerationSeconds is expected to be used with minimumPreemptablePriority. It means tolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority as described in the comment of this property. For example,

Assume low-non-preempted-10min declared with minimumPreemptablePriority=10000, tolerationSeconds=600. Then, system-critical, high, medium priorites' preemption behavior will be like below:

# system-critical can preempt low-10min pods immediately 
# because low-non-preempted-10min can't tolerate the preemption from this priority class
# (because of minimumPreemptablePriority=10000)
         system-critical: 10000 

# high and medium pods can preempt low-non-preempted-10min pods 
# which elapsed at least 10min from being scheduled because toleration expired in 10 min
# (because of tolerationSeconds=600)
                   high:  9000 
                 medium:  8500 

# assume low-non-preempted-10min has a preemption toleration policy 
# with minimumPreemptablePriority=10000, tolerationSeconds=600
low-non-preempted-10min:  8000

updated in 5ab5cad

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought minimumPreemptablePriority and tolerationSeconds work independently. The current semantics looks good - it can only tolerate the pods with priority lower than minimumPreemptablePriority, not always. Can you somehow emphasize it in the example? You did mention it in the latter section (API/implementations), but it'd be better to also emphasize it in the earlier section (the examples).

Copy link
Contributor Author

@everpeace everpeace Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei

Can you somehow emphasize it in the example? You did mention it in the latter section (API/implementations), but it'd be better to also emphasize it in the earlier section (the examples).

Thanks for your suggestion. I agree. I think the "Use Cases" section would be better not to require preemption toleration API knowledge to readers. So, I added the section "Implementing Typical Use Cases by Preemption Toleration APIs" under the "Preemption Toleration API" section so that it can describe how the API works in detail by examples. And I added the link to the section in "Use Cases" section.

updated in c50b6e2 .

PTAL 🙇

Copy link
Contributor Author

@everpeace everpeace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Huang-Wei sorry for the late reply because it was a holiday week in Japan last week. I updated the branch and commented back for some. PTAL 🙇 I hope this is the last turn.

Comment on lines 87 to 88
// Nil value means forever, which is not preempted forever (default)
// zero and negative value means immediate, which is preempted immediately. i.e. no toleration at all.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated #205 (comment):

  • TolerationSeconds: nil : it can tolerate forever (perhaps this is a common use case, I think)
  • TolerationSeconds: 0: it can't tolerate at all (tolerate in 0 second).

So, how about this??

  // TolerationSeconds specifies how long this priority class can tolerate preemption by priorities lower than MinimumPreemptablePriority.  
  // It defaults to Nil if not set.  Nil value means the pod can tolerate the preemption forever.  Zero value means the pod will be preempted immediately. i.e., no toleration at all.

@everpeace everpeace force-pushed the preemption-toleration-kep branch from a25320f to b15a950 Compare July 28, 2021 07:59
@everpeace everpeace requested a review from Huang-Wei July 28, 2021 08:15
Comment on lines 69 to 70
high: 8000 # high can preempt pods in low-non-preempted-{10,30}min priority class which elapsed at least {10,30} minutes.
low: 8000 # no minimum running time guarantee
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the tabs with spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 697ab81


```yaml
system-critical: 10000 # system-critical can preempt pods in low-non-preempted-{10,30}min priority immediately
high: 8000 # high can preempt pods in low-non-preempted-{10,30}min priority class which elapsed at least {10,30} minutes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9000?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 640fca7

@everpeace everpeace force-pushed the preemption-toleration-kep branch 5 times, most recently from cd1f519 to 5a86c4d Compare August 4, 2021 05:24
@everpeace everpeace force-pushed the preemption-toleration-kep branch from 5a86c4d to 7d1a6a9 Compare August 4, 2021 05:31
@Huang-Wei
Copy link
Contributor

@everpeace LGTM right now. Please squash the commits, and I will approve later.

@everpeace everpeace force-pushed the preemption-toleration-kep branch from 7d1a6a9 to 4fae905 Compare August 5, 2021 01:21
@everpeace everpeace force-pushed the preemption-toleration-kep branch from 4fae905 to aa22f0c Compare August 5, 2021 01:22
@everpeace
Copy link
Contributor Author

@Huang-Wei Thank you very much for the review! I squashed my branch and rebased it to the latest master. PTAL 🙇

@everpeace everpeace requested a review from Huang-Wei August 5, 2021 03:36
@Huang-Wei
Copy link
Contributor

/ok-to-test

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 5, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 5, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: everpeace, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit 74438bd into kubernetes-sigs:master Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants