PreemptionToleration: added the section that describes how the api works by example and added links to the section in use cases section.

everpeace authored Aug 4, 2021
1 parent 640fca7 commit c50b6e2
Showing 1 changed file with 90 additions and 6 deletions.
96 changes: 90 additions & 6 deletions kep/205-preemption-toleration/README.md
@@ -8,9 +8,12 @@
- [Non-Goals](#non-goals)
- [Use Cases](#use-cases)
- [Lower priority value but not-being-preempted priority](#lower-priority-value-but-not-being-preempted-priority)
- [Guarantee to running at least N minutes even in lower priority](#guarantee-to-running-at-least-n-minutes-even-in-lower-priority)
- [Conditional preemption with guaranteed running time](#conditional-preemption-with-guaranteed-running-time)
- [Design Details](#design-details)
- [Preemption Toleration API](#preemption-toleration-api)
- [Implementing Typical Use Cases by Preemption Toleration APIs](#implementing-typical-use-cases-by-preemption-toleration-apis)
- [Lower priority value but not-being-preempted priority](#lower-priority-value-but-not-being-preempted-priority-1)
- [Conditional preemption with guaranteed running time](#conditional-preemption-with-guaranteed-running-time-1)
- [Plugin implementation](#plugin-implementation)
- [PostFilter](#postfilter)
- [Implementation History](#implementation-history)
@@ -51,11 +54,13 @@ On the other hand, from the cluster administrative view, unconditional non-preem
```yaml
system-critical: 10000
high: 9000
low: 8000
low-non-preempted: 8000 # with minimumPreemptablePriority=10000
# (i.e. high can't preempt this, but system-critical can preempt this)
low-non-preempted: 8000 # this priority class can tolerate preemption from priorities less than 10000
# (i.e. high can't preempt this, but system-critical can preempt this)
low: 8000 # normal preemption happens on this priority class, i.e. p > 8000 can preempt this priority class.
```
To realise this feature, the plugin introduces `MinimumPreemptablePriority` in the preemption toleration policy API. See the [Preemption Toleration API](#preemption-toleration-api) and [Implementing Typical Use Cases by Preemption Toleration APIs](#implementing-typical-use-cases-by-preemption-toleration-apis) sections below.

### Conditional preemption with guaranteed running time

This is a typical use case for machine learning jobs. As described above, such jobs need checkpointing. However, most machine learning jobs are iterative processes whose iterations are called _epochs_. So, if the cluster is under high load and preemption always happens before the first epoch finishes, the job can never make any progress and the computational resources it consumed are wasted.
@@ -72,6 +77,8 @@ low-non-preempted-10min: 8000 # 10 min minimum running time guarantee against
low-non-preempted-30min: 8000 # 30 min minimum running time guarantee against 8000 < p < 10000
```

To realise this feature, the plugin also introduces `TolerationSeconds` in the preemption toleration policy API. See the [Preemption Toleration API](#preemption-toleration-api) and [Implementing Typical Use Cases by Preemption Toleration APIs](#implementing-typical-use-cases-by-preemption-toleration-apis) sections below.

## Design Details

### Preemption Toleration API
@@ -107,13 +114,90 @@ metadata:
    # this key is needed to enable the preemption toleration policy; it distinguishes
    # between no toleration policy and an empty toleration policy (all fields take their defaults)
    preemption-toleration.scheduling.sigs.k8s.io/enabled: ""
    # This priority class can tolerate preemption by priority with p < 10000.
    preemption-toleration.scheduling.sigs.k8s.io/minimum-preemptable-priority: "10000"
    # And it can tolerate preemption for 1 hour by pods with priority p < 10000.
    preemption-toleration.scheduling.sigs.k8s.io/toleration-seconds: "3600"
value: 8000
```
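
To make the annotation scheme above concrete, the following is a minimal Python sketch of how the three annotations could be read into a policy object. It is not the plugin's code: `TolerationPolicy` and `parse_policy` are hypothetical names, and the defaults used for omitted fields (`value + 1` and `0` seconds) are assumptions, since this section only says that all fields "will be default".

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Annotation key prefix taken from the manifest above.
PREFIX = "preemption-toleration.scheduling.sigs.k8s.io/"


@dataclass(frozen=True)
class TolerationPolicy:
    """Hypothetical in-memory form of the preemption toleration policy."""
    minimum_preemptable_priority: int
    toleration_seconds: int


def parse_policy(annotations: Mapping[str, str], value: int) -> Optional[TolerationPolicy]:
    """Return the policy encoded in PriorityClass annotations, or None if not enabled."""
    if PREFIX + "enabled" not in annotations:
        return None  # no toleration policy at all
    # ASSUMPTION: defaults for omitted fields are value + 1 and 0 seconds.
    min_priority = int(annotations.get(PREFIX + "minimum-preemptable-priority", value + 1))
    seconds = int(annotations.get(PREFIX + "toleration-seconds", 0))
    return TolerationPolicy(min_priority, seconds)


policy = parse_policy(
    {
        PREFIX + "enabled": "",
        PREFIX + "minimum-preemptable-priority": "10000",
        PREFIX + "toleration-seconds": "3600",
    },
    value=8000,
)
print(policy)
```

The presence check on the `enabled` key is what separates "no policy" (`None`) from "empty policy" (a `TolerationPolicy` filled with defaults), which is the distinction the comment in the manifest describes.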
### Implementing Typical Use Cases by Preemption Toleration APIs
This section describes how to implement the scenarios described in the [Use Cases](#use-cases) section with the preemption toleration policy.
#### Lower priority value but not-being-preempted priority
Assume a cluster administrator introduces a `low-non-preempted` priority class that cannot be preempted by `high` but can be preempted by `system-critical`. They would declare the `low-non-preempted` `PriorityClass` with a preemption toleration policy of `MinimumPreemptablePriority=10000,TolerationSeconds=-1`, as below:

```yaml
# system-critical can preempt low-non-preempted pods immediately
# because low-non-preempted can't tolerate preemption from this priority class
# (because of MinimumPreemptablePriority=10000)
system-critical: 10000
# high pods can never preempt low-non-preempted pods
# because low-non-preempted tolerates preemption forever
# by any priority class with p < 10000 (=MinimumPreemptablePriority)
# (because of MinimumPreemptablePriority=10000 and TolerationSeconds=-1)
high: 9000
# with MinimumPreemptablePriority=10000,TolerationSeconds=-1
low-non-preempted: 8000
# low has no preemption toleration policy, so normal preemption applies,
# i.e. this priority class will be preempted by priority p > 8000
low: 8000
```

Thus, the `low-non-preempted` manifest would look like this:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-non-preempted
  annotations:
    preemption-toleration.scheduling.sigs.k8s.io/enabled: ""
    # This priority class can tolerate preemption by priority with p < 10000.
    preemption-toleration.scheduling.sigs.k8s.io/minimum-preemptable-priority: "10000"
    # This priority class can tolerate preemption forever by priority with p < 10000 (=minimum-preemptable-priority)
    preemption-toleration.scheduling.sigs.k8s.io/toleration-seconds: "-1"
value: 8000
```
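
The outcome of this configuration can be checked with a small Python sketch. It only models the priority comparison, which is enough here because `TolerationSeconds=-1` means the toleration never expires. `can_preempt` is a hypothetical helper written for illustration, not part of the plugin.

```python
from typing import Optional

# Priority values from the example above.
CLASSES = {
    "system-critical": 10000,
    "high": 9000,
    "low": 8000,
}


def can_preempt(preemptor: int, victim: int,
                minimum_preemptable_priority: Optional[int]) -> bool:
    """Priority-only check for a victim whose TolerationSeconds is -1 (forever).

    minimum_preemptable_priority=None means the victim has no toleration policy.
    """
    if preemptor <= victim:
        return False  # ordinary rule: only a strictly higher priority can preempt
    if minimum_preemptable_priority is None:
        return True  # no toleration policy, normal preemption
    # Tolerated forever below the threshold, never tolerated at or above it.
    return preemptor >= minimum_preemptable_priority


for name, priority in CLASSES.items():
    vs_tolerant = can_preempt(priority, 8000, 10000)  # victim: low-non-preempted
    vs_plain = can_preempt(priority, 8000, None)      # victim: low
    print(f"{name:16} low-non-preempted={vs_tolerant} low={vs_plain}")
```

Under these assumptions, only `system-critical` can preempt `low-non-preempted`, while both `system-critical` and `high` can preempt plain `low`.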

#### Conditional preemption with guaranteed running time

Assume a cluster administrator introduces a `low-non-preempted-10min` priority class that tolerates preemption by the `high` priority class for at least 10 minutes, i.e. it guarantees 10 minutes of running time for pods in this priority class. They would declare the `low-non-preempted-10min` `PriorityClass` with a preemption toleration policy of `MinimumPreemptablePriority=10000,TolerationSeconds=600`, as below:

```yaml
# system-critical can preempt low-non-preempted-10min pods immediately
# because low-non-preempted-10min can't tolerate the preemption from this priority class
# (because of MinimumPreemptablePriority=10000)
system-critical: 10000
# high pods can preempt low-non-preempted-10min pods only after at least
# 10 minutes have elapsed since they were scheduled, because the toleration expires after 10 minutes
# (because of MinimumPreemptablePriority=10000 and TolerationSeconds=600)
high: 9000
# with MinimumPreemptablePriority=10000, TolerationSeconds=600
low-non-preempted-10min: 8000
```

Thus, the `low-non-preempted-10min` manifest would look like this:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-non-preempted-10min
  annotations:
    preemption-toleration.scheduling.sigs.k8s.io/enabled: ""
    # This priority class can tolerate preemption by priority with p < 10000.
    preemption-toleration.scheduling.sigs.k8s.io/minimum-preemptable-priority: "10000"
    # This priority class can tolerate preemption for 10 minutes (600 seconds)
    # by priority with p < 10000 (=minimum-preemptable-priority)
    preemption-toleration.scheduling.sigs.k8s.io/toleration-seconds: "600"
value: 8000
```
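
The time-limited case can be sketched in Python as follows. `tolerates_preemption` is a hypothetical function that combines the priority threshold with the elapsed time since the pod was scheduled. Treating the toleration as expired at exactly `TolerationSeconds` is an assumption, chosen to match the "at least 10 minutes" wording above.

```python
from datetime import datetime, timedelta, timezone


def tolerates_preemption(preemptor_priority: int,
                         minimum_preemptable_priority: int,
                         toleration_seconds: int,
                         scheduled_at: datetime,
                         now: datetime) -> bool:
    """Return True if the victim pod is exempt from preemption right now."""
    if preemptor_priority >= minimum_preemptable_priority:
        return False  # at or above the threshold: never tolerated
    if toleration_seconds < 0:
        return True  # -1 means the toleration never expires
    # ASSUMPTION: the toleration window is [scheduled_at, scheduled_at + seconds).
    return now < scheduled_at + timedelta(seconds=toleration_seconds)


scheduled_at = datetime(2021, 8, 4, 12, 0, tzinfo=timezone.utc)

# high (9000) against low-non-preempted-10min, 5 minutes after scheduling
after_5min = tolerates_preemption(9000, 10000, 600, scheduled_at,
                                  scheduled_at + timedelta(minutes=5))
# high (9000) again, 10 minutes after scheduling
after_10min = tolerates_preemption(9000, 10000, 600, scheduled_at,
                                   scheduled_at + timedelta(minutes=10))
# system-critical (10000), 1 minute after scheduling
critical = tolerates_preemption(10000, 10000, 600, scheduled_at,
                                scheduled_at + timedelta(minutes=1))
print(after_5min, after_10min, critical)
```

With `toleration_seconds=-1` the same function reproduces the previous use case, so the two scenarios differ only in that one parameter.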

### Plugin implementation

#### PostFilter