# Add documentation for disk based eviction #1196
@@ -31,10 +31,25 @@ summary API.
| Eviction Signal | Description |
|------------------|---------------------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |

Each of the above signals supports either a literal or percentage-based value. The percentage-based value
is calculated relative to the total capacity associated with each signal.
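
For example (the capacity figure here is hypothetical), on a node whose `nodefs` capacity is `100Gi`, the
following two thresholds are equivalent:

```
--eviction-hard=nodefs.available<10%
--eviction-hard=nodefs.available<10Gi
```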

`kubelet` supports only two filesystem partitions:

1. The `nodefs` filesystem that the kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor, and does not care about any
other filesystems. Any other configuration is not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated filesystem.

In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
support in favor of eviction in response to disk pressure.

> **Review comment:** We'll still do some garbage collection though, right? Saying we're going to deprecate it is confusing.
>
> **Review comment:** Garbage collection (for both images and containers) is more aggressive than the disk eviction, AFAIK. For example, it deletes containers associated with deleted pods, and it also ensures we keep no more than N containers per (pod, container) tuple. Is the plan to fold garbage collection into eviction?
>
> **Reply:** Yes.

### Eviction Thresholds
@@ -47,6 +62,14 @@ Each threshold is of the following form:
* valid `eviction-signal` tokens as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with the `%` token.

For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction threshold can be specified as either
of the following (but not both).

* `memory.available<10%`
* `memory.available<1Gi`

> **Review comment:** Where are these specified? I assume there's a flag / config for it somewhere? What are the defaults?
>
> **Reply:** They are specified as flags on the kubelet that are defined later in this doc. This document does not define defaults; defaults right now are specific to a commercial offering, IMO.
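
As a sketch of what the Kubernetes quantity representation means in practice (assuming the
`k8s.io/apimachinery/pkg/api/resource` package, where quantity parsing lives; this package is an
assumption, not something defined by this doc), the `1Gi` token used above is a binary-SI quantity:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Parse the same quantity token used in an eviction threshold
	// such as `memory.available<1Gi`.
	q := resource.MustParse("1Gi")

	// Value() reports the quantity in base units (bytes): 1Gi = 2^30.
	fmt.Println(q.Value()) // 1073741824
}
```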

#### Soft Eviction Thresholds
@@ -84,6 +107,10 @@ To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that, if met,
would trigger a pod eviction.

The `kubelet` has the following default hard eviction threshold:

* `--eviction-hard=memory.available<100Mi`

### Eviction Monitoring Interval

The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
@@ -103,6 +130,7 @@ The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|----------------|------------------|------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
@@ -124,15 +152,44 @@ The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.

### Reclaiming node level resources

If an eviction threshold has been met and the grace period has passed,
the `kubelet` will initiate the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.

The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently depending on whether the
machine has a dedicated `imagefs` configured for the container runtime.
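
For example (the threshold values here are hypothetical), a node with a dedicated `imagefs` might carry a
separate hard threshold for each filesystem:

```
--eviction-hard=nodefs.available<1Gi,imagefs.available<10Gi
```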

#### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers

If the `imagefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete all unused images

#### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers
1. Delete all unused images

### Evicting end-user pods

If the `kubelet` is unable to reclaim sufficient resource on the node,
it will begin evicting pods.

The `kubelet` ranks pods for eviction as follows:

* by their quality of service
* by the consumption of the starved compute resource relative to the pod's scheduling request

As a result, pod eviction occurs in the following order:

* `BestEffort` pods that consume the most of the starved resource are failed
first.

@@ -151,6 +208,49 @@ and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
`Guaranteed` pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other `Guaranteed` pod(s).
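
A minimal sketch of this ranking (illustrative only, not the kubelet's implementation; the QoS encoding,
type, and field names are assumptions made for the example):

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical stand-in for a pod considered for eviction.
type candidate struct {
	name    string
	qos     int   // assumed encoding: 0 = BestEffort, 1 = Burstable, 2 = Guaranteed
	usage   int64 // observed usage of the starved resource
	request int64 // the pod's scheduling request for that resource (0 if none)
}

// rankForEviction orders candidates so the first entries are evicted first:
// lower quality of service first, then, within a class, the pods consuming
// the most of the starved resource relative to their scheduling request.
func rankForEviction(pods []candidate) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].qos != pods[j].qos {
			return pods[i].qos < pods[j].qos
		}
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []candidate{
		{name: "guaranteed-db", qos: 2, usage: 900, request: 1000},
		{name: "besteffort-batch", qos: 0, usage: 400, request: 0},
		{name: "burstable-web", qos: 1, usage: 700, request: 500},
	}
	rankForEviction(pods)
	fmt.Println(pods) // besteffort-batch ranks first for eviction, guaranteed-db last
}
```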

Local disk is a `BestEffort` resource. If necessary, the `kubelet` will evict pods one at a time to reclaim
disk when `DiskPressure` is encountered. The `kubelet` ranks pods by quality of service. If the `kubelet`
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
first. If the `kubelet` is responding to lack of available disk, it will rank pods within a quality of service
class by their disk consumption, and kill the largest consumers first.

#### With Imagefs

If `nodefs` is triggering evictions, the `kubelet` will sort pods based on their usage of `nodefs`:
local volumes + logs of all their containers.

If `imagefs` is triggering evictions, the `kubelet` will sort pods based on the writable layer usage of all their containers.

#### Without Imagefs

If `nodefs` is triggering evictions, the `kubelet` will sort pods based on their total disk usage:
local volumes + logs & writable layer of all their containers.

### Minimum eviction reclaim

In certain scenarios, eviction of pods may reclaim only a small amount of a resource, which can result in the
`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming some resources, such as `disk`,
is time consuming.

To mitigate these issues, the `kubelet` supports a per-resource `minimum-reclaim`. Whenever the `kubelet` observes
resource pressure, it will attempt to reclaim the resource until the signal is at least `minimum-reclaim`
beyond the configured eviction threshold.

For example, with the following configuration:

```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```

If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.
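
Put another way, for each signal the `kubelet` reclaims until the signal's value reaches:

```
reclaim target = eviction threshold + eviction-minimum-reclaim
```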

The default `eviction-minimum-reclaim` is `0` for all resources.

### Scheduler
The node will report a condition when a compute resource is under pressure. The | ||
|
@@ -159,7 +259,8 @@ pods on the node. | |
|
||
| Node Condition | Scheduler Behavior |
| ---------------- | ------------------------------------------------ |
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
| `DiskPressure` | No new pods are scheduled to the node. |

## Node OOM Behavior
@@ -223,3 +324,46 @@ candidate set of pods provided to the eviction strategy.
In general, it is strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead `DaemonSet` should ideally launch `Guaranteed` pods.

## Deprecation of existing feature flags to reclaim disk

`kubelet` has been freeing up disk space on demand to keep the node stable.

As disk based eviction matures, the following `kubelet` flags will be marked for deprecation
in favor of the simpler configuration supported around eviction.

> **Review comment:** I like simplifying the configuration, just one question: will eviction only happen when the threshold is met? More aggressive, periodic cleanup seems desirable IMO.
>
> **Reply:** We should discuss this in a design issue; it's tangential to the doc. What's documented is the current plan of record.
| Existing Flag | New Flag |
| ------------- | -------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |

## Known issues

### kubelet may not observe memory pressure right away

The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, and instead have the kernel tell us immediately when a threshold has been crossed.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs, and promotes eviction of workloads so cluster state can rebalance.
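
For example (an illustrative value; tune it to your workloads), to have the `kubelet` report memory pressure
once roughly 75% of node memory is consumed, set the threshold on the remaining 25%:

```
--eviction-hard=memory.available<25%
```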

### kubelet may evict more pods than needed

Pod eviction may evict more pods than needed due to a timing gap in stats collection. This can be mitigated in the
future by adding the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247).

### How kubelet ranks pods for eviction in response to inode exhaustion

At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
to track per container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
that pod over others.

> **Review comment:** How does it rank pods in the same QoS class?
>
> **Reply:** This is addressed under the "Known issues" section, "How kubelet ranks pods for eviction in response to inode exhaustion". In the future, we want it to rank by consumption when we can know it.
>
> **Reply:** To be clear, it does not differentiate within a class, so it's random.

> **Review comment:** Does this hold for `memory.available` too?
>
> **Reply:** Yes, it holds for all the things.