diff --git a/docs/admin/out-of-resource.md b/docs/admin/out-of-resource.md
index 16b8cc9e1e3fb..480e074c42d21 100644
--- a/docs/admin/out-of-resource.md
+++ b/docs/admin/out-of-resource.md
@@ -31,10 +31,25 @@ summary API.
 | Eviction Signal | Description |
 |------------------|---------------------------------------------------------------------------------|
 | `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
+| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
+| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
+| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
+| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
 
-In future releases, the `kubelet` will support the ability to trigger eviction decisions based on disk pressure.
+Each of the above signals supports either a literal or a percentage-based value. A percentage-based value
+is calculated relative to the total capacity associated with each signal.
 
-Until that time, it is recommended users take advantage of [garbage collection](/docs/admin/garbage-collection/).
+`kubelet` supports only two filesystem partitions:
+
+1. The `nodefs` filesystem that the kubelet uses for volumes, daemon logs, etc.
+1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.
+
+`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor and ignores any
+other filesystems. No other configuration is currently supported; for example, it is
+*not OK* to store volumes and logs on a dedicated filesystem.
+
+In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
+support in favor of eviction in response to disk pressure.
 
 ### Eviction Thresholds
 
@@ -47,6 +62,14 @@ Each threshold is of the following form:
 * valid `eviction-signal` tokens as defined above.
 * valid `operator` tokens are `<`
 * valid `quantity` tokens must match the quantity representation used by Kubernetes
+* an eviction threshold can be expressed as a percentage if it ends with the `%` token.
+
+For example, if a node has `10Gi` of memory and the desire is to induce eviction
+when available memory falls below `1Gi`, the eviction threshold can be specified as either
+of the following (but not both):
+
+* `memory.available<10%`
+* `memory.available<1Gi`
 
 #### Soft Eviction Thresholds
 
@@ -84,6 +107,10 @@ To configure hard eviction thresholds, the following flag is supported:
 * `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
 would trigger a pod eviction.
 
+The `kubelet` has the following default hard eviction threshold:
+
+* `--eviction-hard=memory.available<100Mi`
+
 ### Eviction Monitoring Interval
 
 The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
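+
+For illustration only, a hard eviction configuration evaluated on each housekeeping pass could extend the
+default memory threshold to the disk signals as well; the threshold values below are examples, not
+recommended defaults:
+
+```
+--eviction-hard=memory.available<100Mi,nodefs.available<10%,imagefs.available<15%
+```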
@@ -103,6 +130,7 @@ The following node conditions are defined that correspond to the specified evict
 | Node Condition | Eviction Signal | Description |
 |----------------|------------------|------------------------------------------------------------------|
 | `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
+| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem have satisfied an eviction threshold |
 
 The `kubelet` will continue to report node status updates at the frequency specified by
 `--node-status-update-frequency` which defaults to `10s`.
@@ -124,15 +152,44 @@ The `kubelet` would ensure that it has not observed an eviction threshold being
 for the specified pressure condition for the period specified before toggling the
 condition back to `false`.
 
-### Eviction of Pods
+### Reclaiming node level resources
 
 If an eviction threshold has been met and the grace period has passed,
-the `kubelet` will initiate the process of evicting pods until it has observed
-the signal has gone below its defined threshold.
+the `kubelet` will initiate the process of reclaiming the pressured resource
+until it has observed the signal has gone below its defined threshold.
+
+The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
+disk pressure is observed, the `kubelet` reclaims node level resources differently depending on
+whether the machine has a dedicated `imagefs` configured for the container runtime.
+
+#### With Imagefs
+
+If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete dead pods/containers
+
+If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
 
-The `kubelet` ranks pods for eviction 1) by their quality of service,
-2) and among those with the same quality of service by the consumption of the
-starved compute resource relative to the pods scheduling request.
+1. Delete all unused images
+
+#### Without Imagefs
+
+If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete dead pods/containers
+1. Delete all unused images
+
+### Evicting end-user pods
+
+If the `kubelet` is unable to reclaim sufficient resources on the node,
+it will begin evicting pods.
+
+The `kubelet` ranks pods for eviction as follows:
+
+* first, by their quality of service
+* then, among pods with the same quality of service, by the consumption of the starved compute resource relative to the pod's scheduling request.
+
+As a result, pod eviction occurs in the following order:
 
 * `BestEffort` pods that consume the most of the starved resource are failed first.
 
@@ -151,6 +208,49 @@ and the node only has `Guaranteed` pod(s) remaining, then the node must choose t
 `Guaranteed` pod in order to preserve node stability, and to limit the impact of
 the unexpected consumption to other `Guaranteed` pod(s).
 
+Local disk is a `BestEffort` resource. If necessary, `kubelet` will evict pods one at a time to reclaim
+disk when `DiskPressure` is encountered. The `kubelet` will rank pods by quality of service. If the `kubelet`
+is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service first.
+If the `kubelet` is responding to a lack of available disk, it will rank pods within a quality of service
+class by the amount of disk they consume and evict the largest consumers first.
+
+#### With Imagefs
+
+If `nodefs` is triggering evictions, `kubelet` will sort pods based on their usage of `nodefs`
+(local volumes + logs of all their containers).
+
+If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all their containers.
+
+#### Without Imagefs
+
+If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
+(local volumes + logs and writable layers of all their containers).
+
+### Minimum eviction reclaim
+
+In certain scenarios, eviction of pods may reclaim only a small amount of a resource. This can result in
+`kubelet` hitting its eviction thresholds in repeated succession. In addition, eviction of a resource like `disk`
+is time consuming.
+
+To mitigate these issues, `kubelet` can be configured with a per-resource `minimum-reclaim`. Whenever `kubelet` observes
+resource pressure, it will attempt to reclaim at least the `minimum-reclaim` amount of the resource below
+the configured eviction threshold.
+
+For example, with the following configuration:
+
+```
+--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
+--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
+```
+
+If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
+that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
+to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
+work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
+on their associated resources.
+
+The default `eviction-minimum-reclaim` is `0` for all resources.
+
 ### Scheduler
 
 The node will report a condition when a compute resource is under pressure. The
@@ -159,7 +259,8 @@ pods on the node.
 
 | Node Condition   | Scheduler Behavior                               |
 | ---------------- | ------------------------------------------------ |
-| `MemoryPressure` | `BestEffort` pods are not scheduled to the node. |
+| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
+| `DiskPressure` | No new pods are scheduled to the node. |
 
 ## Node OOM Behavior
 
@@ -223,3 +324,46 @@ candidate set of pods provided to the eviction strategy.
 In general, it is strongly recommended that `DaemonSet` not create
 `BestEffort` pods to avoid being identified as a candidate pod for
 eviction. Instead `DaemonSet` should ideally launch `Guaranteed` pods.
+
+## Deprecation of existing feature flags to reclaim disk
+
+`kubelet` has been freeing up disk space on demand to keep the node stable.
+
+As disk-based eviction matures, the following `kubelet` flags will be marked for deprecation
+in favor of the simpler configuration supported around eviction.
+
+| Existing Flag | New Flag |
+| ------------- | -------- |
+| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
+| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
+| `--maximum-dead-containers` | deprecated |
+| `--maximum-dead-containers-per-container` | deprecated |
+| `--minimum-container-ttl-duration` | deprecated |
+| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
+| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
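+
+As a rough sketch of the intended migration, the image garbage collection flags map onto eviction flags
+along these lines; the threshold values are illustrative only, not prescribed equivalents:
+
+```
+# today: image garbage collection flags
+--image-gc-high-threshold=90
+--image-gc-low-threshold=80
+
+# with eviction: an approximately equivalent configuration
+--eviction-hard=imagefs.available<10%
+--eviction-minimum-reclaim=imagefs.available=2Gi
+```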
+
+## Known issues
+
+### kubelet may not observe memory pressure right away
+
+The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
+increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
+will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
+latency, and instead have the kernel tell us immediately when a threshold has been crossed.
+
+If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
+this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
+to prevent system OOMs and promotes eviction of workloads so cluster state can rebalance.
+
+### kubelet may evict more pods than needed
+
+Pod eviction may evict more pods than needed due to a gap in stats collection timing. This can be mitigated in the
+future by adding the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247).
+
+### How kubelet ranks pods for eviction in response to inode exhaustion
+
+At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
+inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cAdvisor
+to track per-container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
+by inode consumption. For example, this would let us identify a container that created large numbers of 0-byte files and evict
+that pod over others.
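+
+Until per-container inode accounting is available, operators can still protect a node from inode exhaustion by
+reserving a margin of free inodes through the eviction thresholds; the percentages below are illustrative only:
+
+```
+--eviction-hard=nodefs.inodesFree<5%,imagefs.inodesFree<5%
+```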