# Add documentation for disk based eviction #1196
@@ -31,10 +31,25 @@ summary API.
| Eviction Signal | Description |
|------------------|---------------------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |

Each of the above signals supports either a literal or percentage-based value. The percentage-based value
is calculated relative to the total capacity associated with each signal.
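
For example (the capacity figure here is hypothetical), on a node whose `nodefs` capacity is `100Gi`, the
following two thresholds are equivalent:

```
--eviction-hard=nodefs.available<10%
--eviction-hard=nodefs.available<10Gi
```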

`kubelet` supports only two filesystem partitions:

1. The `nodefs` filesystem that the kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor, and does not care about any
other filesystems. Any other configuration is not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated filesystem.

In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
support in favor of eviction in response to disk pressure.

> **Review comment:** We'll still do some garbage collection though, right? Saying we're going to deprecate it is confusing.
>
> **Review comment:** Garbage collection (for both images and containers) is more aggressive than the disk eviction, AFAIK. For example, it deletes containers associated with deleted pods, and it also ensures we keep no more than N containers per (pod, container) tuple. Is the plan to fold garbage collection into eviction?
>
> **Reply:** Yes.

### Eviction Thresholds
@@ -47,6 +62,14 @@ Each threshold is of the following form:
* valid `eviction-signal` tokens as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with the `%` token.

For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction threshold can be specified as either
of the following (but not both).

* `memory.available<10%`
* `memory.available<1Gi`

> **Review comment:** Where are these specified? I assume there's a flag / config for it somewhere? What are the defaults?
>
> **Reply:** They are specified as flags on the kubelet that are defined later in this doc. This document does not define defaults; defaults right now are specific to a commercial offering, IMO.
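
As a sketch of what the Kubernetes quantity representation means in practice (assuming the
`k8s.io/apimachinery/pkg/api/resource` package, where quantity parsing lives; this package is an
assumption, not something defined by this doc), the `1Gi` token used above is a binary-SI quantity:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Parse the same quantity token used in an eviction threshold
	// such as `memory.available<1Gi`.
	q := resource.MustParse("1Gi")

	// Value() reports the quantity in base units (bytes): 1Gi = 2^30.
	fmt.Println(q.Value()) // 1073741824
}
```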

#### Soft Eviction Thresholds
@@ -84,6 +107,10 @@ To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that, if met,
would trigger a pod eviction.

The `kubelet` has the following default hard eviction threshold:

* `--eviction-hard=memory.available<100Mi`

### Eviction Monitoring Interval

The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
@@ -103,6 +130,7 @@ The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|----------------|------------------|------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
@@ -124,15 +152,44 @@ The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.

### Reclaiming node level resources

If an eviction threshold has been met and the grace period has passed,
the `kubelet` will initiate the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.

The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently depending on whether the
machine has a dedicated `imagefs` configured for the container runtime.
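
For example (the threshold values here are hypothetical), a node with a dedicated `imagefs` might carry a
separate hard threshold for each filesystem:

```
--eviction-hard=nodefs.available<1Gi,imagefs.available<10Gi
```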

#### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers

If the `imagefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete all unused images

#### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers
1. Delete all unused images

### Evicting end-user pods

If the `kubelet` is unable to reclaim sufficient resource on the node,
it will begin evicting pods.

The `kubelet` ranks pods for eviction as follows:

* by their quality of service
* by the consumption of the starved compute resource relative to the pod's scheduling request

As a result, pod eviction occurs in the following order:

* `BestEffort` pods that consume the most of the starved resource are failed
first.

@@ -151,6 +208,49 @@ and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
`Guaranteed` pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other `Guaranteed` pod(s).
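
A minimal sketch of this ranking (illustrative only, not the kubelet's implementation; the QoS encoding,
type, and field names are assumptions made for the example):

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical stand-in for a pod considered for eviction.
type candidate struct {
	name    string
	qos     int   // assumed encoding: 0 = BestEffort, 1 = Burstable, 2 = Guaranteed
	usage   int64 // observed usage of the starved resource
	request int64 // the pod's scheduling request for that resource (0 if none)
}

// rankForEviction orders candidates so the first entries are evicted first:
// lower quality of service first, then, within a class, the pods consuming
// the most of the starved resource relative to their scheduling request.
func rankForEviction(pods []candidate) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].qos != pods[j].qos {
			return pods[i].qos < pods[j].qos
		}
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []candidate{
		{name: "guaranteed-db", qos: 2, usage: 900, request: 1000},
		{name: "besteffort-batch", qos: 0, usage: 400, request: 0},
		{name: "burstable-web", qos: 1, usage: 700, request: 500},
	}
	rankForEviction(pods)
	fmt.Println(pods) // besteffort-batch ranks first for eviction, guaranteed-db last
}
```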

Local disk is a `BestEffort` resource. If necessary, the `kubelet` will evict pods one at a time to reclaim
disk when `DiskPressure` is encountered. The `kubelet` ranks pods by quality of service. If the `kubelet`
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
first. If the `kubelet` is responding to lack of available disk, it will rank pods within a quality of service
class by their disk consumption, and kill the largest consumers first.

#### With Imagefs

If `nodefs` is triggering evictions, the `kubelet` will sort pods based on their usage of `nodefs`:
local volumes + logs of all their containers.

If `imagefs` is triggering evictions, the `kubelet` will sort pods based on the writable layer usage of all their containers.

#### Without Imagefs

If `nodefs` is triggering evictions, the `kubelet` will sort pods based on their total disk usage:
local volumes + logs & writable layer of all their containers.

### Minimum eviction reclaim

In certain scenarios, eviction of pods may reclaim only a small amount of a resource, which can result in the
`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming some resources, such as `disk`,
is time consuming.

To mitigate these issues, the `kubelet` supports a per-resource `minimum-reclaim`. Whenever the `kubelet` observes
resource pressure, it will attempt to reclaim the resource until the signal is at least `minimum-reclaim`
beyond the configured eviction threshold.

For example, with the following configuration:

```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```

If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.
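
Put another way, for each signal the `kubelet` reclaims until the signal's value reaches:

```
reclaim target = eviction threshold + eviction-minimum-reclaim
```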

The default `eviction-minimum-reclaim` is `0` for all resources.

### Scheduler
The node will report a condition when a compute resource is under pressure. The | ||
|
@@ -159,7 +259,8 @@ pods on the node. | |
|
||
| Node Condition | Scheduler Behavior |
| ---------------- | ------------------------------------------------ |
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
| `DiskPressure` | No new pods are scheduled to the node. |

## Node OOM Behavior
@@ -223,3 +324,46 @@ candidate set of pods provided to the eviction strategy.
In general, it is strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead `DaemonSet` should ideally launch `Guaranteed` pods.

## Deprecation of existing feature flags to reclaim disk

`kubelet` has been freeing up disk space on demand to keep the node stable.

As disk based eviction matures, the following `kubelet` flags will be marked for deprecation
in favor of the simpler configuration supported around eviction.

> **Review comment:** I like simplifying the configuration, just one question: will eviction only happen when the threshold is met? More aggressive, periodic cleanup seems desirable IMO.
>
> **Reply:** We should discuss this in a design issue; it's tangential to the doc. What's documented is the current plan of record.
| Existing Flag | New Flag |
| ------------- | -------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |

## Known issues

### kubelet may not observe memory pressure right away

The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, and instead have the kernel tell us immediately when a threshold has been crossed.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs, and promotes eviction of workloads so cluster state can rebalance.
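
For example (an illustrative value; tune it to your workloads), to have the `kubelet` report memory pressure
once roughly 75% of node memory is consumed, set the threshold on the remaining 25%:

```
--eviction-hard=memory.available<25%
```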

### kubelet may evict more pods than needed

Pod eviction may evict more pods than needed due to a timing gap in stats collection. This can be mitigated in the
future by adding the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247).

### How kubelet ranks pods for eviction in response to inode exhaustion

At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
to track per container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
that pod over others.

> **Review comment:** How does it rank pods in the same QoS class?
>
> **Reply:** This is addressed under the "Known issues" section, "How kubelet ranks pods for eviction in response to inode exhaustion". In the future, we want it to rank by consumption when we can know it.
>
> **Reply:** To be clear, it does not differentiate within a class, so it's random.

> **Review comment:** Does this hold for `memory.available` too?
>
> **Reply:** Yes, it holds for all the things.