Add documentation for disk based eviction #1196

Merged · merged 1 commit · Sep 21, 2016
162 changes: 153 additions & 9 deletions docs/admin/out-of-resource.md
@@ -31,10 +31,25 @@ summary API.
| Eviction Signal | Description |
|------------------|---------------------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |

Each of the above signals supports either a literal or percentage-based value. The percentage-based value
is calculated relative to the total capacity associated with each signal.

Comment: Does this hold for memory.available too?

Member Author: yes. it holds for all the things.


The `kubelet` supports only two filesystem partitions.

1. The `nodefs` filesystem that the kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. The `kubelet` auto-discovers these filesystems using cAdvisor, and it does not track any
other filesystems. Other configurations are not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated filesystem.

In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
support in favor of eviction in response to disk pressure.

Comment: We'll still do some garbage collection though, right? Saying we're going to deprecate it is confusing.

Contributor: Garbage collection (for both images and containers) is more aggressive than the disk eviction AFAIK. For example, it deletes containers associated with deleted pods, and it also ensures we keep no more than N containers per (pod, container) tuple. Is the plan to fold garbage collection into eviction?

Member Author: yes


### Eviction Thresholds

@@ -47,6 +62,14 @@ Each threshold is of the following form:
* valid `eviction-signal` tokens as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with the `%` token.

For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction threshold can be specified as either
of the following (but not both).

* `memory.available<10%`
* `memory.available<1Gi`
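
As a minimal sketch, the literal form of the threshold above could be passed to the `kubelet` via the
`--eviction-hard` flag that is described later in this doc:

```
--eviction-hard=memory.available<1Gi
```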

Comment: Where are these specified? I assume there's a flag / config for it somewhere? What are the defaults?

Member Author: they are specified as flags on the kubelet that are defined later in this doc. this document does not define defaults. defaults right now are specific to a commercial offering imo.


#### Soft Eviction Thresholds
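
As a hedged sketch (flag names and values are assumptions for illustration), a soft threshold pairs an
eviction signal with a grace period that must be exceeded before pods are evicted, e.g.:

```
--eviction-soft=memory.available<1.5Gi
--eviction-soft-grace-period=memory.available=1m30s
```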

@@ -84,6 +107,10 @@ To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
would trigger a pod eviction.

The `kubelet` has the following default hard eviction thresholds:

* `--eviction-hard=memory.available<100Mi`
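
A sketch of a hard eviction configuration that also covers the disk signals described above (the values are
illustrative assumptions, not recommended defaults):

```
--eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%
```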

### Eviction Monitoring Interval

The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
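
The housekeeping interval is itself configurable; a minimal sketch, assuming the kubelet's
`--housekeeping-interval` flag and its `10s` default:

```
--housekeeping-interval=10s
```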
@@ -103,6 +130,7 @@ The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|----------------|------------------|------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
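
A quick way to check whether a node is currently reporting one of these conditions (a sketch using
`kubectl`; the node name is assumed):

```
kubectl get node node-1 -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
```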
@@ -124,15 +152,44 @@ The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
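
A sketch of the flag that controls this dampening period (the value is an assumption for illustration; the
flag also appears in the flag table later in this doc):

```
--eviction-pressure-transition-period=5m
```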

### Reclaiming node level resources

If an eviction threshold has been met and the grace period has passed,
the `kubelet` will initiate the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.

The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently if the
machine has a dedicated `imagefs` configured for the container runtime.

#### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers

If the `imagefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete all unused images

#### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers
1. Delete all unused images

### Evicting end-user pods

If the `kubelet` is unable to reclaim sufficient resource on the node,
it will begin evicting pods.

The `kubelet` ranks pods for eviction as follows:

* by their quality of service
* by the consumption of the starved compute resource relative to the pod's scheduling request.

As a result, pod eviction occurs in the following order:

* `BestEffort` pods that consume the most of the starved resource are failed
first.
@@ -151,6 +208,49 @@ and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
`Guaranteed` pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other `Guaranteed` pod(s).

Local disk is a `BestEffort` resource. If necessary, the `kubelet` will evict pods one at a time to reclaim
disk when `DiskPressure` is encountered. The `kubelet` ranks pods by quality of service. If the `kubelet`
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
first. If the `kubelet` is responding to a lack of available disk, it will rank pods within a quality of service
class by the amount of disk they consume and kill the largest consumers first.

#### With Imagefs

If `nodefs` is triggering evictions, the `kubelet` will sort pods based on their usage of `nodefs`
(local volumes plus the logs of all their containers).

If `imagefs` is triggering evictions, the `kubelet` will sort pods based on the writable layer usage of all their containers.

#### Without Imagefs

If `nodefs` is triggering evictions, the `kubelet` will sort pods based on their total disk usage
(local volumes plus the logs and writable layers of all their containers).

### Minimum eviction reclaim

In certain scenarios, eviction of pods could reclaim only a small amount of a resource. This can result in
the `kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming a resource like `disk`
is time consuming.

To mitigate these issues, the `kubelet` can be configured with a per-resource `minimum-reclaim`. Whenever the `kubelet` observes
resource pressure, it will attempt to reclaim at least the `minimum-reclaim` amount of the resource beyond
the configured eviction threshold.

For example, with the following configuration:

```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```

If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.

The default `eviction-minimum-reclaim` is `0` for all resources.

### Scheduler

The node will report a condition when a compute resource is under pressure. The
@@ -159,7 +259,8 @@ pods on the node.

| Node Condition | Scheduler Behavior |
| ---------------- | ------------------------------------------------ |
| `MemoryPressure` | `BestEffort` pods are not scheduled to the node. |
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
| `DiskPressure` | No new pods are scheduled to the node. |

## Node OOM Behavior

@@ -223,3 +324,46 @@ candidate set of pods provided to the eviction strategy.
In general, it is strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead `DaemonSet` should ideally launch `Guaranteed` pods.

## Deprecation of existing feature flags to reclaim disk

`kubelet` has been freeing up disk space on demand to keep the node stable.

As disk based eviction matures, the following `kubelet` flags will be marked for deprecation
in favor of the simpler configuration supported around eviction.

Contributor: I like simplifying the configuration, just one question. Will eviction only happen when the threshold is met? More aggressive, periodic cleanup seems desirable IMO.

Member Author: we should discuss this in a design issue, it's tangential to the doc. what's documented is the current plan of record.

| Existing Flag | New Flag |
| ------------- | -------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
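
A hedged sketch of what migrating from the existing flags to the eviction flags could look like (all values
are illustrative assumptions, not recommended defaults):

```
# before: image garbage collection and low-diskspace flags
--image-gc-high-threshold=90
--image-gc-low-threshold=80
--low-diskspace-threshold-mb=1024

# after: equivalent intent expressed as eviction thresholds
--eviction-hard=imagefs.available<10%,nodefs.available<1Gi
--eviction-minimum-reclaim=imagefs.available=10%
```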

## Known issues

### kubelet may not observe memory pressure right away

The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases within that window rapidly, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, and instead have the kernel tell us when a threshold has been crossed immediately.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs and promotes eviction of workloads so cluster state can rebalance.
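
As an illustration of that workaround (node capacity assumed), a node with `10Gi` of memory could trigger
eviction once roughly 75% of its memory is in use with either form of the threshold:

```
# either of the following (percentage or literal form):
--eviction-hard=memory.available<25%
--eviction-hard=memory.available<2.5Gi
```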

### kubelet may evict more pods than needed

Pod eviction may evict more pods than needed due to a stats collection timing gap. This can be mitigated by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.

### How kubelet ranks pods for eviction in response to inode exhaustion

At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
to track per container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created a large number of 0-byte files, and evict
that pod over others.

Contributor: How does it rank pods in the same QoS class?

Member Author: this is addressed under the "Known issues" section "How kubelet ranks pods for eviction in response to inode exhaustion"; in the future, we want it to rank by consumption when we can know it.

Member Author: to be clear, it does not differentiate within a class, so it's random.