KEP-2268: move non-graceful node shutdown to GA #4029

Merged
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-storage/2268.yaml
@@ -2,4 +2,6 @@ kep-number: 2268
alpha:
  approver: "@deads2k"
beta:
  approver: "@deads2k"
stable:
  approver: "@deads2k"
81 changes: 60 additions & 21 deletions keps/sig-storage/2268-non-graceful-shutdown/README.md
@@ -1,6 +1,4 @@
# Non graceful node shutdown

This includes the Summary and Motivation sections.
# KEP-2268: Non graceful node shutdown

## Table of Contents

@@ -41,20 +39,20 @@ This includes the Summary and Motivation sections.
## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [X] (R) KEP approvers have approved the KEP status as `implementable`
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [X] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [X] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation - e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
- [X] (R) Production readiness review completed
- [X] (R) Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [X] Supporting documentation - e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).
@@ -146,7 +144,7 @@ To mitigate this we plan to have a high test coverage and to introduce this enha

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

@@ -386,20 +384,38 @@ logs or events for this purpose.
The usage of this feature requires the manual step of applying a taint
so the operator should be the one applying it.
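
For illustration, applying and later removing the taint with `kubectl` typically looks like the following sketch (the node name is a placeholder):

```sh
# Apply the out-of-service taint only after confirming the node is actually
# shut down or otherwise unrecoverable; this lets the control plane force-delete
# the node's pods and detach their volumes so the workload can fail over.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint once the node has been recovered or removed from the cluster.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```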

###### How can someone using this feature know that it is working for their instance?

<!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one or more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->

- [X] API .status
If the feature works, pods belonging to the stateful workload should be re-scheduled to
another running node, and the `Phase` in the new Pod's `Status` should be `Running` on
that node (see the sketch below).
If not, check the pod status to see why it did not come up.
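
A minimal sketch of that check, assuming a hypothetical StatefulSet named `web` labeled `app=web`:

```sh
# After tainting the shut-down node, the replacement pod should land on a
# different node and reach the Running phase.
kubectl get pods -l app=web -o wide                         # NODE column should show a healthy node
kubectl get pod web-0 -o jsonpath='{.status.phase}{"\n"}'   # expected output: Running

# If the pod is not Running, inspect its status and events for the reason.
kubectl describe pod web-0
```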

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one or more of these and delete the rest.
-->
- [X] Metrics
- Metric name:
- We can add new metrics `deleting_pods_total`, `deleting_pods_error_total`
in Pod GC Controller.
For Attach Detach Controller, there's already a metric:
`attachdetach_controller_forced_detaches`.
It is also useful to know how many nodes have taints. We can explore with [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) which generates metrics about the state of the objects.
- New metrics are added in Pod GC Controller:
- `force_delete_pods_total{reason="out-of-service|terminated|orphaned|unscheduled"}`, the number of pods that have been forcefully deleted since the Pod GC Controller started.
- `force_delete_pod_errors_total{reason="out-of-service|terminated|orphaned|unscheduled"}`, the number of errors encountered when forcefully deleting the pods since the Pod GC Controller started.
- For Attach Detach Controller, the following metric will be recorded if a force detach is performed because the node has the `out-of-service` taint or a timeout happens:
- `attachdetach_controller_forced_detaches{reason="out-of-service|timeout"}`, the number of times the Attach Detach Controller performed a forced detach.
- There is also a `kube_node_spec_taint` in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/node-metrics.md) that is a metric for the taint of a Kubernetes cluster node.
- [Optional] Aggregation method:
- Components exposing the metric:
- Components exposing the metric: kube-controller-manager
- [X] Other (treat as last resort)
- Details:
- Check whether the workload moved to a different running node
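
As a rough, deployment-dependent sketch of how an operator might read the counters listed above (ports, endpoints, and credentials vary by control plane setup; `$TOKEN` is a placeholder bearer token):

```sh
# kube-controller-manager exposes metrics on its secure port (10257 by default);
# the exact scrape path and auth depend on how the control plane is deployed.
curl -sk -H "Authorization: Bearer ${TOKEN}" https://<control-plane-host>:10257/metrics \
  | grep -E 'force_delete_pods_total|force_delete_pod_errors_total|attachdetach_controller_forced_detaches'

# kube_node_spec_taint comes from kube-state-metrics (assuming it is installed
# in kube-system and serves on its default port 8080).
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep 'kube_node_spec_taint.*out-of-service'
```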
@@ -490,6 +506,13 @@ For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->

Without this feature, a user can forcefully delete the pods after they are
stuck in the terminating state, and new pods will be re-scheduled to another
running node after 6 minutes. With this feature, new pods are re-scheduled
to another running node without the 6-minute wait once the user has applied
the `out-of-service` taint. This speeds up failover but should not affect
scalability.
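
For comparison, the manual workaround referred to above is a per-pod force deletion, roughly:

```sh
# Without this feature, each stuck pod on the shut-down node has to be
# force-deleted by hand before its controller creates a replacement elsewhere.
kubectl delete pod <pod-name> --force --grace-period=0
```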

###### Will enabling / using this feature result in any new API calls?

<!--
@@ -560,6 +583,19 @@ Think through this both in small and large cases, again with respect to the
-->
No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?

Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
No.

### Troubleshooting

<!--
@@ -648,6 +684,9 @@ For each of them, fill in the following information by copying the below templat
- 2020-11-10: KEP updated to handle part of the node partitioning
- 2021-08-26: The scope of the KEP is narrowed down to handle a real node shutdown. Test plan is updated. Node partitioning will be handled in the future and it can be built on top of this design.
- 2021-12-03: Removed `SafeDetach` flag. Requires a user to add the `out-of-service` taint when he/she knows the node is shut down.
- Kubernetes v1.24: moved to alpha.
- Kubernetes v1.26: moved to beta.
- Kubernetes v1.28: moved to stable.

## Alternatives

15 changes: 12 additions & 3 deletions keps/sig-storage/2268-non-graceful-shutdown/kep.yaml
@@ -20,19 +20,28 @@ see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: beta
stage: stable

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.26"
latest-milestone: "v1.28"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.24"
  beta: "v1.26"
  stable: "v1.27"
  stable: "v1.28"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: NodeOutOfServiceVolumeDetach
    components:
      - kube-controller-manager
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - force_delete_pods_total{reason="out-of-service|terminated|orphaned|unscheduled"}
  - force_delete_pod_errors_total{reason="out-of-service|terminated|orphaned|unscheduled"}
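
For completeness, a sketch of how the gate was enabled on kube-controller-manager before it graduated (at GA the gate is on by default, so no flag is needed):

```sh
# Pre-GA releases only: the feature gate had to be enabled explicitly on the
# kube-controller-manager command line; other flags are omitted here.
kube-controller-manager --feature-gates=NodeOutOfServiceVolumeDetach=true
```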