- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Currently, Jobs and Deployments start replacement Pods as soon as existing Pods are marked for termination.
This KEP proposes a new field for the Job, Deployment and ReplicaSet controllers that counts terminating
pods as active. The goal of this KEP is to allow for opt-in behavior where terminating pods count as active.
Existing Issues:
- Job Creates Replacement Pods as soon as Pod is marked for deletion
- Option for acknowledging terminating Pods in Deployment rolling update
- Kueue: Account for terminating pods when doing preemption
Many common machine learning frameworks, such as TensorFlow, require unique pods. Starting replacement pods while the pods they replace are still terminating can cause errors. This is a rare case, but it can cause problems if a job needs to guarantee that the existing pods terminate before new pods start.
In Option for acknowledging terminating Pods in Deployment rolling update, there is a request for the Deployment API to guarantee that the number of replicas includes terminating pods. Terminating pods still consume resources because resources remain allocated to them, and a user may be charged for those resources.
In scarce compute environments, these resources can be difficult to obtain, so pods can take a long time to be scheduled and may only find nodes once the existing pods have terminated.
- Job Controller should only create new pods once the existing ones are marked as Failed/Succeeded
- Deployment controller should allow for flexibility in waiting for pods to be fully terminated before creating new ones
- DaemonSets and StatefulSets are not included in this proposal
- They were designed to enforce uniqueness from the start so we will not include them in this design.
Both Jobs and the ReplicaSet controller get a list of active pods. Active pods usually mean pods that have not been registered for deletion. In this KEP, we want to include terminating pods as active pods.
We will propose two new API fields in Jobs and Deployments/ReplicaSets in this KEP.
As a machine learning user, I use ML frameworks that schedule multiple pods.
The Job controller does not typically wait for terminating pods to be marked as failed. TensorFlow and other ML frameworks may require that Pods are started only once the other pods have fully terminated. Setting the proposed terminatingAsActive field to true on the Job can fit these needs.
This case was added due to a bug discovered with running IndexedJobs with Tensorflow. See Jobs create replacement Pods as soon as a Pod is marked for deletion for more details.
As a cloud user, I want to guarantee that the number of running pods is exactly the number that I specify. Terminating pods do not relinquish resources, so scarce compute resources are still allocated to those pods. See Kueue: Account for terminating pods when doing preemption for an example of this.
As a cloud user, I want to guarantee that the number of running pods includes terminating pods. In scarce compute environments, users may only have a limited number of nodes and do not want to try to schedule pods onto new resources. Counting terminating pods as active makes the creation of new pods wait until the existing pods have terminated.
See Option for acknowledging terminating Pods in Deployment rolling update for more examples.
The Deployment API is open for discussion. We put the field in Deployment/ReplicaSet because it is related to RolloutStrategy.
It is not clear whether recreate, rollingupdate, or both rollout options need this API.
Another open question is whether we want to include Deployments in the initial release of this feature. There is some discussion about releasing the Job API first and then following up with Deployment.
We decided to define the APIs in this KEP as they can utilize the same implementation.
With 3329-retriable-and-non-retriable-failures and PodFailurePolicy enabled, terminating pods are only marked as failed once they have been transitioned to failed. If PodFailurePolicy is disabled, then we mark a terminating pod as failed as soon as deletion is registered.
Should we add a new field to the status that reflects terminating pods?
Job controller should wait for Pods to be in a terminal phase before considering them failed or succeeded is a relevant issue for this case.
I am not sure how to handle these two different cases if we want to count terminating pods as active.
Should we use this feature to help solve 116858? When this feature toggle is on, we would mark terminating pods as failed only once they are complete, regardless of PodFailurePolicy.
- TerminatingAsActive
- ActiveUntilTerminal
- DelayPodRecreationUntilTerminal
- ?
At the JobSpec level, we are adding a new BoolPtr field:
type JobSpec struct {
	...
	// terminatingAsActive specifies whether the Job controller should count
	// terminating pods as active. If the field is true, the Job controller
	// treats both running and terminating pods as active.
	// +optional
	TerminatingAsActive *bool
}
// DeploymentStrategy stores information about the strategy and rolling-update
// behavior of a deployment.
type DeploymentStrategy struct {
	...
	// TerminatingAsActive specifies whether the Deployment should count
	// terminating pods as active. If the field is true, the Deployment
	// controller treats both running and terminating pods as active.
	// +optional
	TerminatingAsActive *bool
}
In Option for acknowledging terminating Pods in Deployment rolling update there was a request to add this as part of the DeploymentStrategy field. Generally, handling terminating pods as active can be useful in both RollingUpdates and Recreating rollouts. Having this field for both strategies allows for handling of terminating pods in both cases.
Deployments create ReplicaSets so there is a need to add a field in the ReplicaSet as well. Since ReplicaSets are not typically set by users, we should add a field to the ReplicaSet that is set from the DeploymentSpec.
// ReplicaSetSpec is the specification of a ReplicaSet.
// As the internal representation of a ReplicaSet, it must have
// a Template set.
type ReplicaSetSpec struct {
	...
	// TerminatingAsActive specifies whether the ReplicaSet should count
	// terminating pods as active. If the field is true, the ReplicaSet
	// controller treats both running and terminating pods as active.
	// +optional
	TerminatingAsActive *bool
}
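Because the field is an optional pointer, a controller has to decide what an unset value means. The sketch below assumes that an unset field keeps today's behavior, which matches the opt-in goal of this KEP. The JobSpec type here is a stand-in that models only the proposed field, and countTerminatingAsActive is a hypothetical helper name, not part of any existing API.

```go
package main

import "fmt"

// JobSpec is a stand-in for the real JobSpec; only the proposed field
// is modelled here.
type JobSpec struct {
	// TerminatingAsActive is optional; nil means the field was not set.
	TerminatingAsActive *bool
}

// countTerminatingAsActive is a hypothetical helper. It assumes that an
// unset field means "do not count terminating pods as active".
func countTerminatingAsActive(spec JobSpec) bool {
	return spec.TerminatingAsActive != nil && *spec.TerminatingAsActive
}

func main() {
	on := true
	fmt.Println(countTerminatingAsActive(JobSpec{}))                         // false
	fmt.Println(countTerminatingAsActive(JobSpec{TerminatingAsActive: &on})) // true
}
```

The same nil check would apply to the DeploymentStrategy and ReplicaSetSpec fields.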
Generally, both the Job controller and the ReplicaSet controller use FilterActivePods in their reconciliation loops. FilterActivePods returns the pods that are not terminating. This KEP changes it to optionally include terminating pods in that list.
// FilterActivePods returns pods that have not terminated. When terminatingPods
// is true, pods that are still terminating are returned as well.
func FilterActivePods(pods []*v1.Pod, terminatingPods bool) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) || (terminatingPods && IsPodTerminating(p)) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}

// IsPodTerminating reports whether the pod has been marked for deletion but
// has not yet reached a terminal phase (Succeeded or Failed).
func IsPodTerminating(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp != nil
}
The Job controller uses this list to determine whether the number of active pods matches the value expected from the JobSpec.
Including terminating pods in this list allows the Job controller to wait until those pods have terminated before creating replacements.
See Filter Active Pods Usage in Job Controller for where the Job controller filters the active pods.
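To make the effect on pod creation concrete, here is a deliberately simplified sketch of the bookkeeping. The real Job controller also accounts for completions, backoff and other state; podsToCreate and its parameters are hypothetical names used only for illustration.

```go
package main

import "fmt"

// podsToCreate is a hypothetical simplification: given the wanted number of
// pods and the current running and terminating counts, it returns how many
// new pods would be created. terminatingAsActive mirrors the proposed field.
func podsToCreate(wanted, running, terminating int, terminatingAsActive bool) int {
	active := running
	if terminatingAsActive {
		active += terminating
	}
	if diff := wanted - active; diff > 0 {
		return diff
	}
	return 0
}

func main() {
	// Three pods wanted; one has been marked for deletion but is still shutting down.
	fmt.Println(podsToCreate(3, 2, 1, false)) // 1: replacement created immediately
	fmt.Println(podsToCreate(3, 2, 1, true))  // 0: controller waits
	// The terminating pod is now gone.
	fmt.Println(podsToCreate(3, 2, 0, true)) // 1: replacement created
}
```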
For Deployments, the ReplicaSet controller filters the active pods. The implementation should read the field from the Deployment and set the same field on the ReplicaSets it creates.
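One possible shape for that propagation is sketched below. The types are stand-ins that model only the proposed fields, and newReplicaSetSpec is a hypothetical helper; the Deployment controller's actual ReplicaSet construction is more involved.

```go
package main

import "fmt"

// Stand-in types: only the proposed fields are modelled.
type DeploymentStrategy struct {
	TerminatingAsActive *bool
}

type DeploymentSpec struct {
	Strategy DeploymentStrategy
}

type ReplicaSetSpec struct {
	TerminatingAsActive *bool
}

// newReplicaSetSpec is a hypothetical helper that copies the setting from the
// Deployment into the ReplicaSet. The value is copied rather than the pointer
// so that the two specs never share memory.
func newReplicaSetSpec(d DeploymentSpec) ReplicaSetSpec {
	rs := ReplicaSetSpec{}
	if v := d.Strategy.TerminatingAsActive; v != nil {
		copied := *v
		rs.TerminatingAsActive = &copied
	}
	return rs
}

func main() {
	on := true
	d := DeploymentSpec{Strategy: DeploymentStrategy{TerminatingAsActive: &on}}
	rs := newReplicaSetSpec(d)
	fmt.Println(*rs.TerminatingAsActive) // true
}
```

An unset Deployment field stays unset on the ReplicaSet, so the ReplicaSet controller can apply its own default.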
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- controller_utils: April 3rd 2023 - 56.6
- replicaset: April 3rd 2023 - 78.5
- deployment: April 3rd 2023 - 66.4
- job: April 3rd 2023 - 90.4
We will add the following integration test for the Job controller:
TerminatingAsActive Feature Toggle On:
- NonIndexedJob starts pods that take a while to terminate
- Delete pods
- Verify that pod creation only occurs once terminating pods are removed
We should test the above with the FeatureToggle off also.
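The expectation behind this test can be sketched without a cluster. The pod type below is a stand-in for v1.Pod that carries only the state the filter inspects, and filterActive mirrors the FilterActivePods logic from Design Details on that stand-in; both names are illustrative only.

```go
package main

import "fmt"

// pod is a stand-in for v1.Pod with just the state the filter looks at.
type pod struct {
	name     string
	finished bool // phase is Succeeded or Failed
	deleting bool // DeletionTimestamp is set
}

// filterActive keeps pods that are not finished and, unless
// includeTerminating is true, not marked for deletion.
func filterActive(pods []pod, includeTerminating bool) []pod {
	var out []pod
	for _, p := range pods {
		if p.finished {
			continue
		}
		if !p.deleting || includeTerminating {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Three pods, one of which has been deleted but is still terminating.
	pods := []pod{{name: "a"}, {name: "b"}, {name: "c", deleting: true}}
	for _, toggle := range []bool{true, false} {
		fmt.Printf("toggle=%v active=%d\n", toggle, len(filterActive(pods, toggle)))
	}
}
```

With the toggle on, the active count stays at three and no replacement is needed; with it off, the count drops to two as soon as the pod is marked for deletion.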
We will add a similar integration test for Deployment:
- Job controller includes terminating pods as active
- Deployment strategy optionally includes terminating pods as active
- Unit Tests
- Initial e2e tests
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: TerminatingAsActive
  - Components depending on the feature gate: kube-controller-manager
Yes, terminating pods are included in the active pod count returned by FilterActivePods.
This means that Deployments and Jobs with the field enabled will only create new pods once the existing pods have terminated.
This could potentially make deployments slower.
Yes.
Terminating pods will be dropped from the active list and we will revert to the old behavior. This means that terminating pods will be considered deleted and new pods will be created.
Yes. Unit tests will include the fields off/on and verify behavior.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
If a user terminates pods that are controlled by a deployment/job, then we should wait until the existing pods are terminated before starting new ones.
We will add e2e tests that verify this.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
NA
Are there any missing metrics that would be useful to have to improve observability of this feature?
This feature is closely related to 3329-retriable-and-non-retriable-failures, but we are not sure whether that is considered a dependency.
No
Generally, enabling this will slow down rollouts if pods take a long time to terminate, because we wait to create new pods until the existing ones are terminated.
No
We add TerminatingAsActive to JobSpec, DeploymentStrategy and ReplicaSetSpec. This is a boolPtr.
No
For the Job API, we are adding a BoolPtr field named TerminatingAsActive, which is a boolPtr of 8 bytes.
- API type(s): boolPtr
- Estimated increase in size: 8B
ReplicaSet and Deployment have two additions:
- API type(s): boolPtr (DeploymentStrategy and ReplicaSetSpec)
- Estimated increase in size: 16B (2 x 8B)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Enabling this feature may make rollouts slower.
We discussed having this under the PodFailurePolicy but this is a more general idea than the PodFailurePolicy.
NA