fix: issue#723 -- Do-not-evict only for running pods #778

mallow111 · 2023-11-11T00:54:25Z

Fixes #723

Description
If a pod has the do-not-evict annotation and is also in ready status, we treat it as non-evictable. Otherwise, if the pod is not in ready status, we can still evict it even if it has the do-not-evict annotation.

How was this change tested?
I tested this in my local cluster.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

jonathan-innis · 2023-11-11T03:57:47Z

pkg/utils/pod/scheduling.go

@@ -69,6 +69,15 @@ func IsOwnedByNode(pod *v1.Pod) bool {
 	})
 }

+func IsPodReady(pod *v1.Pod) bool {
+	for _, condition := range pod.Status.Conditions {
+		if condition.Type != v1.PodReady {


I think you intended to check that condition.Type is v1.PodReady and then check whether that condition's status is true, rather than checking whether there is any condition that isn't the PodReady condition.
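A minimal sketch of the check described in the comment above. To keep it compilable on its own, the `Pod`, `PodCondition`, and related types here are simplified stand-ins for the `k8s.io/api/core/v1` types (an assumption for illustration); this is not the code that was eventually pushed to the PR.

```go
package main

import "fmt"

// Simplified stand-ins for the k8s.io/api/core/v1 types (assumption: only
// the fields needed for the readiness check are modelled here).
type ConditionStatus string
type PodConditionType string

const (
	ConditionTrue  ConditionStatus  = "True"
	ConditionFalse ConditionStatus  = "False"
	PodReady       PodConditionType = "Ready"
	PodScheduled   PodConditionType = "PodScheduled"
)

type PodCondition struct {
	Type   PodConditionType
	Status ConditionStatus
}

type PodStatus struct {
	Conditions []PodCondition
}

type Pod struct {
	Status PodStatus
}

// IsPodReady finds the Ready condition and reports whether its status is True.
// Conditions of any other type are skipped instead of deciding the result.
func IsPodReady(pod *Pod) bool {
	for _, condition := range pod.Status.Conditions {
		if condition.Type == PodReady {
			return condition.Status == ConditionTrue
		}
	}
	// No Ready condition has been reported yet, so the pod is not ready.
	return false
}

func main() {
	pod := &Pod{Status: PodStatus{Conditions: []PodCondition{
		{Type: PodScheduled, Status: ConditionTrue},
		{Type: PodReady, Status: ConditionFalse},
	}}}
	fmt.Println("ready:", IsPodReady(pod))
}
```

With the original `!=` comparison, the `PodScheduled` condition in this example would have been matched first, which is the problem the review comment points out.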

jonathan-innis · 2023-11-27T22:25:52Z

/assign jmdeal

k8s-ci-robot · 2023-11-28T00:40:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mallow111
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jmdeal · 2023-12-05T19:29:28Z

There are a few concerns that were laid out in the original issue that I don't think are addressed here. The main concern with the original feature request is that it isn't clear which container states are non-recoverable. We don't want to ignore the do-not-evict annotation if the pod could eventually become ready. Otherwise, the pod may become ready between the time Karpenter chooses to disrupt the node and the time it is removed. If we wait for the pod to become ready before respecting the do-not-evict annotation, this could happen frequently during normal operation, for example while the container image is being pulled. Do you have any thoughts on how to address these concerns?

github-actions · 2023-12-20T12:01:40Z

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

mallow111 · 2023-12-20T17:18:31Z

@jmdeal @jonathan-innis what fix do you recommend here? The initial issue has been discussed back and forth, but there is no solid solution there. What I am proposing is to make sure the do-not-evict annotation only applies to running pods. Pods can be not ready for various reasons; if we want to set a waiting period, how long should it be? And is it worth the effort to wait for those unready pods?

jmdeal · 2023-12-20T17:54:37Z

Our stance was that Karpenter should only ignore the do-not-evict annotation in cases where it is certain that the pod will not become ready. Currently Karpenter does this when a pod is in a terminal state, i.e. Succeeded / Failed, or if it is terminating (ref).

We don't want to only consider the annotation for running pods because it is possible that a pending pod becomes ready between the time of the disruption decision and the time that the pod is removed. IMO what this PR would have to do is determine a set of container states that are truly unrecoverable. Then, if a pod is pending but one of its containers is in an unrecoverable state, we know we could disrupt it. I haven't looked into this deeply myself, but from @njtran's comment on the original issue it doesn't look like there's a standard set of enum values for this. Without this standard set of 'unrecoverable values', anything Karpenter does here would be pretty opinionated and potentially dangerous.

mallow111 · 2023-12-20T18:31:07Z

If there is no standard set of enum values, is there a way to move this change forward? Shall we put the issue on hold until such a standard comes out? Otherwise I cannot see how we should proceed with the fix.

jmdeal · 2023-12-20T18:36:14Z

Ya, I don't think we can move forward with this unless this core concern is addressed. Like I said, I haven't looked too deeply into it myself so it may be possible with the information available today but it's not immediately obvious.

mallow111 · 2023-12-20T18:41:01Z

@jonathan-innis do you agree with the above comments? I am going to close this PR if you do.

jonathan-innis · 2023-12-27T18:36:10Z

@mallow111 I'm mainly interested in the use-case behind you wanting this feature. I'm assuming that it's quite similar to #723? I wonder if #752 works as a better, more comprehensive solution here as a first pass. You can optionally define a duration value on your karpenter.sh/do-not-disrupt annotation, like karpenter.sh/do-not-disrupt: 24h, which says that we should start ignoring the do-not-disrupt annotation 24h after the pod's initial creationTimestamp.

Realistically, I think we need a design doc that maps out the problem and the use-cases, so that we can solve the problems we are hitting here. Putting something like that together, combined with a few solution options, should help us drive toward a solution that doesn't leave nodes around that contain pods that will never go ready.

If it's alright with you, we can close this PR in favor of a design proposal around handling stuck nodes that hang due to misconfiguration of pods that schedule to the nodes.

fix: issue#723 -- Do-not-evict only for running pods

0cdb6f6

mallow111 requested a review from a team as a code owner November 11, 2023 00:54

mallow111 requested a review from tzneal November 11, 2023 00:54

mallow111 mentioned this pull request Nov 11, 2023

"Do not evict" only for running pods #723

Closed

jonathan-innis reviewed Nov 11, 2023

mallow111 added 2 commits November 15, 2023 10:55

wip

894b4df

address comments

e6d78fb

mallow111 force-pushed the add-podrunning branch from 493cbba to e6d78fb Compare November 15, 2023 21:18

Merge branch 'main' into add-podrunning

e67f207

k8s-ci-robot assigned jmdeal Nov 27, 2023

Merge branch 'main' into add-podrunning

d55791e

k8s-ci-robot added the cncf-cla: yes and size/M labels Nov 28, 2023

mallow111 added 2 commits November 30, 2023 15:32

Merge branch 'main' into add-podrunning

f600c30

Merge branch 'main' into add-podrunning

3b7c75d

github-actions bot added the lifecycle/stale label Dec 20, 2023

github-actions bot removed the lifecycle/stale label Dec 21, 2023

mallow111 closed this Jan 3, 2024
