fix: issue#723 -- Do-not-evict only for running pods #778
pkg/utils/pod/scheduling.go
Outdated
@@ -69,6 +69,15 @@ func IsOwnedByNode(pod *v1.Pod) bool {
	})
}

func IsPodReady(pod *v1.Pod) bool {
	for _, condition := range pod.Status.Conditions {
		if condition.Type != v1.PodReady {
I think you may have intended to check that the condition.Type is v1.PodReady and then check whether that condition's status is true, rather than just checking whether there is any condition that isn't the PodReady condition.
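A minimal sketch of the check this comment describes, not the code that landed in the PR. To keep it runnable without external dependencies, the types below are small stand-ins for `v1.Pod`, `v1.PodCondition`, and related constants from `k8s.io/api/core/v1`; the flattened `Conditions` field stands in for `pod.Status.Conditions`.

```go
package main

import "fmt"

// Stand-in types (assumption): simplified mirrors of the k8s.io/api/core/v1
// definitions, declared locally so the sketch compiles on its own.
type PodConditionType string
type ConditionStatus string

const (
	PodReady       PodConditionType = "Ready"
	PodScheduled   PodConditionType = "PodScheduled"
	ConditionTrue  ConditionStatus  = "True"
	ConditionFalse ConditionStatus  = "False"
)

type PodCondition struct {
	Type   PodConditionType
	Status ConditionStatus
}

// Pod holds only the conditions; in the real type they live under Status.
type Pod struct {
	Conditions []PodCondition
}

// IsPodReady reports whether the pod has a Ready condition whose status is True.
// Conditions of other types are skipped rather than treated as a signal.
func IsPodReady(pod *Pod) bool {
	for _, condition := range pod.Conditions {
		if condition.Type == PodReady {
			return condition.Status == ConditionTrue
		}
	}
	// No Ready condition reported at all: treat the pod as not ready.
	return false
}

func main() {
	pod := &Pod{Conditions: []PodCondition{
		{Type: PodScheduled, Status: ConditionTrue},
		{Type: PodReady, Status: ConditionFalse},
	}}
	fmt.Println(IsPodReady(pod)) // prints "false"
}
```

The key difference from the diff above is that the loop matches on the Ready condition and returns its status, instead of branching on the first condition that is not Ready.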
updated
493cbba to e6d78fb
/assign jmdeal
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mallow111 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There are a few concerns that were laid out in the original issue that I don't think are addressed here. The main concern with the original feature request is that it isn't clear which container states are non-recoverable. We don't want to ignore the |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
@jmdeal @jonathan-innis what fix do you recommend here? The original issue has been discussed back and forth, but no solid solution came out of it. What I am proposing is that the do-not-evict annotation only applies to running pods. Pods can be not ready for various reasons. If we want to set a waiting period, how long should it wait? And is it worth the effort to wait for those unready pods?
Our stance was that Karpenter should only ignore the We don't want to only consider the annotation for running pods because it is possible that a pending pod becomes ready between the time of the disruption decision and the time that the pod is removed. IMO what this PR would have to do is determine a set of container states that are truly unrecoverable. Then, if a pod is pending but one of its containers is in an unrecoverable state, we know we could disrupt it. I haven't looked into this deeply myself, but from @njtran's comment on the original issue it doesn't look like there's a standard set of enum values for this. Without this standard set of 'unrecoverable values', anything Karpenter does here would be pretty opinionated and potentially dangerous.
If there is no standard set of enum values, is there a way to move this change forward? Should we put the issue on hold until such a standard comes out? Otherwise I can't see how to proceed with the fix.
Ya, I don't think we can move forward with this unless this core concern is addressed. Like I said, I haven't looked too deeply into it myself, so it may be possible with the information available today, but it's not immediately obvious.
@jonathan-innis do you agree with the above comments? I am going to close this PR if you do.
@mallow111 I'm mainly interested in the use case behind you wanting this feature. I'm assuming that it's quite similar to #723? I wonder if #752 works as a better, more comprehensive solution here as a first pass. You can optionally define a duration value on your Realistically, I think we need a design doc to map out the problem and the use cases so that we can solve the problems that we are hitting here. Putting something like that together, combined with a few solution options, should help us drive toward a solution that doesn't leave nodes around that contain pods that will never go ready. If it's alright with you, we can close this PR in favor of a design proposal around handling stuck nodes that hang due to misconfiguration of pods that schedule to the nodes.
Fixes #723
Description
Treat a pod as non-evictable only if it has the do-not-evict annotation and is also in ready status. Otherwise, when a pod is not in ready status, it can still be evicted even if it has the do-not-evict annotation.
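The rule described above can be sketched roughly as follows. This is an illustration only, not the PR's actual code: the `Pod` struct and `IsEvictable` function are hypothetical, readiness is reduced to a boolean, and the annotation key `karpenter.sh/do-not-evict` with value `"true"` is an assumption about how the annotation is spelled.

```go
package main

import "fmt"

// doNotEvictAnnotation is an assumed key for the annotation discussed here.
const doNotEvictAnnotation = "karpenter.sh/do-not-evict"

// Pod is a hypothetical, simplified stand-in for v1.Pod.
type Pod struct {
	Annotations map[string]string
	Ready       bool // stands in for the result of a readiness check
}

// IsEvictable applies the proposed rule: the annotation only blocks
// eviction while the pod is ready.
func IsEvictable(pod *Pod) bool {
	if pod.Annotations[doNotEvictAnnotation] != "true" {
		// No annotation: the pod is evictable regardless of readiness.
		return true
	}
	// Annotated: evictable only when the pod is not ready.
	return !pod.Ready
}

func main() {
	annotated := map[string]string{doNotEvictAnnotation: "true"}
	fmt.Println(IsEvictable(&Pod{Annotations: annotated, Ready: true}))
	fmt.Println(IsEvictable(&Pod{Annotations: annotated, Ready: false}))
	fmt.Println(IsEvictable(&Pod{Ready: true}))
}
```

With these inputs, only the annotated and ready pod is protected from eviction.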
How was this change tested?
I tested this in my local cluster.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.