Provision to specify the max Drain / Pod evictions attempts in Provisioner #3052

Closed
ra-grover opened this issue Dec 16, 2022 · 15 comments
Labels
deprovisioning (Issues related to node deprovisioning), feature (New feature or request)

Comments

@ra-grover
Contributor

ra-grover commented Dec 16, 2022

Tell us about your request

We have hundreds of nodes in our clusters that are hung in an unschedulable state because Karpenter is not able to drain them; the cause might be a bad application PDB or a pod on the node stuck in ImagePullBackOff that Karpenter tries forever to evict.
I think a flag in the consolidation stanza of the Provisioner spec, such as maxDrainAttempts/maxPodEvictions, would be a good option so that Karpenter does not get stuck indefinitely.
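
For illustration, a rough sketch of what such a knob could look like on the Provisioner (the field names and their placement under the consolidation stanza are hypothetical; maxDrainAttempts/maxPodEvictions do not exist in Karpenter today):

```yaml
# Hypothetical sketch of the proposed fields; they are NOT part of the
# current karpenter.sh/v1alpha5 Provisioner API.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
    # Proposed: stop retrying after this many failed drain attempts per node
    maxDrainAttempts: 10
    # Proposed: cap eviction retries per pod before giving up on graceful eviction
    maxPodEvictions: 5
```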

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

The nodes are stuck in an unschedulable state, which is driving up costs.

Are you currently working around this issue?

We have manually written a cron job to delete the pods stuck in those states, since they prevent Karpenter from terminating the nodes gracefully.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ra-grover ra-grover added the feature New feature or request label Dec 16, 2022
@ra-grover ra-grover changed the title Provision to specify the max Drain attempts in Provisioner Provision to specify the max Drain / Pod evictions attempts in Provisioner Dec 16, 2022
@ellistarn
Contributor

Hey @ra-grover. Not saying no to this request, but I'm curious why it isn't easier for you folks to fix the underlying problem of broken PDBs. If karpenter deprovisions nodes, it's just going to move around the misconfigured pods. I worry that violating fundamental mechanisms of kubernetes (e.g. pdbs) may have unforeseen side effects.

@ra-grover
Contributor Author

Hey @ellistarn, we have mostly fixed the bad PDBs by implementing Gatekeeper rules and keeping them clean. But there are still exceptions, such as pods stuck in ImagePullBackOff, on which Karpenter gets stuck.
The configurable attempts that I am proposing would at least help us keep the test environments cost-effective, and we can set high values for production to make sure that Karpenter has made enough attempts to evict gracefully.

If you still insist on not providing the option, for the right reasons you mentioned :), is there any chance for Karpenter to skip graceful eviction if any of the containers is stuck in ImagePullBackOff or is failing to start?

@ellistarn
Contributor

ellistarn commented Dec 16, 2022

Pods stuck in ImagePullBackOff on which Karpenter gets stuck.

Why doesn't karpenter evict these? Is there a PDB in the way?

@ra-grover
Contributor Author

Apologies for replying late, @ellistarn. I think the PDB gets in the way because the pod is already in a failed state (for whatever reason). I found this long-standing issue, which was resolved only recently, in kubernetes/kubernetes#113375.

I think the workaround I proposed, a configurable maximum number of eviction attempts, would be really helpful.

@ellistarn
Contributor

That upstream issue seems like a great fit. If you enabled that policy, would you still need the karpenter override?
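
For reference, assuming the upstream feature being discussed is the PodDisruptionBudget unhealthyPodEvictionPolicy field (alpha in Kubernetes 1.26 behind the PDBUnhealthyPodEvictionPolicy feature gate, beta in 1.27), enabling it per-PDB looks roughly like this; the PDB name and selector are placeholders:

```yaml
# Sketch assuming the upstream feature is the PDB unhealthyPodEvictionPolicy
# field; requires a cluster version where it is available (feature-gated on 1.26).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app            # placeholder name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
  # Allow the eviction API to evict running-but-unhealthy pods even when the
  # PDB budget would otherwise block the eviction.
  unhealthyPodEvictionPolicy: AlwaysAllow
```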

@ra-grover
Contributor Author

I think we would still need the Karpenter override. We are running 1.22 and won't be able to just upgrade to 1.26 (EKS supports up to 1.24) for that feature. Also, it's behind a feature gate. It would be really helpful to have this override in Karpenter.

@ellistarn
Contributor

Sure -- I'm thinking about the future state though. Do we need a long-term solution in Karpenter, or do we need a short-term fix to tide you over until 1.26? Once that feature is available, is there any value in Karpenter also having an override?

@ra-grover
Contributor Author

I think it's more of a short-term fix only. We won't need the Karpenter override past 1.26.

@ellistarn
Contributor

Let me think on what we can do to help in the near term. One idea:

  • We could ignore ImagePullBackOff pods in termination decisions.
  • It's possible the pod could get stuck evicting due to PDBs.
  • You could implement a cron job that gets and deletes pods that are in ImagePullBackOff (rough sketch below).

wdyt?
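
A minimal sketch of such a cron job, assuming an image that ships kubectl, a shell, and awk, plus a service account bound to list/delete permissions on pods; the name, schedule, and image are illustrative only:

```yaml
# Illustrative sketch: metadata, schedule, image, and service account are assumptions.
# Deletes pods whose reported STATUS is ImagePullBackOff so node drains can proceed.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: delete-imagepullbackoff-pods
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleaner   # needs list + delete on pods, cluster-wide
          restartPolicy: Never
          containers:
            - name: cleaner
              image: bitnami/kubectl:latest  # any image with kubectl, sh, and awk
              command:
                - /bin/sh
                - -c
                - |
                  # STATUS is the 4th column of the --no-headers pod listing
                  kubectl get pods --all-namespaces --no-headers \
                    | awk '$4 == "ImagePullBackOff" {print $1, $2}' \
                    | while read ns pod; do
                        kubectl delete pod -n "$ns" "$pod" --wait=false
                      done
```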

@mhulscher

In the case where nodes are removed for consolidation purposes, does Karpenter only select nodes for scale-down if it is sure that all pods on those nodes can be scheduled somewhere else? AFAIK that is the behavior of the cluster-autoscaler, which I personally think is an acceptable behavior.

Not saying no to this request, but I'm curious why it isn't easier for you folks to fix the underlying problem of broken PDBs.

In several companies I have seen scenarios where the (platform) team building the cluster(s) was not the same team deploying the workload(s); those were development teams. The platform team would be responsible for configuring scheduling of the nodes (with Karpenter, CA, whatever) and dev teams would deploy applications. Sometimes teams would deploy broken PDBs that did not allow disruptions, which is undesirable but not necessarily easy to prevent with policies (Kyverno, OPA, etc.).

Because of this, I personally don't think Karpenter can make the safe assumption that PDBs will eventually allow disruptions. Either it should not consider nodes for scale-down when PDBs don't allow disruptions, or it should forcefully evict pods, disregarding the PDB, if desired (or required after a certain time, to deal with EC2 shutdowns or spot interruptions).

@tzneal
Contributor

tzneal commented Dec 28, 2022

In the case where nodes are removed for consolidation purposes, does Karpenter only select nodes for scale-down if it is sure that all pods on those nodes can be scheduled somewhere else? AFAIK that is the behavior of the cluster-autoscaler, which I personally think is an acceptable behavior.

No, we attempt this first but if we can replace a node with a cheaper node, we will also launch a cheaper replacement and then drain/shutdown the more expensive node.

@njtran
Contributor

njtran commented Jan 3, 2023

@ra-grover looking at this issue again, is it feasible to do what Ellis has mentioned above here? #3052 (comment)

@ra-grover
Contributor Author

You could implement a cron job that gets and deletes pods that are in ImagePullBackOff

We can certainly implement this on our end, but since Karpenter is the one initiating the cordoning/pod eviction, it makes much more sense (at least to me) to have this in Karpenter.

@jonathan-innis
Contributor

Similar to kubernetes-sigs/karpenter#752 in its use-case

@billrayburn billrayburn added the deprovisioning Issues related to node deprovisioning label May 24, 2023
@njtran njtran removed their assignment Jun 27, 2023
@njtran
Contributor

njtran commented Aug 14, 2023

Per the comments above, it sounds like the feature from the linked upstream issue is now beta in 1.27. Considering EKS supports both 1.26 (where it's alpha) and 1.27 (where it's beta), I'll close this one. Feel free to re-open if you disagree.

@njtran njtran closed this as completed Aug 14, 2023

7 participants