Very quick turnover in nodes #1014
Comments
Can you share the pod specs, NodePools, and EC2NodeClasses that are used here?
Another thing that would be good to know here is whether the consolidation actions are occurring against nodes that are completely empty. Are those nodes actually receiving the pods that we intend to schedule to them, or is something happening on the nodes that's causing us to think we can no longer schedule a pod to them?
The nodes aren't completely empty and pods are getting scheduled. There are many pods running on these nodes and we use KEDA, so it's not easy to track down issues. But what I'm getting at here is that it is preferable to have a minimum uptime for each node: if nodes are churning too much, we are paying for things to boot up (the node, then the pods) rather than for running the applications. In cluster-autoscaler this delay is 10 minutes and is configurable - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work. I think Karpenter should do the same.
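For reference, the closest knob Karpenter exposed around v0.33 is the NodePool `disruption` block. The sketch below assumes the v1beta1 API; note that in these versions `consolidateAfter` is only honored together with `consolidationPolicy: WhenEmpty`, so it delays consolidation of empty nodes but is not a true minimum uptime for nodes that still have pods.

```yaml
# Sketch only: a minimal Karpenter v1beta1 NodePool (around v0.33).
# consolidateAfter delays consolidation of *empty* nodes; it cannot be combined
# with consolidationPolicy: WhenUnderutilized in these versions.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default                 # hypothetical name
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 10m       # roughly analogous to cluster-autoscaler's 10-minute scale-down delay
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        name: default           # hypothetical EC2NodeClass name
```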
So, do you believe this is truly a bug, or an additional layer of configuration that Karpenter should offer? It's hard to tell from the logs and the scaling what might be happening and whether Karpenter is actually doing something wrong based on our current feature set today.
For this, we have actually thought a bit about adding something like it. I'd take a look at #735, which talks about allowing a user-configurable amount of time after we see a node as underutilized before we actually go and disrupt it, and aws/karpenter-provider-aws#1738, which covers the whole set of deprovisioning controls that we are thinking about.
As I say, it's difficult to know whether it is a bug; I just think it's quite odd that Karpenter is removing newly provisioned nodes. The logs don't give you much information: they don't even include the NodePool from which it is consolidating, let alone the reasoning for that node being picked over others. See here again.
I think, however, that if there is no protection over how long to leave a node after it becomes consolidatable, then it is a feature request.
This one is tough only because we can do consolidations across NodePools, so the decision-making process can't log this, or be quite that specific.
We just added detail in this change (#1025) that should at least give us more information on the number of pods that are coming off of the node, so we can see a little more clearly how many pods are actually being disrupted by the operation. As for the node's interaction with KEDA scaling and pods, it sounds like you may also be looking for something like this issue on top of that. If this is causing a lot of churn, I'd recommend one of two things: 1) Only enable the
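The recommendation above is cut off in the thread, so the following is only a general illustration and not necessarily what the maintainer had in mind: in Karpenter's v1beta1 API, pods can opt out of voluntary disruption with the `karpenter.sh/do-not-disrupt` annotation, which keeps the node they run on from being consolidated while those pods are running.

```yaml
# Sketch only: a hypothetical Deployment whose pods block voluntary node
# disruption (consolidation, drift, expiration) via the do-not-disrupt
# annotation available in Karpenter v1beta1 (v0.32+).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-sensitive-app     # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: churn-sensitive-app
  template:
    metadata:
      labels:
        app: churn-sensitive-app
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # node hosting this pod won't be voluntarily disrupted
    spec:
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:latest   # placeholder image
```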
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
Description
Observed Behavior:
Expected Behavior:
Ideally nodes should have a minimum run time
Reproduction Steps (Please include YAML):
Versions:
- Chart Version: 0.33.2
- Kubernetes Version (kubectl version):