MinAge for NodeClaims #1030

Open
jonathan-innis opened this issue Feb 20, 2024 · 16 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@jonathan-innis
Member

Description

What problem are you trying to solve?

We are having some discussion around how to control deprovisioning here: aws/karpenter-provider-aws#1738. One mechanism for adding another layer of control is to allow users to set a minNodeLifetime or minAge on the NodeClaim spec, which would indicate to Karpenter that it can't begin to disrupt the node until that minimum lifetime has passed.

This has been mentioned in a few places across a few different issues (like #1014 and #752) but hasn't been formalized into its own separate issue. It effectively allows users to say that they would not like Karpenter to begin disrupting their nodes until after a fixed amount of time. The benefit is that Karpenter will not launch and then quickly deprovision nodes (even if there are some cost savings), since that churn often means the application never gets a chance to run.

The use-cases for this have some overlap with #752, #795, and #735.
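
Purely as an illustration of the idea (none of this exists in Karpenter today; the field name, its placement, and the duration are all hypothetical), the knob could look roughly like this on a NodeClaim:

    # Hypothetical sketch only -- no minAge/minNodeLifetime field exists yet.
    apiVersion: karpenter.sh/v1beta1
    kind: NodeClaim
    metadata:
      name: example-nodeclaim        # placeholder name
    spec:
      # Proposed: Karpenter may not voluntarily disrupt (consolidate, drift,
      # expire) this node until it has been running for at least this long.
      minAge: 1h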

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 20, 2024
@Bryce-Soghigian
Member

Would minAge be specified on the NodePool in the disruption controls, or on the NodeClaim itself? This issue seems to suggest putting it on the NodeClaim, but that would be rather cumbersome to manage.

I can see value in per-NodeClaim controls, but I would also like to be able to manage this through the NodePool.

@thomaspeitz

Would minAge be specified on the NodePool in the disruption controls? +1

For us, Karpenter just scales too fast, and even with budgets it is hard to control.
I want it to scale every few hours, not every few minutes.

This feature would solve it for me and would be a really nice one!

@Bryce-Soghigian
Member

@thomaspeitz, based on your experience, what do you think a good default value for minAge should be?

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 20, 2024

Seems rather parallel to --scale-down-delay-after-add from the Cluster Autoscaler, where we don't want to get rid of any nodes post scale-up. But minAge seems to solve the problem better, in the sense that only the "New Nodes" are blocked from scale-down.
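
For reference, the Cluster Autoscaler behavior referred to here is a single process flag; something like the following container args (the image tag and the rest of the spec are only illustrative):

    # Illustrative cluster-autoscaler container spec; only
    # --scale-down-delay-after-add is the flag under discussion.
    containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0   # example tag
      command:
      - ./cluster-autoscaler
      - --scale-down-delay-after-add=10m   # no scale-down for 10m after any scale-up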

That being said, additional granularity comes at the cost of more complex configuration for the end user. I would love to build a DisruptionDefaults/ClusterDefaults CRD so that the end user doesn't have to configure the same thing throughout all of their NodePool CRs.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 26, 2024

Is anyone already tackling this, or can I attempt a design + implementation for it?

@sftim

sftim commented Feb 26, 2024

When we come to name the field, try to pick a name that helps people work out the behavior in different scenarios just from the field name:

  • I create a NodePool and I specify that I want nodes to last at least an hour. I run some pods and end up with three healthy nodes.
  • After 20 minutes, I delete one NodeClaim using kubectl. How long does the claimed node survive for?
  • After a further 30 minutes, I have three healthy nodes, one only half an hour old. I delete the NodePool. How long does the NodePool take to delete?

@thomaspeitz

Here is what I did to approximate minAge (kind of). I went from no budgets (old release), where the graphs showed in detail how bad the churn was, to this config, which only allows scale-down during the first 10 minutes of each hour:

    # NodePool spec.disruption.budgets: allow at most one node to be disrupted
    # at a time during the first 10 minutes of every hour, and none for the
    # remaining 50 minutes.
    budgets:
    - nodes: "1"
      schedule: "0 * * * *"    # starting at minute 0 of every hour
      duration: 10m
    - nodes: "0"
      schedule: "10 * * * *"   # starting at minute 10 of every hour
      duration: 50m

This reduced the (wrong) scaling down a lot. [Graphs omitted: blue line is allocatable CPU, green line is requested cores.]

--scale-down-delay-after-add
This Cluster Autoscaler flag helps to reduce the scale-down.

I would suggest allowing a scale-down delay after add to be configured. This would solve a lot of problems from my point of view: I could just block scale-down for 15m, which would mean less unneeded interruption. Combined with a node minAge of e.g. 15m (see the sketch after this list), this would ensure:

  • Nodes are rotated only after they have been running for a while (less image-pull cost on the node) -> less traffic
  • Nodes are scaled down only if they are not needed for x amount of time.
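
A sketch of what that combination could look like at the NodePool level; both fields below are hypothetical and are only named here to illustrate the suggestion:

    # Hypothetical sketch -- neither field exists in Karpenter today.
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default                    # placeholder name
    spec:
      disruption:
        scaleDownDelayAfterAdd: 15m    # hypothetical: block scale-down for 15m after any scale-up
        minAge: 15m                    # hypothetical: a node must run at least 15m before disruption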

The graphs are from the production setup of a client of mine, which has a lot of HPAs running.

In the end, I want to avoid nodes getting scaled down just because some HPA scales down for a short amount of time.
The scaling is too fast and creates unneeded pod rotations. Devs are unhappy because their pods are restarting all the time.

@jonathan-innis
Member Author

One thing that we should consider here in the design is the overlap with #735. For you @thomaspeitz, I suspect that either this issue or that issue could help solve your problem. The question that I want to answer here is: what use-cases does minAge address that consolidateAfter doesn't? Really, consolidateAfter says, "wait for nodes to be in an under-utilized state for this long before spinning them down", while minAge says, "don't deprovision me for any disruption reason for this long after you have initially spun me up". Both are valid, but I question the utility of blocking drift on a node after it has been launched, particularly if its configuration is no longer valid.

Does consolidateAfter just cover everything?
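
For context, consolidateAfter is the existing knob in the NodePool disruption block; roughly like the sketch below (values are illustrative, and in the current v1beta1 API it pairs with consolidationPolicy: WhenEmpty, with the extension to under-utilized nodes being what the linked issue discusses):

    # Existing-style NodePool disruption settings; values are illustrative.
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    spec:
      disruption:
        consolidationPolicy: WhenEmpty   # consolidateAfter currently pairs with WhenEmpty
        consolidateAfter: 30m            # wait this long before consolidating a node away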

@jonathan-innis
Member Author

That being said, additional granularity comes at the cost of more complex configuration for the end user. I would love to build a DisruptionDefaults/ClusterDefaults CRD so that the end user doesn't have to configure the same thing throughout all of their NodePool CRs

I'm personally less of a fan of cluster-wide defaults in a separate CRD, since I think that any default you want to set on an individual resource could just be managed by your IaC, Kustomize, Helm, or GitOps tooling. @Bryce-Soghigian Have you heard from users that managing the defaults across resources is cumbersome with this tooling?

We've tended to index towards taking away global configuration (the AWS provider removed global tagging in favor of NodeClass-based tagging) exactly for this reason: if you have the granularity, then you just need a templating engine to manage that consistency across resources.
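
As one illustration of the templating point (file names and values here are made up), a Kustomize inline patch can stamp the same disruption settings onto every NodePool rather than relying on a global default:

    # kustomization.yaml -- hypothetical example of sharing one disruption
    # block across all NodePool resources.
    resources:
    - nodepool-general.yaml
    - nodepool-gpu.yaml
    patches:
    - target:
        kind: NodePool
      patch: |-
        apiVersion: karpenter.sh/v1beta1
        kind: NodePool
        metadata:
          name: unused          # placeholder; matching is done by the target selector
        spec:
          disruption:
            consolidationPolicy: WhenEmpty
            consolidateAfter: 30m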

@thomaspeitz

For me, the answer to "Does consolidateAfter just cover everything?" is yes.

MinAge would ensure that old nodes are terminated with a higher frequency than the newest ones.

But that's what we have the TTL for anyway.

So yeah, I am fine with consolidateAfter and don't need an implementation of minAge.
MinAge would only treat a symptom; consolidateAfter would solve the problem.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 28, 2024

but I question the utility of blocking drift on a node after it has initially been launched, particularly if its configuration is no longer valid.

Someone just opened an issue requesting some flavor of this type of functionality; see here.

DisruptAfter or MinAge would allow blocking all churn or disruption for a specified amount of time. Some users value the SLO-like guarantee that you will get this node for, say, 1 week and nothing in the system will automatically remove it.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 28, 2024

Have you heard from users that managing the defaults across resources is cumbersome to do with this tooling?

You are right, of course, that you can solve the problem with IaC solutions.

I have also heard the customer narrative, in videos about Karpenter, describing it as harder to configure and understand than Cluster Autoscaler. That raises the question: why? CAS has 100 flags, so what makes it easier to understand?

"Harder to understand" makes sense, since we are working with more dimensions via pod- and node-level controls, node replacements, disruptions, etc. There are pros to making this a bit more complex.

With CAS, you configure everything in a single place, the flags, and all changes to autoscaling behavior go through that one interface. Karpenter has 3 CRDs for managing this control, plus the Helm/deployment configuration values.

Adding a fourth global configuration CRD may not make the configuration process easier, but I would like us to think about ways we can make configuration simpler, and to add observability so users can understand why Karpenter makes the decisions it does.

@jonathan-innis
Member Author

I have also heard the customer narrative when watching videos on karpenter describing as harder to configure and understand than cluster autoscaler.

Interesting, would love a link to the videos just to hear directly the reasoning around why people think that we are more confusing. I agree that if this is the customer sentiment, we should drive to make our configuration simpler to understand.

@edas-smith

+1 for this feature

@Bryce-Soghigian
Member

Bryce-Soghigian commented Apr 2, 2024

would love a link to the videos just to hear directly the reasoning around why people think that we are more confusing

Let me see if I can dig up some of those videos.

@Bryce-Soghigian
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 2, 2024