MinAge for NodeClaims #1030

Open
jonathan-innis opened this issue Feb 20, 2024 · 16 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@jonathan-innis
Member

Description

What problem are you trying to solve?

We are having some discussion around how to control deprovisioning here: aws/karpenter-provider-aws#1738. One mechanism for adding another layer of control is to allow users to set a minNodeLifetime or minAge on the NodeClaim spec, which would indicate to Karpenter that it can't begin to disrupt the node until that minimum lifetime has passed.

This has been mentioned in a few places across a few different issues (like #1014 and #752) but hasn't been formalized into its own separate issue. It effectively allows users to say that they would not like Karpenter to begin disrupting their nodes until after a fixed amount of time. The benefit is that Karpenter will not launch and then quickly deprovision nodes (even if there are some cost savings), since that churn often means the application never gets a chance to run.

The use-cases for this have some overlap with #752, #795, and #735.
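
Purely as an illustration of the idea (none of this exists in Karpenter today; the field name, its placement, and the duration are all hypothetical), the knob could look roughly like this on a NodeClaim:

    # Hypothetical sketch only -- no minAge/minNodeLifetime field exists yet.
    apiVersion: karpenter.sh/v1beta1
    kind: NodeClaim
    metadata:
      name: example-nodeclaim        # placeholder name
    spec:
      # Proposed: Karpenter may not voluntarily disrupt (consolidate, drift,
      # expire) this node until it has been running for at least this long.
      minAge: 1h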

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 20, 2024
@Bryce-Soghigian
Member

Would minAge be specified on the NodePool in the disruption controls, or on the NodeClaim itself? This issue seems to suggest putting it on the NodeClaim, but that would be rather cumbersome to manage.

I can see value in per-NodeClaim controls, but I would also like to be able to manage this through the NodePool.

@thomaspeitz

Would minAge be specified on the NodePool in the disruption controls? +1

For us, Karpenter just scales too fast, and even with budgets it is hard to control.
I want it to scale every few hours, not every few minutes.

This feature would solve it for me and would be a really nice one!

@Bryce-Soghigian
Member

@thomaspeitz, based on your experience, what do you think a good default value for minAge should be?

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 20, 2024

Seems rather parallel to --scale-down-delay-after-add from the Cluster Autoscaler, where we don't want to get rid of any nodes post scale-up. But minAge seems to solve the problem better, in the sense that only the "New Nodes" are blocked from scale-down.
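
For reference, the Cluster Autoscaler behavior referred to here is a single process flag; something like the following container args (the image tag and the rest of the spec are only illustrative):

    # Illustrative cluster-autoscaler container spec; only
    # --scale-down-delay-after-add is the flag under discussion.
    containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0   # example tag
      command:
      - ./cluster-autoscaler
      - --scale-down-delay-after-add=10m   # no scale-down for 10m after any scale-up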

That being said, additional granularity comes at the cost of more complex configuration for the end user. I would love to build a DisruptionDefaults/ClusterDefaults CRD so that the end user doesn't have to configure the same thing throughout all of their NodePool CRs.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 26, 2024

Is anyone already tackling this, or can I attempt a design + implementation for it?

@sftim

sftim commented Feb 26, 2024

When we come to name the field, try to pick a name that helps people work out the behavior in different scenarios just from the field name:

  • I create a NodePool and I specify that I want nodes to last at least an hour. I run some pods and end up with three healthy nodes.
  • After 20 minutes, I delete one NodeClaim using kubectl. How long does the claimed node survive for?
  • After a further 30 minutes, I have three healthy nodes, one only half an hour old. I delete the NodePool. How long does the NodePool take to delete?

@thomaspeitz

Here is what I did to approximate minAge (kind of). I went from no budgets (old release), where the graphs showed in detail how bad the churn was, to this config, which only allows scale-down during the first 10 minutes of each hour:

    # NodePool spec.disruption.budgets: allow at most one node to be disrupted
    # at a time during the first 10 minutes of every hour, and none for the
    # remaining 50 minutes.
    budgets:
    - nodes: "1"
      schedule: "0 * * * *"    # starting at minute 0 of every hour
      duration: 10m
    - nodes: "0"
      schedule: "10 * * * *"   # starting at minute 10 of every hour
      duration: 50m

This reduced the (wrong) scaling down a lot. [Graphs omitted: blue line is allocatable CPU, green line is requested cores.]

--scale-down-delay-after-add
This Cluster Autoscaler flag helps to reduce the scale-down.

I would suggest allowing a scale-down delay after add to be configured. This would solve a lot of problems from my point of view: I could just block scale-down for 15m, which would mean less unneeded interruption. Combined with a node minAge of e.g. 15m (see the sketch after this list), this would ensure:

  • Nodes are rotated only after they have been running for a while (less image-pull cost on the node) -> less traffic
  • Nodes are scaled down only if they are not needed for x amount of time.
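
A sketch of what that combination could look like at the NodePool level; both fields below are hypothetical and are only named here to illustrate the suggestion:

    # Hypothetical sketch -- neither field exists in Karpenter today.
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default                    # placeholder name
    spec:
      disruption:
        scaleDownDelayAfterAdd: 15m    # hypothetical: block scale-down for 15m after any scale-up
        minAge: 15m                    # hypothetical: a node must run at least 15m before disruption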

The graphs are from the production setup of a client of mine, which has a lot of HPAs running.

In the end, I want to avoid nodes getting scaled down just because some HPA scales down for a short amount of time.
The scaling is too fast and creates unneeded pod rotations. Devs are unhappy because their pods are restarting all the time.

@jonathan-innis
Member Author

One thing that we should consider here in the design is the overlap with #735. For you @thomaspeitz, I suspect that either this issue or that issue could help solve your problem. The question that I want to answer here is: what use-cases does minAge address that consolidateAfter doesn't? Really, consolidateAfter says, "wait for nodes to be in an under-utilized state for this long before spinning them down", while minAge says, "don't deprovision me for any disruption reason for this long after you have initially spun me up". Both are valid, but I question the utility of blocking drift on a node after it has been launched, particularly if its configuration is no longer valid.

Does consolidateAfter just cover everything?
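
For context, consolidateAfter is the existing knob in the NodePool disruption block; roughly like the sketch below (values are illustrative, and in the current v1beta1 API it pairs with consolidationPolicy: WhenEmpty, with the extension to under-utilized nodes being what the linked issue discusses):

    # Existing-style NodePool disruption settings; values are illustrative.
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    spec:
      disruption:
        consolidationPolicy: WhenEmpty   # consolidateAfter currently pairs with WhenEmpty
        consolidateAfter: 30m            # wait this long before consolidating a node away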

@jonathan-innis
Member Author

That being said, additional granularity comes at the cost of more complex configuration for the end user. I would love to build a DisruptionDefaults/ClusterDefaults CRD so that the end user doesn't have to configure the same thing throughout all of their NodePool CRs

I'm personally less of a fan of cluster-wide defaults in a separate CRD, since I think that any default you want to set on an individual resource could just be managed by your IaC, Kustomize, Helm, or GitOps tooling. @Bryce-Soghigian Have you heard from users that managing the defaults across resources is cumbersome with this tooling?

We've tended to index towards taking away global configuration (the AWS provider removed global tagging in favor of NodeClass-based tagging) exactly for this reason: if you have the granularity, then you just need a templating engine to manage that consistency across resources.
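
As one illustration of the templating point (file names and values here are made up), a Kustomize inline patch can stamp the same disruption settings onto every NodePool rather than relying on a global default:

    # kustomization.yaml -- hypothetical example of sharing one disruption
    # block across all NodePool resources.
    resources:
    - nodepool-general.yaml
    - nodepool-gpu.yaml
    patches:
    - target:
        kind: NodePool
      patch: |-
        apiVersion: karpenter.sh/v1beta1
        kind: NodePool
        metadata:
          name: unused          # placeholder; matching is done by the target selector
        spec:
          disruption:
            consolidationPolicy: WhenEmpty
            consolidateAfter: 30m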

@thomaspeitz

For me, the answer to "Does consolidateAfter just cover everything?" is yes.

MinAge would ensure that old nodes are terminated with a higher frequency than the newest ones.

But that's what we have the TTL for anyway.

So yeah, I am fine with consolidateAfter and don't need an implementation of minAge.
MinAge would only treat a symptom; consolidateAfter would solve the problem.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 28, 2024

but I question the utility of blocking drift on a node after it has initially been launched, particularly if its configuration is no longer valid.

Someone just opened an issue requesting some flavor of this type of functionality; see here.

DisruptAfter or MinAge would allow blocking all churn or disruption for a specified amount of time. Some users value the SLO-like guarantee that you will get this node for, say, 1 week and nothing in the system will automatically remove it.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 28, 2024

Have you heard from users that managing the defaults across resources is cumbersome to do with this tooling?

You are right, of course, that you can solve the problem with IaC solutions.

I have also heard the customer narrative, in videos about Karpenter, describing it as harder to configure and understand than Cluster Autoscaler. That raises the question: why? CAS has 100 flags, so what makes it easier to understand?

"Harder to understand" makes sense, since we are working with more dimensions via pod- and node-level controls, node replacements, disruptions, etc. There are pros to making this a bit more complex.

With CAS, you configure everything in a single place, the flags, and all changes to autoscaling behavior go through that one interface. Karpenter has 3 CRDs for managing this control, plus the Helm/deployment configuration values.

Adding a fourth global configuration CRD may not make the configuration process easier, but I would like us to think about ways we can make configuration simpler, and to add observability so users can understand why Karpenter makes the decisions it does.

@jonathan-innis
Member Author

I have also heard the customer narrative when watching videos on karpenter describing as harder to configure and understand than cluster autoscaler.

Interesting, would love a link to the videos just to hear directly the reasoning around why people think that we are more confusing. I agree that if this is the customer sentiment, we should drive to make our configuration simpler to understand.

@edas-smith

+1 for this feature

@Bryce-Soghigian
Member

Bryce-Soghigian commented Apr 2, 2024

would love a link to the videos just to hear directly the reasoning around why people think that we are more confusing

Let me see if I can dig up some of those videos.

@Bryce-Soghigian
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 2, 2024