Support for stacked/cooperative autoscaling or adoption of 3rd party created node claims #920

Open
gilqubex opened this issue Jan 3, 2024 · 14 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@gilqubex

gilqubex commented Jan 3, 2024

Description

What problem are you trying to solve?

(In continuation of a Slack conversation in #karpenter-dev on Kubernetes Slack.)

As a 3rd party independent software vendor, I would like Karpenter to allow for cooperative autoscaling - that is, two or more autoscalers sharing responsibility for the same cluster.

While having more than one autoscaler start instances is generally benign (at most, one ends up with temporary overcapacity), more than one autoscaler trying to downscale at the same time may very well cause self-induced unavailability.

In v1alpha5, the "linking" mechanism enabled Karpenter to "adopt" foreign instances and thus remain the sole cluster downscaler. This, however, was incidental: adoption was not the original purpose of the linking mechanism. Perhaps a proper feature aimed at this goal can now be designed.

Absent cooperative autoscaling, we have observed 3rd party vendors requiring customers to migrate off of Karpenter (or Cluster Autoscaler, for that matter), which (at least in our scenario as a specialized autoscaler) is undesirable.

As a 3rd party vendor, we plan to assume all integration responsibility ourselves, but the most basic supporting mechanism (e.g. not rejecting foreign NodeClaims) should hopefully exist.

How important is this feature to you?

This is very important to us as we would like to continue supporting cooperative autoscaling from v1beta1 onwards.

@gilqubex gilqubex added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 3, 2024
@Bryce-Soghigian
Member

Bryce-Soghigian commented Jan 3, 2024

Have you thought about natively implementing your instance launch in Karpenter via an implementation of the cloud provider interface? It seems that bringing your own VMs means you effectively take on, inside Karpenter, the responsibility for launching/creating instances (the responsibility of the cloud provider).

You can have cloudprovider.Create() wake up a hibernated instance, and cloudprovider.Delete() put it back to sleep.
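For illustration, a minimal sketch of that idea, assuming the v1beta1 CloudProvider interface shape from sigs.k8s.io/karpenter/pkg/cloudprovider and EC2 hibernation via the AWS SDK. The WarmPool helper and the provider-ID handling are hypothetical, and the remaining interface methods (Get, List, GetInstanceTypes, IsDrifted, Name, ...) are omitted:

```go
package warmpool

import (
	"context"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"

	"sigs.k8s.io/karpenter/pkg/apis/v1beta1"
)

// WarmPool is a hypothetical helper that tracks the vendor's hibernated
// instances and picks one compatible with a NodeClaim's requirements.
type WarmPool interface {
	Reserve(ctx context.Context, nodeClaim *v1beta1.NodeClaim) (instanceID string, err error)
}

type WarmPoolProvider struct {
	ec2Client *ec2.Client
	pool      WarmPool
}

// Create wakes a hibernated instance instead of launching a new one.
func (p *WarmPoolProvider) Create(ctx context.Context, nodeClaim *v1beta1.NodeClaim) (*v1beta1.NodeClaim, error) {
	instanceID, err := p.pool.Reserve(ctx, nodeClaim)
	if err != nil {
		return nil, err
	}
	if _, err := p.ec2Client.StartInstances(ctx, &ec2.StartInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		return nil, err
	}
	// Simplified: real provider IDs look like aws:///<az>/<instance-id>.
	nodeClaim.Status.ProviderID = "aws:///" + instanceID
	return nodeClaim, nil
}

// Delete hibernates the instance and returns it to the pool, rather than
// terminating it.
func (p *WarmPoolProvider) Delete(ctx context.Context, nodeClaim *v1beta1.NodeClaim) error {
	providerID := nodeClaim.Status.ProviderID
	instanceID := providerID[strings.LastIndex(providerID, "/")+1:]
	_, err := p.ec2Client.StopInstances(ctx, &ec2.StopInstancesInput{
		InstanceIds: []string{instanceID},
		Hibernate:   aws.Bool(true),
	})
	return err
}
```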


Link to original slack thread: https://kubernetes.slack.com/archives/C04JW2J5J5P/p1703276514443079

@gilqubex
Author

gilqubex commented Jan 3, 2024

> Have you thought about natively implementing your instance launch in Karpenter via an implementation of the cloud provider interface? It seems that bringing your own VMs means you effectively take on, inside Karpenter, the responsibility for launching/creating instances (the responsibility of the cloud provider).
>
> Link to original slack thread: https://kubernetes.slack.com/archives/C04JW2J5J5P/p1703276514443079

Before referring to the specific use case of our software, I have to say that I cannot confirm that this is a sufficient solution for all third party cooperative autoscalers. Even if we could "squeeze in", I believe this degree of freedom is very likely to be limiting for other use cases.

As for the specifics of our case as a third party: we use hibernated instances; in order for Karpenter to support our use case, warm pool functionality must be implemented.
This, however, is somewhat circular: we would still be replacing instances under Karpenter's watch, just through the warm pool. If Karpenter's warm pool implementation still used NodeClaims to reject foreign nodes, we would be back at the same starting point; and if it didn't, well, why not support foreign nodes to begin with?

Edit: another example of a missing feature is scale-in protection (i.e., this instance won't be scaled in for X minutes). Our current way of implementing this is to delay the transfer of responsibility to Karpenter for those X minutes.

@Bryce-Soghigian
Member

> Edit: another example of a missing feature is scale-in protection (i.e., this instance won't be scaled in for X minutes). Our current way of implementing this is to delay the transfer of responsibility to Karpenter for those X minutes.

Nodes can be opted out of consolidation disruption by setting the annotation karpenter.sh/do-not-disrupt: "true" on the node. Do you need more granular controls? Like do not disrupt until after x minutes?

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"
```

https://karpenter.sh/preview/concepts/disruption/#node-level-controls

@gilqubex
Author

gilqubex commented Jan 3, 2024

> Edit: another example of a missing feature is scale-in protection (i.e., this instance won't be scaled in for X minutes). Our current way of implementing this is to delay the transfer of responsibility to Karpenter for those X minutes.
>
> Nodes can be opted out of consolidation disruption by setting the annotation karpenter.sh/do-not-disrupt: "true" on the node. Do you need more granular controls? Like do not disrupt until after x minutes?
>
> ```yaml
> apiVersion: v1
> kind: Node
> metadata:
>   annotations:
>     karpenter.sh/do-not-disrupt: "true"
> ```
>
> https://karpenter.sh/preview/concepts/disruption/#node-level-controls

The mechanism itself is technically sufficient, and we actually plan to slightly improve our current scale-in protection implementation with it. Our concerns are:

(A) We have not been able to affirm that this applies to all forms of possible downscale: drift, emptiness, consolidation...
(B) We risk a race condition if the annotation is applied after the fact rather than during node object creation. Currently we plan to create the node object before the relevant machine object (see the sketch below).
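
For concreteness, here is a sketch of the plan in (B), using client-go (names here are illustrative, not our actual code): the Node object is pre-created with the annotation already present, so there is no window in which the node exists unprotected.

```go
package nodeprotect

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createProtectedNode pre-creates the Node object with the do-not-disrupt
// annotation already set; when the kubelet later registers, it merges into
// an object that was never eligible for disruption.
func createProtectedNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name: nodeName,
			Annotations: map[string]string{
				"karpenter.sh/do-not-disrupt": "true",
			},
		},
	}
	_, err := client.CoreV1().Nodes().Create(ctx, node, metav1.CreateOptions{})
	return err
}
```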

@Bryce-Soghigian
Member

Bryce-Soghigian commented Jan 3, 2024

For (A):
All disruption methods are blocked via that annotation. Each disruption method calls disrupt, and in each disrupt call we call a function called GetCandidates to get its disruption candidates.

In this function, in the NewCandidate filter, we check for the disruption annotation. So it should cover all possible forms of automated scale-down.
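
Roughly, the shape of that check (a simplified sketch, not the verbatim upstream code):

```go
package disruption

import corev1 "k8s.io/api/core/v1"

// Simplified sketch: every disruption method (consolidation, drift,
// emptiness, expiration) draws from the same candidate set, and any node
// carrying the do-not-disrupt annotation never enters it.
func eligibleForDisruption(node *corev1.Node) bool {
	return node.Annotations["karpenter.sh/do-not-disrupt"] != "true"
}
```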

@jonathan-innis @njtran feel free to correct any mistakes in my explanation, I think the code here speaks for itself however.

@jonathan-innis jonathan-innis removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 3, 2024
@jonathan-innis
Member

> Like do not disrupt until after x minutes

Agree with @Bryce-Soghigian that a good way of doing this is to add the karpenter.sh/do-not-disrupt annotation to the NodePool spec.template.metadata so that it automatically propagates to all NodeClaims created by the NodePool. The trade-off (right now) is that we don't have a concept of a time window for which this annotation is respected. Realistically, we are trying to get #752 in so that we could have the concept of "respect this do-not-disrupt annotation up until the grace period (a set time past the creationTimestamp)", which I think would cover the use case you are trying to achieve here.
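
In other words, the semantics would be roughly this (a sketch of the proposed behavior, assuming #752 lands in this shape; not shipped behavior):

```go
package disruption

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// Sketch of the proposal's semantics: the do-not-disrupt annotation is
// honored only until a grace period past the node's creationTimestamp;
// after that, the node becomes disruptable again.
func doNotDisruptHonored(node *corev1.Node, gracePeriod time.Duration) bool {
	if node.Annotations["karpenter.sh/do-not-disrupt"] != "true" {
		return false
	}
	return time.Since(node.CreationTimestamp.Time) < gracePeriod
}
```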

> in order for Karpenter to support our use case, warm pool functionality must be implemented

I'd be interested to see if we can just knock out each of the current gaps that you are identifying in Karpenter today, rather than delegating Karpenter as purely a down-scaler. IMO, this breaks a lot of assumptions that we make around safety when we are walking through disrupting nodes, since we assume that we can launch replacements for these nodes and ensure that replacements are safely running and a node is fully drained before continuing on. With your current implementation with linking, how do you get around these assumptions that Karpenter makes around its ability to launch replacements? I'm assuming that Karpenter might be fighting with whichever up-scaling solution you are using when it tries to disrupt existing nodes with pods.

@gilqubex
Author

gilqubex commented Jan 3, 2024

> in order for Karpenter to support our use case, warm pool functionality must be implemented
>
> I'd be interested to see if we can just knock out each of the current gaps that you are identifying in Karpenter today, rather than delegating Karpenter as purely a down-scaler. IMO, this breaks a lot of assumptions that we make around safety when we are walking through disrupting nodes, since we assume that we can launch replacements for these nodes and ensure that replacements are safely running and a node is fully drained before continuing on. With your current implementation with linking, how do you get around these assumptions that Karpenter makes around its ability to launch replacements? I'm assuming that Karpenter might be fighting with whichever up-scaling solution you are using when it tries to disrupt existing nodes with pods.

When we request that Karpenter take responsibility for our nodes, we assign them to a provisioner, and we would be the ones at fault if they are not compatible (including from the customer's perspective). We take extreme care to verify compatibility: we monitor every Karpenter configuration change and adapt our configuration to match.

It works beautifully with v1alpha5: both our scaler and Karpenter start instances, and once "adopted", Karpenter processes those nodes successfully if and when it downscales them.

@garvinp-stripe
Contributor

Wonder if it's somewhat related to #749.

@sftim

sftim commented Jan 29, 2024

> When we request that Karpenter take responsibility for our nodes, we assign them to a provisioner, and we would be the ones at fault if they are not compatible (including from the customer's perspective). We take extreme care to verify compatibility: we monitor every Karpenter configuration change and adapt our configuration to match.

For v1 of NodePool and friends, we could aim to define the exact details of which party is responsible for what. This might be:

  • Karpenter doesn't support adoption; you can try it but if it breaks you get to keep both halves
  • Support for Karpenter node adoption is determined by the specific provider integration you use
  • Karpenter will adopt your nodes subject to [conditions], but may replace the node at will

etc.

@njtran njtran changed the title Minimal support for stacked/cooperative autoscaling Support for stacked/cooperative autoscaling or adoption of 3rd party created node claims Feb 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 6, 2024
@njtran
Contributor

njtran commented Jun 12, 2024

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 12, 2024
@arikkfir

arikkfir commented Oct 8, 2024

Thank you everyone for helping out on this 🙏
This is indeed still needed as of Karpenter v1.

Given aws/karpenter-provider-aws#6601 and this issue, what would be the best course of action for getting our nodes "adopted" by a v1 Karpenter? Even a workaround would work for us for now.

Things we've considered:

  1. Creating a new EC2 machine (as we do today) with appropriate tags: Karpenter seems to delete it immediately.
  2. Creating a new NodeClaim seems to get Karpenter to immediately create another EC2 machine, not allowing us the chance to update its status subresource to point to our instance.
  3. Any other suggestions that might work? Even in an interim until a full solution is in place.

@bartosz-qubex FYI :)

Thanks again for everyone in the thread!

@sftim

sftim commented Dec 24, 2024

@arikkfir you could send in a PR that proposes the definition you think people will agree to, and see if it gets consensus.
