
ability to disable a provisioner #2491

Closed
grosser opened this issue Sep 12, 2022 · 24 comments
Labels: feature (New feature or request)

@grosser commented Sep 12, 2022

Tell us about your request
When something goes wrong during node bootstrap, I want to be able to disable the provisioner so it does not add or remove nodes.

Are you currently working around this issue?
No

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@grosser grosser added the feature New feature or request label Sep 12, 2022
@spring1843 (Contributor)

Why not delete the provisioner instead?

@jonathan-innis (Contributor) commented Sep 12, 2022

Deleting the provisioner without the --cascade=orphan flag would also remove the nodes that the provisioner created. You could delete the provisioner, but you would then have to manage the nodes on your own and wouldn't be able to transfer ownership of the nodes back to a Provisioner when debugging is done.

@grosser There are a couple of workarounds that enable "disabling" behavior without explicitly surfacing such a field in the Provisioner API:

  1. Add a .spec.taints entry to the Provisioner that none of the workloads deployed onto the nodes can tolerate (a sketch follows below)
  2. Add requirements to the provisioner that make it impossible to schedule any node, such as a DoesNotExist requirement on a label that Karpenter always applies to the nodes it provisions:

```yaml
requirements:
  - key: "karpenter.sh/capacity-type"
    operator: DoesNotExist
```
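
For the first workaround, a minimal sketch assuming the v1alpha5 Provisioner API; the taint key and value are placeholders, not a Karpenter convention:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Taint every node this provisioner creates; pods that do not
  # tolerate the taint cannot schedule onto these nodes, so no
  # new capacity is requested on their behalf.
  taints:
    - key: example.com/provisioner-disabled  # hypothetical key
      value: "true"
      effect: NoSchedule
```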

@grosser (Author) commented Sep 12, 2022

none of these really work:

  • deleting is not really disabling since I can't turn it back on
  • adding taints does not stop it from shutting down nodes
  • adding requirements does not stop it from shutting down nodes

@jonathan-innis (Contributor) commented Sep 12, 2022

When you say "shutting down nodes", what do you mean? Are you referring to consolidation, emptiness of nodes, or expiration?

@grosser (Author) commented Sep 12, 2022

All of them (though at the moment we don't plan on using expiration).

@jonathan-innis (Contributor)

Could you update the provisioner to add the additional requirement listed above, while also ensuring that ttlSecondsAfterEmpty is not set and consolidation is not enabled?

@jonathan-innis (Contributor)

I'm also trying to get to the root of the use case here. What do you need the provisioner disabled for? My assumption is that you want to debug the node and don't want the provisioner to interfere.

@grosser (Author) commented Sep 12, 2022

If there is, for example, an AZ outage, or we have trouble with node bootstrap (new nodes cannot boot), we want to disable the provisioner (and everything it does) to keep the environment stable.

Also, for blue-green rollouts we want to disable the old provisioner and slowly roll out a new one, so it would be useful there too.

@jonathan-innis (Contributor)

> Could you update the provisioner to add the additional requirement listed above, while also ensuring that ttlSecondsAfterEmpty is not set and consolidation is not enabled?

Can you explain the problems you see with this solution? If you did this to your provisioner, no new nodes would launch and Karpenter wouldn't delete any nodes on your behalf.

The only caveat is that if you went about deleting the provisioner, the cascading deletion would still take effect, but you probably shouldn't be doing that during an AZ outage or a blue-green rollout anyway.
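
Putting that suggestion together, a minimal sketch of such a "disabled" Provisioner, assuming the v1alpha5 API:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Consolidation off, so Karpenter will not replace or remove nodes to save cost.
  consolidation:
    enabled: false
  # ttlSecondsAfterEmpty deliberately left unset, so empty nodes are not reclaimed.
  requirements:
    # karpenter.sh/capacity-type is a label Karpenter always applies,
    # so a DoesNotExist requirement on it can never be satisfied and
    # no new node will be launched.
    - key: "karpenter.sh/capacity-type"
      operator: DoesNotExist
```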

@grosser (Author) commented Sep 12, 2022

Yes, that would be a workable workaround.

But:

  • it's complicated (we'd need tooling to disable and re-enable, and also store the previous state so the change can be undone)
  • it's not future-proof, since we'd need to update the disable logic every time a new feature gets added to Karpenter

@sftim (Contributor) commented Oct 27, 2022

Relevant to #2737

@ellistarn (Contributor) commented Nov 15, 2022

I think we likely need a more holistic solution to control different actions that karpenter takes, e.g.

  • provisioning
  • deletion
  • replacement
  • drift
  • interruption
  • expiration

@ellistarn (Contributor)

Why not set limits to 0? This will stop provisioning, but not deprovisioning.
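
A minimal sketch of that approach, assuming the v1alpha5 API (zeroing CPU alone is enough, since every node has CPU capacity that counts against the limit):

```yaml
spec:
  # With a zero CPU limit, any new node would exceed the limit,
  # so this provisioner cannot launch additional capacity.
  limits:
    resources:
      cpu: "0"
```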

@ellistarn (Contributor)

Do you want this at the provisioner level or globally for Karpenter? It's likely that another provisioner will still work if you just disable one.

@grosser (Author) commented Nov 23, 2022

I'd want to stop both provisioning and deprovisioning.
Global would be a good start, but ideally per-provisioner, so I can for example disable one AZ or one group instead of everything (e.g. during an AZ outage).

@ellistarn (Contributor)

> disable one AZ

This is currently possible by modifying the requirements to exclude the impacted AZ.
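
A minimal sketch of that, assuming the v1alpha5 API; the zone name is an example:

```yaml
spec:
  requirements:
    # Exclude the impacted zone; Karpenter will only provision into
    # the zones still allowed by the remaining requirements.
    - key: "topology.kubernetes.io/zone"
      operator: NotIn
      values: ["us-west-2a"]
```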

@grosser (Author) commented Nov 23, 2022

Would that make the AZ no longer scale down either?
AFAIK all of that is keyed off the provisioner name.

@ellistarn (Contributor)

> Would that make the AZ no longer scale down either?

I'm not grokking the big picture that you're imagining. During an AZ outage, the node lifecycle controller will disable taint-based eviction, so pods won't get evicted, the nodes won't become empty, and scale-down shouldn't occur.

@grosser (Author) commented Nov 25, 2022

During an AZ outage (or another problem in one AZ) we drain the nodes in that AZ of everything except pods with PVCs, but we keep the nodes around (we could delete the ones that end up empty, though).
We block CAS by setting the ASGs to min == max.

Is there some annotation we could set to make Karpenter not scale down certain nodes? (That would be useful for another use case we have too: marking nodes as no-scale when they run long-running pods.) And maybe set a max so it does not scale up.

But overall this feels a bit hacky, and I was hoping an "off switch" would be easier to use.

@ellistarn (Contributor)

> we drain the nodes in that AZ

Highly recommend against this: https://github.com/aws/aws-eks-best-practices/pull/247/files?short_path=19e909f#diff-19e909fa40e2670b64ac33aa645bc59b7b2335cdad31d7d5070d8dddb5cefc70

> Is there some annotation we could set to make Karpenter not scale down certain nodes?

You can always set the karpenter.sh/do-not-evict annotation on a pod to prevent Karpenter from scaling down the node it runs on. You can also add karpenter.sh/do-not-consolidate to a node to disable consolidation of that node.
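
A minimal sketch of both annotations (the "true" values follow the usual convention; the keys are the ones named above):

```yaml
# On a pod: prevents Karpenter from voluntarily evicting it,
# which blocks scale-down of the node it runs on.
metadata:
  annotations:
    karpenter.sh/do-not-evict: "true"
---
# On a node: excludes this node from consolidation.
metadata:
  annotations:
    karpenter.sh/do-not-consolidate: "true"
```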

@grosser (Author) commented Nov 25, 2022

Yeah, good article. The way we drain is that we create copies first and only drain once these copies are healthy, so some of the gotchas don't apply.

A kind of karpenter.sh/do-not-scale-down annotation for nodes would be helpful then (since I assume a node would still be removed if it's semi-empty)
... or a "disable" flag for the provisioner, so we don't have to juggle multiple labels on the node :)

@ellistarn (Contributor)

Playing with this: kubernetes-sigs/karpenter#87

@njtran (Contributor) commented Aug 9, 2023

Hey @grosser, sorry for the delay. Tackling this for provisioning and deprovisioning:

Provisioning: The way we do this in tests and in our own clusters is by setting limits to 0. It won't cause any machines to scale down, but it won't allow any new machines to be scaled up from the provisioner.
Deprovisioning: We're going to add time-based mechanisms with a disruption budget so that you can disable deprovisioning, and we're going to allow users to disable/enable each TTL-based field as part of the v1beta1 discussion.

I'm going to close this, as those are the ways we're planning to let users effectively disable the core mechanisms of Karpenter without having to delete their Provisioners.
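
For reference, the disruption-budget mechanism mentioned here later shipped in the v1beta1 NodePool API (a different resource than this thread's Provisioner); a hedged sketch of what disabling disruption looks like there:

```yaml
spec:
  disruption:
    budgets:
      # A budget of 0 nodes blocks all voluntary disruption
      # (consolidation, drift, expiration) for this NodePool.
      - nodes: "0"
```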

@njtran njtran closed this as completed Aug 9, 2023
@grosser (Author) commented Aug 9, 2023 via email
