Reconcile nodes that are out of spec #1457
Comments
Thanks for opening the issue.
I believe the linked issue above encompasses both problems here, and we've definitely heard this multiple times. We're exploring better mechanisms for node upgrades, which could include this issue. The main point is covered in the other, similar issue, as mentioned by Ellis: #1430 (comment)
I'm going to close this issue in favor of tracking it in #1018. Feel free to add more about your use-case. This is an important issue that we should take the time to design properly, since it changes the way provisioners think about the world.
I was specifically asked in #1430 to create a new issue, so I'm a bit confused. I think I've addressed many points in that comment: reconciling is not itself disruptive/impactful, expirationTTLs are not sufficient, and orchestration on top of Karpenter would be ill-equipped to determine which nodes need reconciliation.
I also challenge whether this is a duplicate of #1018. One possible result would be for Karpenter to identify and label nodes that are out of spec, leaving it to external orchestration to then reconcile them.
Sorry @johngmyers, we got our wires crossed. I think it's a good idea to track this request here. It's sufficiently different from #1018.
Ah, sorry about that. Didn't quite communicate well on our end.
At least it's eventually consistent :-)
The Tarn Tran train sometimes goes off the rails (trails?).
Well, now that I understand, thanks for cutting this. This is a pretty interesting idea. Although the Provisioner is mutable, Karpenter doesn't modify any existing nodes it has created; it only deletes them under certain circumstances. It would be great if nodes could conform to the provisioner spec, provided we can guarantee it won't break existing scheduling constraints or other workloads. Here are the places where we could modify the Provisioner in ways that may affect the nodes it has provisioned. The ProvisionerSpec.Constraints are separated into Kubernetes constraints and AWS constraints. Right off the bat, I'm wary of changing these because of some examples:
If we can ensure that changing select constraints in the Provisioner will not impact any workloads, this is a great idea, but it might require some more thought and validation. Additionally, if we end up implementing this for only a small subset of constraints, it could surprise a lot of people, as it would be a fairly niche case. Is there a particular set of constraints listed in the links that you think would be non-breaking and useful for your use-case? In addition, as we explore and develop node upgrades, you'll definitely be able to configure node-provisioner conformance in some way.
My current thinking is that one should reconcile a node by deleting it. That lets the eviction process control the disruption to the workloads running on the node. There might be some things that could be reconciled by modifying the node to bring it into conformance, but that could have unexpected impact on workloads. It is at best an optimization. The only one of my use cases that could be optimized that way would be addition or correction of cost control tags. That optimization wouldn't seem worthwhile, especially since Karpenter would have to also change the tags on the instance's volumes.

I do think it's worthwhile to separate the problems of identifying (and possibly labeling) nodes that are out of spec versus the rate at which one reconciles them.
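To make that identify-versus-reconcile split concrete, here is a rough sketch of the labeling half only, using client-go. It is not how Karpenter implements anything today: the `example.com/out-of-spec` marker label, the `default` provisioner name, and the labels-only drift check are all placeholder assumptions; a real check would also cover requirements, selectors, taints, launch templates, and tags.

```go
// Hypothetical sketch: mark nodes that no longer match their provisioner's
// desired labels, leaving the actual deletion (and its rate) to a separate,
// rate-limited loop or to external orchestration.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// desiredLabels stands in for the labels the Provisioner currently specifies.
var desiredLabels = map[string]string{"team": "platform"}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Only consider nodes created by a given provisioner.
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "karpenter.sh/provisioner-name=default",
	})
	if err != nil {
		panic(err)
	}

	for _, node := range nodes.Items {
		if !outOfSpec(node.Labels) {
			continue
		}
		// Mark the node; whoever owns deprovisioning decides when to delete it.
		patch := []byte(`{"metadata":{"labels":{"example.com/out-of-spec":"true"}}}`)
		if _, err := client.CoreV1().Nodes().Patch(ctx, node.Name,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
			fmt.Printf("failed to label %s: %v\n", node.Name, err)
		}
	}
}

// outOfSpec reports whether the node is missing any of the desired labels.
func outOfSpec(nodeLabels map[string]string) bool {
	for k, v := range desiredLabels {
		if nodeLabels[k] != v {
			return true
		}
	}
	return false
}
```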
Hmm
I don't see a need for a minimum amount of time between when a node goes out of spec and when it can be deleted, so I don't see a use for such a TTL here. While PodDisruptionBudget has some issues, when used correctly it is generally capable of preventing workload disruption.
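For context on why PodDisruptionBudgets can carry that load: when a drain goes through the Eviction API, the API server refuses any eviction that would violate a matching budget, so the drain loop has to back off and retry. A minimal sketch, assuming a reasonably recent client-go; the pod name and namespace are placeholders.

```go
// Evict a single pod via the Eviction API so that PodDisruptionBudgets are honored.
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pod", Namespace: "default"},
	}
	// The request fails (rather than disrupting the workload) if a PDB would be
	// violated; a node-drain loop can treat that as a signal to retry later.
	if err := client.CoreV1().Pods("default").EvictV1(context.Background(), eviction); err != nil {
		fmt.Println("eviction blocked or failed:", err)
	}
}
```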
This is a great point -- a TTL makes no sense here. It would be nice to toggle this feature somehow. Further, I think NodeDisruptionBudgets could really help in this case.
I think one question is whether, presuming adequate rate limiting is implemented, treating out-of-spec nodes differently than expired nodes is desirable. What would be the use case for only reconciling expired nodes or reconciling them at a different rate than out-of-spec nodes? I can't find an issue for NodeDisruptionBudgets; perhaps we should create one. Or maybe it's in scope of #1018? What would be the use case for having the selector of a NodeDisruptionBudget be different than the scope of a Provisioner?
We've been thinking about decoupling provisioning and deprovisioning decisions. It plays into the consolidation work, and into whether consolidation should happen across provisioners. I think it probably should. By using node selectors, you could do cool things like controlling parallel node termination per zone, even though your provisioners wouldn't be modeled per zone.
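As a toy illustration of that per-zone idea (an assumption-laden sketch, not a proposed design), grouping nodes by their `topology.kubernetes.io/zone` label and giving each zone its own semaphore is enough to cap concurrent terminations per zone; `deleteNode` is a placeholder.

```go
// Cap concurrent node terminations at maxPerZone within each availability zone.
package main

import (
	"fmt"
	"sync"
)

const maxPerZone = 1 // at most one node terminating per zone at a time

// deleteNode stands in for whatever actually cordons, drains, and deletes.
func deleteNode(name string) { fmt.Println("deleting", name) }

func main() {
	// node name -> zone, as read from each node's topology.kubernetes.io/zone label.
	nodes := map[string]string{
		"node-a": "us-west-2a", "node-b": "us-west-2a", "node-c": "us-west-2b",
	}

	// One semaphore per zone, sized to the allowed parallelism.
	sem := map[string]chan struct{}{}
	for _, zone := range nodes {
		if sem[zone] == nil {
			sem[zone] = make(chan struct{}, maxPerZone)
		}
	}

	var wg sync.WaitGroup
	for name, zone := range nodes {
		wg.Add(1)
		go func(name string, slot chan struct{}) {
			defer wg.Done()
			slot <- struct{}{}        // take the zone's slot
			defer func() { <-slot }() // give it back when done
			deleteNode(name)
		}(name, sem[zone])
	}
	wg.Wait()
}
```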
Does this come under the umbrella of NTH v2 (future), like spot instances, @bwagner5? If Karpenter can tell NTH that a node is out of spec and requires terminating, then NTH can apply a strategy based on the inputs from Karpenter?
I'd like to add my support for a grace period here.
@stevehipwell what is the use case for a grace period?
Grace periods are useful in general. I can specifically see one being used in a true IaC environment where workloads are updated at the same time as node specs, specifically AMI IDs in the launch templates. When updating my workloads I don't want my nodes flipping at the same time, nor do I want to wait for the nodes before finishing the upgrade; this is what my canary clusters are for. This is a real-world pain point I have today.
A
@johngmyers if I'm pushing a set of Helm chart changes, I don't want my nodes updating during that period and muddying the water as to whether the Helm chart changes are good. For example, if my nodes are updating at the same time and my PDBs are restrictive, I might get a failure when I wouldn't get one with stable nodes. This is a current scenario we have with instance refresh.
Closing in favor of the mega issue: #1738
Tell us about your request
Karpenter should reconcile nodes which no longer satisfy their (or a) Provisioner specification by deleting them at an appropriate rate. This rate should normally cause the node to be deleted before it would otherwise expire. In general, a node should be deleted not unduly long after it stops matching the current spec: that is, if it no longer fits within the requirements, subnetSelector, or securityGroupSelector, or no longer matches the taints, labels, kubeletConfiguration, instanceProfile, launchTemplate, tags, or metadataOptions.
Another common case would be if the node is not on the desired Kubernetes version.
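As a minimal illustration of one such check (a sketch under assumptions, not a proposed implementation), the snippet below lists nodes with client-go and flags any whose kubelet version does not match a target; the `v1.24` target is arbitrary, and a real check would also cover the Provisioner fields listed above.

```go
// Flag nodes whose kubelet version differs from the version we want to run.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		// Status.NodeInfo.KubeletVersion is reported like "v1.24.x", so a
		// simple prefix check is enough for this sketch.
		if !strings.HasPrefix(n.Status.NodeInfo.KubeletVersion, "v1.24") {
			fmt.Printf("%s is out of spec: kubelet %s\n", n.Name, n.Status.NodeInfo.KubeletVersion)
		}
	}
}
```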
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
There are numerous situations that can cause a Provisioner specification to change such that preexisting instances no longer satisfy it. Taints and labels may be added or removed, cost control tags may be established or corrected, networks might be under renumbering, AMIs might be updated to deploy security fixes, and so on. Most of these changes need more timeliness than one wants for baseline node expiration.
Karpenter is uniquely positioned for knowing which instances are out of spec. For an external subsystem to determine which nodes need to be reconciled, it would have to duplicate a substantial amount of Karpenter's logic.
Are you currently working around this issue?
Have not deployed Karpenter sufficiently to create a workaround. Am currently solving the general problem by using kOps to deploy and reconcile nodes.
Additional context
#1018 is a related ticket. There are actually two separate issues:
1. Identifying (and possibly labeling) nodes that are out of spec.
2. Controlling the rate at which such nodes are reconciled (deleted).
I believe concerns about (1) causing too much disruption to the cluster are somewhat misplaced: such disruption is not endemic to (1), but instead a consequence of (2) being overly aggressive.
Perhaps this ticket should be (1) and #1018 be (2).
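To make the distinction concrete, here is a minimal sketch of the rate side, (2), assuming a placeholder `deleteNode` function, an identification step that already produced the list of out-of-spec nodes, and an arbitrary budget of one node every 10 minutes via `golang.org/x/time/rate`.

```go
// Drain the backlog of out-of-spec nodes no faster than the configured rate.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// deleteNode stands in for cordon/drain/delete of a single node.
func deleteNode(ctx context.Context, name string) error {
	fmt.Println("deleting", name)
	return nil
}

func main() {
	ctx := context.Background()
	outOfSpec := []string{"node-a", "node-b", "node-c"} // from the identification step

	// At most one reconciliation every 10 minutes, with no bursting beyond one.
	limiter := rate.NewLimiter(rate.Every(10*time.Minute), 1)
	for _, name := range outOfSpec {
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled
		}
		if err := deleteNode(ctx, name); err != nil {
			fmt.Println("failed to delete", name, ":", err)
		}
	}
}
```

With a split like this, concerns about disruption become a question of tuning the limiter (or a disruption budget), not of whether out-of-spec nodes get identified at all.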