
Reconcile nodes that are out of spec #1457

Closed
johngmyers opened this issue Mar 2, 2022 · 21 comments
Labels
feature New feature or request

Comments

@johngmyers

Tell us about your request

Karpenter should reconcile nodes which no longer satisfy their (or a) Provisioner specification by deleting them at an appropriate rate. This rate should normally cause the node to be deleted before it would otherwise expire.

In general, a node should be deleted not unduly long after it no longer matches the current spec: for example, when it no longer fits within the requirements, subnetSelector, or securityGroupSelector, or no longer matches the taints, labels, kubeletConfiguration, instanceProfile, launchTemplate, tags, or metadataOptions.

Another common case would be if the node is not on the desired Kubernetes version.
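Roughly, the fields listed above live in a Provisioner along these lines (an illustrative sketch of the v1alpha5 API; names and values are placeholders, not a real configuration):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:                      # scheduling requirements nodes must satisfy
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  taints:                            # taints applied to nodes at creation
    - key: example.com/dedicated
      value: workers
      effect: NoSchedule
  labels:                            # labels applied to nodes at creation
    team: platform
  kubeletConfiguration:
    clusterDNS: ["10.0.0.10"]
  provider:                          # AWS-specific constraints
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
    instanceProfile: KarpenterNodeInstanceProfile
    tags:
      cost-center: "1234"
    metadataOptions:
      httpTokens: required
```

Editing any of these fields can leave already-running nodes out of spec.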

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

There are numerous situations that can cause a Provisioner specification to change such that preexisting instances no longer satisfy it. Taints and labels may be added or removed, cost control tags may be established or corrected, networks might be under renumbering, AMIs might be updated to deploy security fixes, and so on.

Most of these changes need more timeliness than one wants for baseline node expiration.

Karpenter is uniquely positioned to know which instances are out of spec. For an external subsystem to determine which nodes need to be reconciled, it would have to duplicate a substantial amount of Karpenter's logic.

Are you currently working around this issue?

Have not deployed Karpenter sufficiently to create a workaround. Am currently solving the general problem by using kOps to deploy and reconcile nodes.

Additional context

#1018 is a related ticket. There are actually two separate issues:

  1. identifying nodes as being out of spec and thus needing reconciliation
  2. the rate at which such nodes (and probably also expired nodes) are to be reconciled

I believe concerns about (1) causing too much disruption to the cluster are somewhat misplaced: such disruption is not endemic to (1), but instead a consequence of (2) being overly aggressive.

Perhaps this ticket should be (1) and #1018 be (2).

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@johngmyers johngmyers added the feature New feature or request label Mar 2, 2022
@njtran
Contributor

njtran commented Mar 2, 2022

Thanks for opening the issue.

#1018 is a related ticket. There are actually two separate issues:

  1. identifying nodes as being out of spec and thus needing reconciliation
  2. the rate at which such nodes (and probably also expired nodes) are to be reconciled

I believe concerns about (1) causing too much disruption to the cluster are somewhat misplaced: such disruption is not endemic to (1), but instead a consequence of (2) being overly aggressive.

I believe the linked issue above encompasses both problems here, and we've definitely heard this multiple times. We're exploring better mechanisms for node upgrades, which can include this issue.

The main point lies here in the other similar issue as mentioned by Ellis: #1430 (comment)

This is not the semantic of the provisioner. We're not a node group abstraction and don't reconcile this type of field. Doing so would be pretty disruptive and impactful to cluster capacity. Instead, we're built on an eventually consistent model, where nodes will converge to the current specification. This can be accelerated via expirationTTLs. If you want to build your own rollout orchestration on top of Karpenter, we're highly supportive of this and our finalizer enables easy integration for cordon/drain.
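For reference, the acceleration Ellis mentions comes from the expiration TTL on the Provisioner, roughly like this (value illustrative):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Nodes older than this are drained and replaced regardless of whether
  # they still match the current spec; there is no shorter path today for
  # nodes that have drifted from the spec.
  ttlSecondsUntilExpired: 2592000  # 30 days
```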

I'm going to close this issue in favor of tracking it in #1018. Feel free to add more about your use-case. This is an important issue that we should spend time on and design properly, since it changes the way provisioners think about the world.

@njtran njtran closed this as completed Mar 2, 2022
@johngmyers
Author

I was specifically asked in #1430 to create a new issue, so I'm a bit confused. I think I've addressed many points in that comment: reconciling is not itself disruptive/impactful, expirationTTLs are not sufficient, and orchestration on top of Karpenter would be ill-equipped to determine which nodes need reconciliation.

@johngmyers
Author

I also challenge whether this is a duplicate of #1018. One possible result would be for Karpenter to identify and label nodes that are out of spec, leaving it to external orchestration to then reconcile them.
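For example, Karpenter could mark such nodes with something like the following (the label key here is purely hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.us-west-2.compute.internal
  labels:
    # Hypothetical label: an external orchestrator could select on it and
    # drive the cordon/drain/delete at whatever rate it chooses.
    karpenter.sh/out-of-spec: "true"
```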

@ellistarn ellistarn reopened this Mar 2, 2022
@ellistarn
Contributor

ellistarn commented Mar 2, 2022

I was specifically asked in #1430 to create a new issue, so I'm a bit confused.

Sorry @johngmyers, we got our wires crossed. I think it's a good idea to track this request here. It's sufficiently different from #1018.

@njtran
Contributor

njtran commented Mar 2, 2022

Ah sorry about that. Didn't quite communicate well on our end.

@johngmyers
Author

At least it's eventually consistent :-)

@ellistarn
Contributor

The Tarn Tran train sometimes goes off the rails (trails?).

@njtran
Contributor

njtran commented Mar 3, 2022

Well now that I understand, thanks for cutting this.

This is a pretty interesting idea. Although the Provisioner is mutable, Karpenter doesn't modify any existing nodes it has created, and only deletes them under certain circumstances. It would be great if nodes could conform to the provisioner spec, provided we can guarantee it won't break existing scheduling constraints or other workloads.

Here are the places where modifying the Provisioner may affect the nodes it has already provisioned. The ProvisionerSpec.Constraints are separated into Kubernetes constraints and AWS constraints.

Right off the bat, I'm wary to change these because of some examples:

  • if node taints change unexpectedly, we can end up evicting pods that don't tolerate the new taints (see the sketch after this list)
  • if node labels change unexpectedly, pods or other resources that depend on node labels for discovery or scheduling might break
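For instance (an illustrative sketch), a pod tolerating the taint its node was created with would be affected if that taint key later changed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
  tolerations:
    # Tolerates the taint the Provisioner applied when the node was created.
    # If the Provisioner's taint key later changed (say, to example.com/dedicated-v2)
    # and the node were mutated in place, a NoExecute taint would evict this pod
    # even though nothing about the workload itself changed.
    - key: example.com/dedicated
      operator: Equal
      value: workers
      effect: NoExecute
```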

If we can ensure that changing select constraints in the Provisioner will not impact any workloads, this is a great idea, but it might require some more thought and validation. Additionally, if we end up implementing this behavior for only a small subset of constraints, it could surprise a lot of people, as it's a niche case to consider. Is there a particular set of constraints listed in the links that you think would be non-breaking and useful for your use-case?

In addition, as we explore and develop node upgrades, you'll definitely be able to configure node-provisioner conformance in some way.

@johngmyers
Author

My current thinking is that one should reconcile a node by deleting it. That lets the eviction process control the disruption to the workloads running on the node.

There might be some things that could be reconciled by modifying the node to bring it into conformance, but that could have unexpected impact on workloads. It is at best an optimization. The only one of my use cases that could be optimized that way would be addition or correction of cost control tags. That optimization wouldn't seem worthwhile, especially since Karpenter would have to also change the tags on the instance's volumes.

I do think it's worthwhile to separate the problems of identifying (and possibly labeling) nodes that are out of spec versus the rate at which one reconciles them.

@ellistarn
Contributor

ellistarn commented Mar 3, 2022

There might be some things that could be reconciled by modifying the node to bring it into conformance

Hmm ttlSecondsAfterNonconformant? Our current semantic of "do nothing if undefined" would work nicely here.
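Something like this, following the existing TTL pattern (field name purely hypothetical):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical: if set, nodes that no longer match the spec are deleted this
  # many seconds after being detected as nonconformant; if undefined, Karpenter
  # does nothing, matching our existing TTL semantics.
  ttlSecondsAfterNonconformant: 600
```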

@johngmyers
Author

I don't see a need for a minimum amount of time between when a node goes out of spec and when it can be deleted, so I don't see a use for ttlSecondsAfterNonconformant. When a (potentially large) number of nodes go out of spec, Karpenter shouldn't wait a while and then delete them all at the same time; it should meter the deletions in order to constrain the rate of change going into the Kubernetes API and the networking stack. The same goes for node expiration; while it is less likely that a large number of nodes would expire around the same time, it is still possible.

While PodDisruptionBudget has some issues, when used correctly it is generally capable of preventing workload disruption.
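For example, a budget like this (illustrative) keeps an eviction-based drain from taking down too many replicas at once:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  # The eviction API refuses to evict a pod if doing so would drop the
  # number of ready replicas below this value.
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```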

@ellistarn
Contributor

This is a great point -- a TTL makes no sense here. It would be nice to toggle this feature somehow. Further, I think NodeDisruptionBudgets could really help in this case.

@johngmyers
Author

johngmyers commented Mar 4, 2022

I think one question is whether, presuming adequate rate limiting is implemented, treating out-of-spec nodes differently than expired nodes is desirable. What would be the use case for only reconciling expired nodes or reconciling them at a different rate than out-of-spec nodes?

I can't find an issue for NodeDisruptionBudgets; perhaps we should create one. Or maybe it's in scope of #1018? What would be the use case for having the selector of a NodeDisruptionBudget be different than the scope of a Provisioner?

@ellistarn
Contributor

What would be the use case for having the selector of a NodeDisruptionBudget be different than the scope of a Provisioner?

We've been thinking about decoupling provisioning and deprovisioning decisions. It plays into the consolidation work, and into whether consolidation should happen across provisioners; I think it probably should. By using node selectors, you could do cool things like controlling parallel node termination per zone, without having to model your provisioners per zone.
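Very roughly, something like this (a purely hypothetical resource, not an existing API):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: NodeDisruptionBudget          # hypothetical kind
metadata:
  name: one-at-a-time-per-zone
spec:
  # Hypothetical: cap parallel voluntary terminations among the selected
  # nodes, independent of which provisioner created them.
  maxUnavailable: 1
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-west-2a
```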

@stevehipwell
Contributor

Does this come under the umbrella of NTH v2 (future), like spot instances, @bwagner5? If Karpenter can tell NTH that a node is out of spec and requires terminating, then NTH can apply a strategy based on the inputs from Karpenter?

@stevehipwell
Contributor

I'd like to add my support for ttlSecondsAfterNonconformant to provide a grace period. I'd also request a setting such as nonconformantAction which could be either terminate or taint, related to kubernetes-sigs/karpenter#752.
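Something along these lines (both fields hypothetical):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical settings: wait out a grace period after a node is detected
  # as nonconformant, then either terminate it or just taint it so another
  # process can handle it.
  ttlSecondsAfterNonconformant: 3600
  nonconformantAction: taint  # or: terminate
```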

@johngmyers
Author

@stevehipwell what is the use case for a grace period?

@stevehipwell
Contributor

Grace periods are useful in general. I can specifically see it being used in a true IaC environment where workloads are updated at the same time as node specs, specifically AMI IDs in launch templates. When updating my workloads I don't want my nodes flipping at the same time, nor do I want to wait for the nodes before finishing the upgrade; that's what my canary clusters are for. This is a real-world pain point I have today.

@johngmyers
Author

A ttlSecondsAfterNonconformant would be a minimum time before Karpenter could start reconciling any nodes. I don't see how it would have anything to do with workloads waiting for nodes. Even if one wanted workloads to update before starting to reconcile nodes, I don't see how such a TTL could reliably ensure that.

@stevehipwell
Contributor

@johngmyers if I'm pushing a set of Helm chart changes, I don't want my nodes updating during that period and muddying the water as to whether the Helm chart changes are good. For example, if my nodes are updating at the same time and my PDBs are restrictive, I might get a failure that I wouldn't get with stable nodes. This is a current scenario we have with instance refresh.

@ellistarn
Contributor

Closing in favor of the mega issue: #1738
