
Reconcile nodes that are out of spec #1457

Closed
johngmyers opened this issue Mar 2, 2022 · 21 comments
Labels
feature New feature or request

Comments

@johngmyers

Tell us about your request

Karpenter should reconcile nodes which no longer satisfy their (or a) Provisioner specification by deleting them at an appropriate rate. This rate should normally cause the node to be deleted before it would otherwise expire.

In general, a node should be deleted not unduly long after it no longer matches the current spec: for example, when it no longer fits within the requirements, subnetSelector, or securityGroupSelector, or no longer matches the taints, labels, kubeletConfiguration, instanceProfile, launchTemplate, tags, or metadataOptions.

Another common case would be if the node is not on the desired Kubernetes version.
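Roughly, the fields listed above live in a Provisioner along these lines (an illustrative sketch of the v1alpha5 API; names and values are placeholders, not a real configuration):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:                      # scheduling requirements nodes must satisfy
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  taints:                            # taints applied to nodes at creation
    - key: example.com/dedicated
      value: workers
      effect: NoSchedule
  labels:                            # labels applied to nodes at creation
    team: platform
  kubeletConfiguration:
    clusterDNS: ["10.0.0.10"]
  provider:                          # AWS-specific constraints
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
    instanceProfile: KarpenterNodeInstanceProfile
    tags:
      cost-center: "1234"
    metadataOptions:
      httpTokens: required
```

Editing any of these fields can leave already-running nodes out of spec.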

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

There are numerous situations that can cause a Provisioner specification to change such that preexisting instances no longer satisfy it. Taints and labels may be added or removed, cost control tags may be established or corrected, networks might be under renumbering, AMIs might be updated to deploy security fixes, and so on.

Most of these changes need more timeliness than one wants for baseline node expiration.

Karpenter is uniquely positioned to know which instances are out of spec. For an external subsystem to determine which nodes need to be reconciled, it would have to duplicate a substantial amount of Karpenter's logic.

Are you currently working around this issue?

Have not deployed Karpenter sufficiently to create a workaround. Am currently solving the general problem by using kOps to deploy and reconcile nodes.

Additional context

#1018 is a related ticket. There are actually two separate issues:

  1. identifying nodes as being out of spec and thus needing reconciliation
  2. the rate at which such nodes (and probably also expired nodes) are to be reconciled

I believe concerns about (1) causing too much disruption to the cluster are somewhat misplaced: such disruption is not endemic to (1), but instead a consequence of (2) being overly aggressive.

Perhaps this ticket should be (1) and #1018 be (2).

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@johngmyers johngmyers added the feature New feature or request label Mar 2, 2022
@njtran
Contributor

njtran commented Mar 2, 2022

Thanks for opening the issue.

#1018 is a related ticket. There are actually two separate issues:

  1. identifying nodes as being out of spec and thus needing reconciliation
  2. the rate at which such nodes (and probably also expired nodes) are to be reconciled

I believe concerns about (1) causing too much disruption to the cluster are somewhat misplaced: such disruption is not endemic to (1), but instead a consequence of (2) being overly aggressive.

I believe the linked issue above encompasses both problems here, and we've definitely heard this multiple times. We're exploring better mechanisms for node upgrades, which can include this issue.

The main point lies here in the other similar issue as mentioned by Ellis: #1430 (comment)

This is not the semantic of the provisioner. We're not a node group abstraction and don't reconcile this type of field. Doing so would be pretty disruptive and impactful to cluster capacity. Instead, we're built on an eventually consistent model, where nodes will converge to the current specification. This can be accelerated via expirationTTLs. If you want to build your own rollout orchestration on top of Karpenter, we're highly supportive of this and our finalizer enables easy integration for cordon/drain.
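For reference, the acceleration Ellis mentions comes from the expiration TTL on the Provisioner, roughly like this (value illustrative):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Nodes older than this are drained and replaced regardless of whether
  # they still match the current spec; there is no shorter path today for
  # nodes that have drifted from the spec.
  ttlSecondsUntilExpired: 2592000  # 30 days
```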

I'm going to close this issue in favor of tracking it in #1018. Feel free to add more about your use-case. This is an important issue that we should spend time on and design properly, since it changes the way provisioners think about the world.

@njtran njtran closed this as completed Mar 2, 2022
@johngmyers
Author

I was specifically asked in #1430 to create a new issue, so I'm a bit confused. I think I've addressed many points in that comment: reconciling is not itself disruptive/impactful, expirationTTLs are not sufficient, and orchestration on top of Karpenter would be ill-equipped to determine which nodes need reconciliation.

@johngmyers
Author

I also challenge whether this is a duplicate of #1018. One possible result would be for Karpenter to identify and label nodes that are out of spec, leaving it to external orchestration to then reconcile them.
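For example, Karpenter could mark such nodes with something like the following (the label key here is purely hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.us-west-2.compute.internal
  labels:
    # Hypothetical label: an external orchestrator could select on it and
    # drive the cordon/drain/delete at whatever rate it chooses.
    karpenter.sh/out-of-spec: "true"
```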

@ellistarn ellistarn reopened this Mar 2, 2022
@ellistarn
Contributor

ellistarn commented Mar 2, 2022

I was specifically asked in #1430 to create a new issue, so I'm a bit confused.

Sorry @johngmyers, we got our wires crossed. I think it's a good idea to track this request here. It's sufficiently different from #1018.

@njtran
Contributor

njtran commented Mar 2, 2022

Ah sorry about that. Didn't quite communicate well on our end.

@johngmyers
Author

At least it's eventually consistent :-)

@ellistarn
Contributor

The Tarn Tran train sometimes goes off the rails (trails?).

@njtran
Contributor

njtran commented Mar 3, 2022

Well now that I understand, thanks for cutting this.

This is a pretty interesting idea. Although the Provisioner is mutable, Karpenter doesn't modify any existing nodes it has created, and only deletes them under certain circumstances. It would be great if nodes could conform to the provisioner spec, provided we can guarantee it won't break existing scheduling constraints or other workloads.

Here are the places where modifying the Provisioner may affect the nodes it has already provisioned. The ProvisionerSpec.Constraints are separated into Kubernetes constraints and AWS constraints.

Right off the bat, I'm wary to change these because of some examples:

  • if node taints change unexpectedly, we can end up evicting pods that don't tolerate the new taints (see the sketch after this list)
  • if node labels change unexpectedly, pods or other resources that depend on node labels for discovery or scheduling might break
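For instance (an illustrative sketch), a pod tolerating the taint its node was created with would be affected if that taint key later changed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
  tolerations:
    # Tolerates the taint the Provisioner applied when the node was created.
    # If the Provisioner's taint key later changed (say, to example.com/dedicated-v2)
    # and the node were mutated in place, a NoExecute taint would evict this pod
    # even though nothing about the workload itself changed.
    - key: example.com/dedicated
      operator: Equal
      value: workers
      effect: NoExecute
```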

If we can ensure that changing select constraints in the Provisioner will not impact any workloads, this is a great idea, but it might require some more thought and validation. Additionally, if we end up implementing this behavior for only a small subset of constraints, it could surprise a lot of people, as it's a niche case to consider. Is there a particular set of constraints listed in the links that you think would be non-breaking and useful for your use-case?

In addition, as we explore and develop node upgrades, you'll definitely be able to configure node-provisioner conformance in some way.

@johngmyers
Author

My current thinking is that one should reconcile a node by deleting it. That lets the eviction process control the disruption to the workloads running on the node.

There might be some things that could be reconciled by modifying the node to bring it into conformance, but that could have unexpected impact on workloads. It is at best an optimization. The only one of my use cases that could be optimized that way would be addition or correction of cost control tags. That optimization wouldn't seem worthwhile, especially since Karpenter would have to also change the tags on the instance's volumes.

I do think it's worthwhile to separate the problems of identifying (and possibly labeling) nodes that are out of spec versus the rate at which one reconciles them.

@ellistarn
Contributor

ellistarn commented Mar 3, 2022

There might be some things that could be reconciled by modifying the node to bring it into conformance

Hmm ttlSecondsAfterNonconformant? Our current semantic of "do nothing if undefined" would work nicely here.
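Something like this, following the existing TTL pattern (field name purely hypothetical):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical: if set, nodes that no longer match the spec are deleted this
  # many seconds after being detected as nonconformant; if undefined, Karpenter
  # does nothing, matching our existing TTL semantics.
  ttlSecondsAfterNonconformant: 600
```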

@johngmyers
Author

I don't see a need for a minimum amount of time between when a node goes out of spec and when it can be deleted, so I don't see a use for ttlSecondsAfterNonconformant. When a (potentially large) number of nodes go out of spec, Karpenter shouldn't wait a while and then delete them all at the same time; it should meter the deletions in order to constrain the rate of change going into the Kubernetes API and the networking stack. The same goes for node expiration; while it is less likely that a large number of nodes would expire around the same time, it is still possible.

While PodDisruptionBudget has some issues, when used correctly it is generally capable of preventing workload disruption.
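For example, a budget like this (illustrative) keeps an eviction-based drain from taking down too many replicas at once:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  # The eviction API refuses to evict a pod if doing so would drop the
  # number of ready replicas below this value.
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```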

@ellistarn
Contributor

This is a great point -- a TTL makes no sense here. It would be nice to toggle this feature somehow. Further, I think NodeDisruptionBudgets could really help in this case.

@johngmyers
Author

johngmyers commented Mar 4, 2022

I think one question is whether, presuming adequate rate limiting is implemented, treating out-of-spec nodes differently than expired nodes is desirable. What would be the use case for only reconciling expired nodes or reconciling them at a different rate than out-of-spec nodes?

I can't find an issue for NodeDisruptionBudgets; perhaps we should create one. Or maybe it's in scope of #1018? What would be the use case for having the selector of a NodeDisruptionBudget be different than the scope of a Provisioner?

@ellistarn
Contributor

What would be the use case for having the selector of a NodeDisruptionBudget be different than the scope of a Provisioner?

We've been thinking about decoupling provisioning and deprovisioning decisions. It plays into the consolidation work, and into whether consolidation should happen across provisioners; I think it probably should. By using node selectors, you could do cool things like controlling parallel node termination per zone, without having to model your provisioners per zone.
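Very roughly, something like this (a purely hypothetical resource, not an existing API):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: NodeDisruptionBudget          # hypothetical kind
metadata:
  name: one-at-a-time-per-zone
spec:
  # Hypothetical: cap parallel voluntary terminations among the selected
  # nodes, independent of which provisioner created them.
  maxUnavailable: 1
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-west-2a
```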

@stevehipwell
Contributor

Does this come under the umbrella of NTH v2 (future), like spot instances, @bwagner5? If Karpenter can tell NTH that a node is out of spec and requires terminating, then NTH can apply a strategy based on the inputs from Karpenter?

@stevehipwell
Contributor

I'd like to add my support for ttlSecondsAfterNonconformant to provide a grace period. I'd also request a setting such as nonconformantAction which could be either terminate or taint, related to kubernetes-sigs/karpenter#752.
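Something along these lines (both fields hypothetical):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical settings: wait out a grace period after a node is detected
  # as nonconformant, then either terminate it or just taint it so another
  # process can handle it.
  ttlSecondsAfterNonconformant: 3600
  nonconformantAction: taint  # or: terminate
```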

@johngmyers
Author

@stevehipwell what is the use case for a grace period?

@stevehipwell
Contributor

Grace periods are useful in general. I can specifically see it being used in a true IaC environment where workloads are updated at the same time as node specs, specifically AMI IDs in launch templates. When updating my workloads I don't want my nodes flipping at the same time, nor do I want to wait for the nodes before finishing the upgrade; that's what my canary clusters are for. This is a real-world pain point I have today.

@johngmyers
Author

A ttlSecondsAfterNonconformant would be a minimum time before Karpenter could start reconciling any nodes. I don't see how it would have anything to do with workloads waiting for nodes. Even if one wanted workloads to update before starting to reconcile nodes, I don't see how such a TTL could reliably ensure that.

@stevehipwell
Contributor

@johngmyers if I'm pushing a set of Helm chart changes, I don't want my nodes updating during that period and muddying the water as to whether the Helm chart changes are good. For example, if my nodes are updating at the same time and my PDBs are restrictive, I might get a failure that I wouldn't get with stable nodes. This is a current scenario we have with instance refresh.

@ellistarn
Contributor

Closing in favor of the mega issue: #1738
