
chore: add initialization buffer to disruptable nodes #926

Closed
njtran wants to merge 3 commits

Conversation

njtran (Contributor) commented Jan 6, 2024

Fixes #N/A

Description
Adds a 30s initialization buffer to disruption candidacy. Only nodes that have been initialized for at least 30 seconds can be considered candidates for disruption. Since we're able to execute multiple disruption commands in parallel, this ensures that a later action doesn't disrupt the replacement nodes launched by a previous action.
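
As a self-contained illustration of the check described above (names like initializationBuffer and pastInitializationBuffer are placeholders for this sketch, not the PR's actual identifiers):

    package main

    import (
        "fmt"
        "time"
    )

    // initializationBuffer mirrors the 30s window described above; the name is illustrative.
    const initializationBuffer = 30 * time.Second

    // pastInitializationBuffer reports whether a node that became initialized at
    // initializedAt has aged past the buffer and can be considered for disruption.
    func pastInitializationBuffer(now, initializedAt time.Time) bool {
        return now.Sub(initializedAt) >= initializationBuffer
    }

    func main() {
        initializedAt := time.Now().Add(-10 * time.Second)
        // Prints false: the node has only been initialized for ~10s.
        fmt.Println(pastInitializationBuffer(time.Now(), initializedAt))
    }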

How was this change tested?
make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 6, 2024

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: njtran

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 6, 2024
// This is critical to ensure that nodes that are created as part of a replacement aren't
// chosen for disruption as soon as they're initialized. We only need a small time frame
// to get one pod nomination event, as that will block candidacy further.
cond := in.NodeClaim.StatusConditions().GetCondition(v1beta1.Initialized)

Member:

You need to add this check for emptiness. Otherwise, we are going to mark NodeClaims as empty even while they are still within this buffer.

njtran (Author):

This is an interesting point. On one hand, we're expecting nodes to get pods scheduled onto them soon, which could result in a redundant double patch (mark as empty, then mark as not empty). On the other hand, the 30s is a heuristic for how long we expect this to be good enough for pod scheduling. Since we're using the initialization buffer to hold off on disrupting, I'm not sure I agree with also holding off on identifying status conditions of individual NodeClaims. It may be preferable to let the emptiness condition progress while the buffer elapses.

@@ -221,6 +222,19 @@ func (in *StateNode) Initialized() bool {
	return true
}

// Disruptable returns if a node has been initialized long enough for Karpenter to disrupt it
func (in *StateNode) Disruptable(clock clock.Clock) bool {
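
The hunk above only shows the new function's signature. A plausible body, pieced together from the condition accessor quoted earlier and the 30s buffer in the description (the initializationBuffer constant and the condition-timestamp field access are assumptions, not necessarily what the PR actually does):

    // Sketch only: assumes a package-level buffer constant and that the Initialized
    // status condition exposes its last transition time via these fields.
    const initializationBuffer = 30 * time.Second

    func (in *StateNode) Disruptable(clk clock.Clock) bool {
        if !in.Initialized() {
            return false
        }
        cond := in.NodeClaim.StatusConditions().GetCondition(v1beta1.Initialized)
        if cond == nil {
            return false
        }
        // Only treat the node as a disruption candidate once it has been
        // initialized for at least the full buffer.
        return clk.Since(cond.LastTransitionTime.Inner.Time) >= initializationBuffer
    }
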
Member:

I like the naming, but because we are going to use this with emptiness, I wonder if we have to rename this to something else like "WithinInitializationWindow()" or something more verbose.

njtran (Author):

I can see the case for this if we have more Disruptable requirements. Not sure I understand why reasoning about emptiness motivates you to think of another name.

-// skip candidates that aren't initialized
-if !node.Initialized() {
+// skip candidates that haven't been initialized for a period of time
+if !node.Disruptable(clk) {

Member:

I like the idea of introducing Disruptable. Should it eventually be controllable beyond this static 30s value?

We have the ability to ignore nodes for disruption based on the DoNotDisrupt annotation, but there is no temporal control there.

#920 (comment) mentions an issue where a 3rd-party autoscaler can't hand instances off to Karpenter because Karpenter would disrupt them too quickly.

I wonder if there is room in the disruption controls for something like "do not disrupt within the first 5m of life" or a concept like this.

Not saying it's something we need to solve in this PR, just food for thought.

Member:

> I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life"

We have this as an annotation that can get propagated onto the Node from the NodeClaim today. This seems like a really nice place to do #752 and push a karpenter.sh/do-not-disrupt annotation that can be configured either at the Pod spec level or at the NodeClaim level (which could be passed down from NodePool spec.template.metadata.annotations), so you could do something like:

spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: 10m

> Should it eventually be controllable beyond this static 30s value?

I think the annotation handling above gives us a nice path to this; however, a karpenter.sh/do-not-disrupt minimum lifetime on the Node differs a bit from the InitializationBuffer, both in implementation and in what it should be checking against. A karpenter.sh/do-not-disrupt annotation with a duration should compare the current time to the creation timestamp of that "thing" (whether that's a Pod or a NodeClaim) and check whether the time since creation is still within the grace period set by the annotation. In the case of the initialization buffer, I think we always need this to be a static value based off of the initialization time.
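
A rough sketch of how a duration-valued karpenter.sh/do-not-disrupt annotation might be checked against an object's creation timestamp (the helper and its semantics are hypothetical; today the annotation only supports an indefinite opt-out):

    package disruption // hypothetical placement, for illustration only

    import "time"

    // withinDoNotDisruptWindow treats the annotation value as a grace period and
    // reports whether the object (Pod or NodeClaim) is still too young to disrupt.
    func withinDoNotDisruptWindow(annotations map[string]string, created, now time.Time) bool {
        val, ok := annotations["karpenter.sh/do-not-disrupt"]
        if !ok {
            return false
        }
        gracePeriod, err := time.ParseDuration(val)
        if err != nil {
            // Non-duration values (e.g. "true") keep today's indefinite opt-out behavior.
            return true
        }
        return now.Sub(created) < gracePeriod
    }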

The original intention of the initialization operation was to know when Karpenter could start to deprovision nodes and when nodes could be considered "healthy" by the system. One issue here is that after a node becomes "healthy" for scheduling, pods may not schedule to it immediately (the scheduler needs time to see the pods and the new node and to bind the pods, and our watch on nodes and pods then needs to see the binding). If someone has their consolidationPolicy set to WhenEmpty and their consolidateAfter value set to 0s, we now have a race: we'll frequently try to terminate new nodes that we create (which are valid for pods) because we see that an initialized node doesn't have any pods scheduled to it yet (but might 2-5s later).

Our current fix for that was to use scheduler nomination to know that pods were going to land on nodes. This worked okay, but 1) it's more complicated than a flat amount of time, and 2) we didn't extend that scheduler nomination when we were launching nodes for disruption. This means that any nodes we launched as replacements wouldn't know that pods were going to be scheduled against them, which could be a problem for the race-condition reasons stated above.

Given all this, it seems we'll always need a flat buffer on top of our initialization time to ensure that we aren't racing against the kube-scheduler.

njtran (Author):

Well said!

@njtran njtran changed the title Initializationbuffer chore: add initialization buffer to disruptable nodes Jan 8, 2024

njtran commented Jan 8, 2024

Closing after discussing offline with @jonathan-innis; we'll take another approach to solve the core problem.

@njtran njtran closed this Jan 8, 2024