chore: add initialization buffer to disruptable nodes #926
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: njtran. The full list of commands accepted by this bot can be found here. The pull request process is described here.
```go
// This is critical to ensure that nodes that are created as part of a replacement aren't
// chosen for disruption as soon as they're initialized. We only need a small time frame
// to get one pod nomination event, as that will block candidacy further.
cond := in.NodeClaim.StatusConditions().GetCondition(v1beta1.Initialized)
```
You need to add this check for emptiness as well. Otherwise, we're going to mark NodeClaims as empty even while they're within this buffer.
This is an interesting point. On one hand, we're expecting nodes to get pods scheduled onto them soon, which could result in a redundant double patch (mark as empty, then mark as not empty). On the other hand, the 30s is a heuristic for how long we expect pod scheduling to take. Since we're using the initialization buffer to hold off on disrupting, I'm not sure I agree with also holding off on identifying status conditions of individual NodeClaims. It may be preferable to let the emptiness condition progress while the buffer is running.
```diff
@@ -221,6 +222,19 @@ func (in *StateNode) Initialized() bool {
 	return true
 }

+// Disruptable returns if a node has been initialized long enough for Karpenter to disrupt it
+func (in *StateNode) Disruptable(clock clock.Clock) bool {
```
I like the naming, but because we are going to use this with emptiness, I wonder if we have to rename this to something more verbose, like `WithinInitializationWindow()`.
I can see the case for this if we have more Disruptable requirements. Not sure I understand why reasoning about emptiness motivates you to think of another name.
```diff
-	// skip candidates that aren't initialized
-	if !node.Initialized() {
+	// skip candidates that haven't been initialized for a period of time
+	if !node.Disruptable(clk) {
```
I like the idea of introducing Disruptable. Should it eventually be controllable beyond this static 30s value?
We have the ability to ignore nodes for disruption based on the DoNotDisrupt annotation, but there is no temporal control there.
#920 (comment) mentions an issue where a 3rd-party autoscaler prevents handing the instances off to Karpenter because it will disrupt too quickly.
I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life" or a concept like this.
Not saying it's something we need to solve in this PR, just food for thought.
> I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life"
We have this as an annotation that can get propagated onto the Node from the NodeClaim today. Seems like this might be a really nice place to do #752 and push the `karpenter.sh/do-not-disrupt` annotation, which could be configured at the Pod spec level or at the NodeClaim level (which could be passed down from NodePool `spec.template.metadata.annotations`), so you could do something like:

```yaml
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: 10m
```
> Should it eventually be controllable beyond this static 30s value

I think the annotation handling above gives us a nice path to this; however, the Node `karpenter.sh/do-not-disrupt` min lifetime differs a bit from the InitializationBuffer -- both in implementation and in what it should be checking against. The `karpenter.sh/do-not-disrupt` annotation with a time should compare the current time to the creation timestamp of that "thing" (whether that be a Pod or a NodeClaim) and check whether the time since its creation timestamp is within the grace period associated with the annotation. In the case of the initialization buffer, I think we always need this to be a static value based off of the initialization time.
The original intention of the initialization operation was to know when Karpenter could start to deprovision nodes and when nodes could be considered "healthy" by the system. One issue here is that after a node becomes "healthy" enough for pods to schedule to it, pods may not schedule to it immediately (the scheduler needs some time to see the pods and the new node and bind the pods, and our watch on the nodes and pods needs to see the binding). If someone has their `consolidationPolicy` set to `WhenEmpty` and their `consolidateAfter` value set to `0s`, now we have a race: we'll frequently try to terminate new nodes that we create (which are valid for pods) because we see that an initialized node doesn't have any pods scheduled to it (but might 2-5s later).

Our current solve for that was to use scheduler nomination to know that pods were going to land on nodes. This worked okay, but 1) it's more complicated than a flat amount of time and 2) we didn't extend that scheduler nomination when we were launching nodes for disruption. This means that any nodes that we were launching as replacements wouldn't know that they were having pods scheduled against them, which could be a problem for the race conditions stated above.

Given all this, it seems we'll always need a flat buffer on top of our initialization time to ensure that we aren't racing against the kube-scheduler.
Well said!
Closing after discussion offline with @jonathan-innis to go another way to solve the core problem.
Fixes #N/A
Description
Adds a 30s initialization buffer into candidacy for nodes for disruption. Only nodes that have been initialized for 30 seconds can be considered candidates for disruption, ensuring that, since we're able to execute multiple commands in parallel, we don't disrupt replacement nodes from a previous action in a future action.
How was this change tested?
make presubmit
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.