chore: add initialization buffer to disruptable nodes #926
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: njtran. The full list of commands accepted by this bot can be found here. The pull request process is described here.
```go
// This is critical to ensure that nodes that are created as part of a replacement aren't
// chosen for disruption as soon as they're initialized. We only need a small time frame
// to get one pod nomination event, as that will block candidacy further.
cond := in.NodeClaim.StatusConditions().GetCondition(v1beta1.Initialized)
```
You need to add this check for emptiness as well. Otherwise, we're going to mark NodeClaims as empty even while they're within this buffer.
This is an interesting point. On one hand, we're expecting nodes to get pods scheduled onto them soon, which could result in a redundant double patch (mark as empty, then mark as not empty). On the other hand, the 30s is a heuristic for how long we expect pod scheduling to take. Since we're using the initialization buffer to hold off on disrupting, I'm not sure I agree with also holding off on identifying status conditions of individual NodeClaims. It may be preferable to let the emptiness condition progress while the buffer is running.
```diff
@@ -221,6 +222,19 @@ func (in *StateNode) Initialized() bool {
 	return true
 }

+// Disruptable returns if a node has been initialized long enough for Karpenter to disrupt it
+func (in *StateNode) Disruptable(clock clock.Clock) bool {
```
I like the naming, but because we are going to use this with emptiness, I wonder if we have to rename this to something more verbose, like `WithinInitializationWindow()`.
I can see the case for this if we have more Disruptable requirements. Not sure I understand why reasoning about emptiness motivates you to think of another name.
```diff
-	// skip candidates that aren't initialized
-	if !node.Initialized() {
+	// skip candidates that haven't been initialized for a period of time
+	if !node.Disruptable(clk) {
```
I like the idea of introducing Disruptable. Should it eventually be controllable beyond this static 30s value?
We have the ability to ignore nodes for disruption based on the DoNotDisrupt annotation, but there is no temporal control there.
#920 (comment) mentions an issue where a 3rd-party autoscaler prevents handing the instances off to Karpenter because it will disrupt too quickly.
I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life" or a concept like this.
Not saying it's something we need to solve in this PR, just food for thought.
> I wonder if there is room in the disruption controls for something like "do not disrupt within first 5m of life"
We have this as an annotation that can get propagated onto the Node from the NodeClaim today. Seems like this might be a really nice place to do #752 and push the `karpenter.sh/do-not-disrupt` annotation, which could be configured at the Pod spec level or at the NodeClaim level (which could be passed down from NodePool `spec.template.metadata.annotations`), so you could do something like:

```yaml
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: 10m
```
> Should it eventually be controllable beyond this static 30s value

I think the annotation handling above gives us a nice path to this; however, the Node `karpenter.sh/do-not-disrupt` min lifetime differs a bit from the InitializationBuffer -- both in implementation and in what it should be checking against. The `karpenter.sh/do-not-disrupt` annotation with a time should compare the current time to the creation timestamp of that "thing" (whether that be a Pod or a NodeClaim) and check whether the time since its creation timestamp is within the grace period associated with the annotation. In the case of the initialization buffer, I think we always need this to be a static value based off of the initialization time.
The original intention of the initialization operation was to know when Karpenter could start to deprovision nodes and when nodes could be considered "healthy" by the system. One issue here is that after a node becomes "healthy" enough for pods to schedule to it, pods may not schedule to it immediately (the scheduler needs some time to see the pods and the new node and bind the pods, and our watch on the nodes and pods needs to see the binding). If someone has their `consolidationPolicy` set to `WhenEmpty` and their `consolidateAfter` value set to `0s`, now we have a race: we'll frequently try to terminate new nodes that we create (which are valid for pods) because we see that an initialized node doesn't have any pods scheduled to it (but might 2-5s later).

Our current solve for that was to use scheduler nomination to know that pods were going to land on nodes. This worked okay, but 1) it's more complicated than a flat amount of time and 2) we didn't extend that scheduler nomination when we were launching nodes for disruption. This means that any nodes that we were launching as replacements wouldn't know that they were having pods scheduled against them, which could be a problem for the race conditions stated above.

Given all this, it seems we'll always need a flat buffer on top of our initialization time to ensure that we aren't racing against the kube-scheduler.
Well said!
Closing after discussion offline with @jonathan-innis to go another way to solve the core problem.
Fixes #N/A
Description
Adds a 30s initialization buffer into candidacy for nodes for disruption. Only nodes that have been initialized for 30 seconds can be considered candidates for disruption, ensuring that, since we're able to execute multiple commands in parallel, we don't disrupt replacement nodes from a previous action in a future action.
How was this change tested?
make presubmit
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.