feat: add NodeRegistrationHealthy status condition to nodepool #1969
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jigisha620. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hi @jigisha620. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 13425130967 (Details)
💛 - Coveralls
Force-pushed from d060e31 to 659e4dd
Force-pushed from 03a7d60 to 3fd43cf
// If the nodeClaim failed to launch/register during the TTL set NodeRegistrationHealthy status condition on
// NodePool to False. If the launch failed get the launch failure reason and message from nodeClaim.
if nodeClaim.StatusConditions().IsTrue(v1.ConditionTypeLaunched) {
    nodePool.StatusConditions().SetFalse(v1.ConditionTypeNodeRegistrationHealthy, "Unhealthy", "Failed to register node")
I think the reason is `RegistrationFailed`. I'm also not sure if instead of that message we should try and make recommendations of things to double check.
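For illustration, a minimal sketch of the suggested change, reusing the `SetFalse(type, reason, message)` call shape from the quoted snippet; the message wording is only an example of the "recommend things to double check" idea, not the PR's final text:

```go
// Sketch only: reason and message wording are illustrative.
nodePool.StatusConditions().SetFalse(v1.ConditionTypeNodeRegistrationHealthy, "RegistrationFailed",
    "Node failed to register within the registration TTL; check the NodeClass bootstrap/user data, networking, and API server reachability")
```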
Force-pushed from 3fd43cf to d202c66
Force-pushed from 5df485a to 67ea6d0
Force-pushed from bc920ec to 7f356d4
Force-pushed from 7f356d4 to a9d685a
Force-pushed from a9d685a to 810ad69
@@ -252,7 +252,7 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
 	})
 	c.recordPodSchedulingUndecidedMetric(pod)
 	// Get the time for when we Karpenter first thought the pod was schedulable. This should be zero if we didn't simulate for this pod.
-	schedulableTime := c.cluster.PodSchedulingSuccessTime(types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace})
+	schedulableTime := c.cluster.PodSchedulingSuccessTime(types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace}, false)
Can we just create a separate function to access the `PodSchedulingNodeRegistrationHealthySuccessTime` or something like this -- I think a boolean here is a bit hard to reason about.
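A hedged sketch of the split the reviewer is suggesting: two intention-revealing accessors instead of one method with a boolean flag. The struct and field names below are stand-ins, not Karpenter's actual cluster-state types:

```go
package state

import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// Cluster is a stand-in for the scheduler's cluster-state store; only the two
// accessors matter here: each call site names the time it wants, so no boolean
// parameter needs decoding at the call site.
type Cluster struct {
	mu                             sync.RWMutex
	podSchedulingSuccessTime       map[types.NamespacedName]time.Time
	podRegistrationHealthDecision  map[types.NamespacedName]time.Time
}

// PodSchedulingSuccessTime returns when the pod was first simulated as schedulable (metrics path).
func (c *Cluster) PodSchedulingSuccessTime(key types.NamespacedName) time.Time {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.podSchedulingSuccessTime[key]
}

// PodSchedulingNodeRegistrationHealthySuccessTime returns the time used by the
// NodeRegistrationHealthy flow, kept as its own method for readability.
func (c *Cluster) PodSchedulingNodeRegistrationHealthySuccessTime(key types.NamespacedName) time.Time {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.podRegistrationHealthDecision[key]
}
```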
@@ -302,6 +302,7 @@ func (s *Scheduler) add(ctx context.Context, pod *corev1.Pod) error {
 	// Pick existing node that we are about to create
 	for _, nodeClaim := range s.newNodeClaims {
 		if err := nodeClaim.Add(pod, s.cachedPodData[pod.UID]); err == nil {
+			s.cluster.MarkPodToNodePoolSchedulingDecision(pod, nodeClaim.Labels[v1.NodePoolLabelKey])
How expensive is it to do one iteration through all the pods at the end? You could just iterate through the results and mark pod scheduling decisions with the NodePool attached in the same place, I think, rather than having to create an internal store to capture this.
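A minimal stand-alone sketch of the alternative being floated: walk the finished scheduling results once and record the pod-to-NodePool decisions there, instead of threading an internal store through `add()`. All type and function names here are illustrative:

```go
package main

import "fmt"

// Stand-in shapes for the scheduler output: each proposed NodeClaim knows its
// NodePool and the pods that were bin-packed onto it.
type pod struct{ Namespace, Name string }

type newNodeClaim struct {
	nodePoolName string
	pods         []pod
}

// markPodSchedulingDecision is a stand-in for the cluster-state call the PR adds.
func markPodSchedulingDecision(p pod, nodePool string) {
	fmt.Printf("%s/%s -> %s\n", p.Namespace, p.Name, nodePool)
}

func main() {
	results := []newNodeClaim{
		{nodePoolName: "default", pods: []pod{{"web", "web-0"}, {"web", "web-1"}}},
		{nodePoolName: "gpu", pods: []pod{{"ml", "trainer-0"}}},
	}
	// One O(pods) pass after scheduling finishes; no per-add bookkeeping needed.
	for _, nc := range results {
		for _, p := range nc.pods {
			markPodSchedulingDecision(p, nc.nodePoolName)
		}
	}
}
```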
// on the NodePool if the nodeClaim fails to launch/register
func (l *Liveness) updateNodePoolRegistrationHealth(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
	if ok && len(nodePoolName) != 0 {
nit: Just check `nodePoolName != ""` -- you don't even have to check `ok`, since if the label doesn't exist then it will just return an empty string -- and `""` is an invalid name anyways.
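For illustration, a tiny runnable example of the behavior the reviewer is relying on: indexing a map with a missing key yields the zero value, so the `ok` check adds nothing here (the literal label key is just for the example):

```go
package main

import "fmt"

func main() {
	labels := map[string]string{} // NodePool label not set
	// Missing keys return "", so a single != "" check covers both "absent" and "empty".
	if name := labels["karpenter.sh/nodepool"]; name != "" {
		fmt.Println("owned by NodePool", name)
	} else {
		fmt.Println("no owning NodePool")
	}
}
```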
nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
if ok && len(nodePoolName) != 0 {
	nodePool := &v1.NodePool{}
	if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
Do you properly handle the NodePool NotFound error?
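One way to answer that, sketched against the quoted snippet; `client.IgnoreNotFound` is from `sigs.k8s.io/controller-runtime/pkg/client`, and the surrounding shape assumes the existing Liveness fields:

```go
// Sketch: if the owning NodePool was already deleted there is nothing to mark,
// so a NotFound from the Get should not surface as a reconcile error.
if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
	return client.IgnoreNotFound(err)
}
```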
// on the NodePool if the nodeClaim that registered is owned by a NodePool
func (r *Registration) updateNodePoolRegistrationHealth(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
	if ok && len(nodePoolName) != 0 {
Same comment as in the liveness controller -- I would just check if the value is not equal to ""
func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.registrationhealth")

	nodeClass := nodepoolutils.GetNodeClassStatusObject(nodePool, c.cloudProvider)
Is it too much to have this helper actually retrieve the NodeClass that the NodePool is referencing for us, rather than just the schema?
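A hedged sketch of the helper shape being suggested, reusing `GetNodeClassStatusObject` from the diff to get an empty object of the right kind and then resolving it from the API server; the function name, the `NodeClassRef` field path, and the error handling are assumptions:

```go
// GetNodeClass resolves the NodeClass object the NodePool references, not just its schema.
// Sketch only: signature and field paths are illustrative.
func GetNodeClass(ctx context.Context, kubeClient client.Client, nodePool *v1.NodePool, cp cloudprovider.CloudProvider) (status.Object, error) {
	nodeClass := GetNodeClassStatusObject(nodePool, cp) // empty object of the referenced kind
	if err := kubeClient.Get(ctx, types.NamespacedName{Name: nodePool.Spec.Template.Spec.NodeClassRef.Name}, nodeClass); err != nil {
		return nil, err
	}
	return nodeClass, nil
}
```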
// If NodeClass/NodePool have been updated then NodeRegistrationHealthy = Unknown
if (nodePool.Status.NodeClassObservedGeneration != nodeClass.GetGeneration()) ||
	(nodePool.Generation != nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).ObservedGeneration) {
`Get` can return `nil` if the condition isn't found -- how are you going to handle that?
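A minimal sketch of one way to guard it, assuming a never-set condition should also resolve to Unknown and that the condition set exposes a `SetUnknown` helper:

```go
// Sketch: treat a missing condition the same as a stale one.
registrationHealthy := nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy)
if registrationHealthy == nil ||
	nodePool.Status.NodeClassObservedGeneration != nodeClass.GetGeneration() ||
	nodePool.Generation != registrationHealthy.ObservedGeneration {
	nodePool.StatusConditions().SetUnknown(v1.ConditionTypeNodeRegistrationHealthy)
}
```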
}

func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.registrationhealth")
Should we check whether the NodePool is managed in this Reconcile to match our Predicate, or are you wanting to handle that in the `GetNodeClass()` call?
Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeTrue())
Expect(nodePool.Status.NodeClassObservedGeneration).To(Equal(int64(1)))
})
It("should not set NodeRegistrationHealthy status condition on nodePool as Unknown if it is already set to true", func() {
Not sure that I get this test -- why would we set the status condition to Unknown here? All of the generation details match, so I don't see our controller doing anything.
nodePool = ExpectExists(ctx, env.Client, nodePool)
Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeFalse())
})
It("should not set NodeRegistrationHealthy status condition on nodePool as Unknown if it is already set to false", func() {
I think (at least) one of these tests should validate that we are properly updating the `NodeClassObservedGeneration`, since that's the responsibility of this controller.
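A hedged Ginkgo sketch of the kind of assertion being asked for; the way the NodeClass generation is bumped and the `controller` variable name are assumptions borrowed loosely from the surrounding suite:

```go
It("should update NodeClassObservedGeneration when the NodeClass generation changes", func() {
	nodeClass.Generation = 2 // illustrative bump; the real test may re-apply a modified NodeClass instead
	ExpectApplied(ctx, env.Client, nodePool, nodeClass)
	ExpectObjectReconciled(ctx, env.Client, controller, nodePool)

	nodePool = ExpectExists(ctx, env.Client, nodePool)
	Expect(nodePool.Status.NodeClassObservedGeneration).To(Equal(int64(2)))
	Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeTrue())
})
```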
@@ -78,6 +81,12 @@ var _ = Describe("Liveness", func() {
 	ExpectFinalizersRemoved(ctx, env.Client, nodeClaim)
We should validate this: Do we have a test that we succeed when the NodeClaim has no owning NodePool?
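Something like the following hedged sketch could cover it in the Liveness suite; the clock step, the controller variable name, and the helper usage are assumptions about the existing test harness:

```go
It("should not error when the NodeClaim has no owning NodePool", func() {
	delete(nodeClaim.Labels, v1.NodePoolLabelKey) // no karpenter.sh/nodepool label
	ExpectApplied(ctx, env.Client, nodeClaim)
	fakeClock.Step(time.Hour) // assumed to be well past the registration TTL
	// The reconcile should still clean up the NodeClaim without trying to touch a NodePool.
	ExpectObjectReconciled(ctx, env.Client, nodeClaimLivenessController, nodeClaim)
})
```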
"time" | ||
|
||
"github.com/awslabs/operatorpkg/status" | ||
operatorpkg "github.com/awslabs/operatorpkg/test/expectations" | ||
. "github.com/onsi/ginkgo/v2" |
We should validate this: Do we have a test that we succeed when the NodeClaim has no owning NodePool?
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Fixes #N/A

Description

This PR adds a `NodeRegistrationHealthy` status condition to the NodePool, which indicates whether a misconfiguration exists that is preventing successful node launches/registrations and requires manual investigation.

How was this change tested?

Added tests.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.