
feat: add NodeRegistrationHealthy status condition to nodepool #1969

Open · jigisha620 wants to merge 1 commit into main from degraded-nodepool-implementation

Conversation

jigisha620 (Contributor)

Fixes #N/A

Description
This PR adds a NodeRegistrationHealthy status condition to NodePool, which indicates whether a misconfiguration exists that is preventing successful node launches/registrations and requires manual investigation.
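For context, a rough sketch of how a consumer could read the new condition once it lands (the condition type name comes from this PR; the helper function, logging, and exact predicate methods are illustrative only):

```go
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/log"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// checkRegistrationHealth is a hypothetical consumer of the new condition: it inspects
// NodeRegistrationHealthy on a NodePool and surfaces the failure reason for operators.
func checkRegistrationHealth(ctx context.Context, nodePool *v1.NodePool) {
	cond := nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy)
	if cond != nil && cond.IsFalse() {
		log.FromContext(ctx).Info("nodePool cannot register nodes; manual investigation required",
			"nodePool", nodePool.Name, "reason", cond.Reason, "message", cond.Message)
	}
}
```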

How was this change tested?
Added tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Feb 6, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Feb 6, 2025
@coveralls commented on Feb 6, 2025

Pull Request Test Coverage Report for Build 13425130967

Details

  • 109 of 139 (78.42%) changed or added relevant lines in 10 files are covered.
  • 13 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.2%) to 81.24%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/controllers/controllers.go | 0 | 1 | 0.0% |
| pkg/controllers/nodeclaim/lifecycle/liveness.go | 17 | 25 | 68.0% |
| pkg/controllers/nodeclaim/lifecycle/registration.go | 13 | 21 | 61.9% |
| pkg/controllers/nodepool/registrationhealth/controller.go | 30 | 43 | 69.77% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/test/expectations/expectations.go | 2 | 94.81% |
| pkg/controllers/disruption/consolidation.go | 4 | 88.55% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.52% |
Totals Coverage Status:
  • Change from base Build 13422676255: -0.2%
  • Covered Lines: 9293
  • Relevant Lines: 11439

💛 - Coveralls

@jigisha620 force-pushed the degraded-nodepool-implementation branch 4 times, most recently from d060e31 to 659e4dd on February 6, 2025 22:28
@jigisha620 force-pushed the degraded-nodepool-implementation branch 2 times, most recently from 03a7d60 to 3fd43cf on February 12, 2025 02:13
```go
// If the nodeClaim failed to launch/register during the TTL, set the NodeRegistrationHealthy status condition on
// the NodePool to False. If the launch failed, get the launch failure reason and message from the nodeClaim.
if nodeClaim.StatusConditions().IsTrue(v1.ConditionTypeLaunched) {
	nodePool.StatusConditions().SetFalse(v1.ConditionTypeNodeRegistrationHealthy, "Unhealthy", "Failed to register node")
```
Contributor:

I think the reason should be RegistrationFailed. I'm also not sure whether, instead of that message, we should try to recommend specific things to double-check.
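Something like the following, replacing the SetFalse call above; the reason string follows this review comment, while the message wording is hypothetical:

```go
// Sketch of the suggested change: use a specific reason and an actionable message.
nodePool.StatusConditions().SetFalse(
	v1.ConditionTypeNodeRegistrationHealthy,
	"RegistrationFailed",
	"Node launched but failed to register; double-check bootstrap/user data, networking, and cluster endpoint configuration",
)
```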

@jigisha620 force-pushed the degraded-nodepool-implementation branch from 3fd43cf to d202c66 on February 12, 2025 19:05
@jigisha620 changed the title from "chore: add NodeRegistrationHealthy status condition to nodepool" to "feat: add NodeRegistrationHealthy status condition to nodepool" on Feb 12, 2025
@jigisha620 force-pushed the degraded-nodepool-implementation branch 3 times, most recently from 5df485a to 67ea6d0 on February 12, 2025 22:09
@jigisha620 force-pushed the degraded-nodepool-implementation branch from bc920ec to 7f356d4 on February 13, 2025 21:52
@jigisha620 force-pushed the degraded-nodepool-implementation branch from 7f356d4 to a9d685a on February 17, 2025 20:48
@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 17, 2025
@jigisha620 force-pushed the degraded-nodepool-implementation branch from a9d685a to 810ad69 on February 20, 2025 00:28
```diff
@@ -252,7 +252,7 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
 	})
 	c.recordPodSchedulingUndecidedMetric(pod)
 	// Get the time for when Karpenter first thought the pod was schedulable. This should be zero if we didn't simulate for this pod.
-	schedulableTime := c.cluster.PodSchedulingSuccessTime(types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace})
+	schedulableTime := c.cluster.PodSchedulingSuccessTime(types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace}, false)
```
Member:

Can we just create a separate function to access the PodSchedulingNodeRegistrationHealthySuccessTime or something like this -- I think a boolean here is a bit hard to reason about
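A minimal sketch of that suggestion, assuming the cluster state keeps two acknowledgement-time maps; all field and method names here are hypothetical:

```go
import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// Cluster is a pared-down stand-in for the scheduler's cluster state.
type Cluster struct {
	mu                sync.RWMutex
	podAckTime        map[types.NamespacedName]time.Time
	podHealthyAckTime map[types.NamespacedName]time.Time
}

// PodSchedulingSuccessTime keeps its original single-purpose signature...
func (c *Cluster) PodSchedulingSuccessTime(key types.NamespacedName) time.Time {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.podAckTime[key]
}

// ...and the registration-health lookup gets its own accessor instead of a boolean flag,
// so each call site reads unambiguously.
func (c *Cluster) PodSchedulingRegistrationHealthySuccessTime(key types.NamespacedName) time.Time {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.podHealthyAckTime[key]
}
```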

```diff
@@ -302,6 +302,7 @@ func (s *Scheduler) add(ctx context.Context, pod *corev1.Pod) error {
 	// Pick existing node that we are about to create
 	for _, nodeClaim := range s.newNodeClaims {
 		if err := nodeClaim.Add(pod, s.cachedPodData[pod.UID]); err == nil {
+			s.cluster.MarkPodToNodePoolSchedulingDecision(pod, nodeClaim.Labels[v1.NodePoolLabelKey])
```
Member:

How expensive is it to do one iteration through all the pods at the end? You could just iterate through the results and mark pod scheduling decisions with the NodePool attached in the same place I think rather than having to create an internal store to capture this
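For illustration, the post-pass described here might look roughly like this; the results/Pods field names follow the scheduler's result shape as I understand it, so treat the whole loop as a sketch:

```go
// After scheduling completes, walk the proposed NodeClaims once and record each pod's
// NodePool decision, instead of tracking it incrementally inside add().
for _, nodeClaim := range results.NewNodeClaims {
	nodePoolName := nodeClaim.Labels[v1.NodePoolLabelKey]
	for _, pod := range nodeClaim.Pods {
		s.cluster.MarkPodToNodePoolSchedulingDecision(pod, nodePoolName)
	}
}
```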

```go
// on the NodePool if the nodeClaim fails to launch/register
func (l *Liveness) updateNodePoolRegistrationHealth(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
	if ok && len(nodePoolName) != 0 {
```
Member:

nit: Just check nodePoolName != "" -- you don't even have to check ok, since if the label doesn't exist then it will just return an empty string -- and "" is an invalid name anyways
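That is, the sketch below collapses the two checks into one:

```go
// A missing label yields the zero value "", which is never a valid NodePool name,
// so the ok result can be dropped entirely.
if nodePoolName := nodeClaim.Labels[v1.NodePoolLabelKey]; nodePoolName != "" {
	// ... fetch the NodePool and update its status condition
}
```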

```go
nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
if ok && len(nodePoolName) != 0 {
	nodePool := &v1.NodePool{}
	if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
```
Member:

Do you properly handle the NodePool NotFound error?
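One conventional way to tolerate a NodePool deleted mid-reconcile is controller-runtime's IgnoreNotFound helper; a sketch, not necessarily how the PR resolves this:

```go
nodePool := &v1.NodePool{}
if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
	// Swallow NotFound: if the NodePool is gone there is no status to update,
	// and requeueing would never succeed.
	return client.IgnoreNotFound(err)
}
```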

```go
// on the NodePool if the nodeClaim that registered is owned by a NodePool
func (r *Registration) updateNodePoolRegistrationHealth(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
	if ok && len(nodePoolName) != 0 {
```
Member:

Same comment as in the liveness controller -- I would just check if the value is not equal to ""

```go
func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.registrationhealth")

	nodeClass := nodepoolutils.GetNodeClassStatusObject(nodePool, c.cloudProvider)
```
Member:

Is it too much to have this helper actually retrieve the NodeClass for us that the NodePool is referencing rather than just the schema?


```go
// If NodeClass/NodePool have been updated then NodeRegistrationHealthy = Unknown
if (nodePool.Status.NodeClassObservedGeneration != nodeClass.GetGeneration()) ||
	(nodePool.Generation != nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).ObservedGeneration) {
```
Member:

Get can return nil if the condition isn't found -- how are you going to handle that
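A nil-safe variant of that check could treat a missing condition as stale, for example (one option only; the PR may handle it differently):

```go
cond := nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy)
if cond == nil ||
	nodePool.Status.NodeClassObservedGeneration != nodeClass.GetGeneration() ||
	nodePool.Generation != cond.ObservedGeneration {
	// Treat "condition not set yet" the same as "generations drifted": reset to Unknown.
	nodePool.StatusConditions().SetUnknown(v1.ConditionTypeNodeRegistrationHealthy)
}
```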

```go
}

func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.registrationhealth")
```
Member:

Should we check whether the NodePool is managed in this Reconcile to match our Predicate or are you wanting to handle that in the GetNodeClass() call

```go
	Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeTrue())
	Expect(nodePool.Status.NodeClassObservedGeneration).To(Equal(int64(1)))
})
It("should not set NodeRegistrationHealthy status condition on nodePool as Unknown if it is already set to true", func() {
```
Member:

Not sure that I get this test -- why would we set the status condition to Unknown here -- all of the generation details match so I don't see our controller doing anything

```go
	nodePool = ExpectExists(ctx, env.Client, nodePool)
	Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeFalse())
})
It("should not set NodeRegistrationHealthy status condition on nodePool as Unknown if it is already set to false", func() {
```
Member:

I think (at least) one of these tests should validate that we are properly updating the NodeClassObservedGeneration since that's the responsibility of this controller
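For example, a test along these lines could pin down that responsibility (helper names follow the surrounding suite; the exact setup, including how the generation bump is induced, is illustrative):

```go
It("should update NodeClassObservedGeneration when the NodeClass generation changes", func() {
	// Bump the NodeClass generation; with the suite's fake client this can be set
	// directly, whereas a real API server would require a spec mutation.
	nodeClass.Generation = 2
	ExpectApplied(ctx, env.Client, nodeClass, nodePool)
	ExpectObjectReconciled(ctx, env.Client, controller, nodePool)
	nodePool = ExpectExists(ctx, env.Client, nodePool)
	Expect(nodePool.Status.NodeClassObservedGeneration).To(Equal(int64(2)))
})
```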

```diff
@@ -78,6 +81,12 @@ var _ = Describe("Liveness", func() {
 	ExpectFinalizersRemoved(ctx, env.Client, nodeClaim)
```
Member:

We should validate this: Do we have a test that we succeed when the NodeClaim has no owning NodePool?
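Something like the following could cover that gap (sketch only; helpers mirror the existing suite and the controller variable name is a placeholder):

```go
It("should not touch NodePool registration health when the nodeClaim has no owning nodePool", func() {
	// Drop the NodePool ownership label so the reconciler takes the "no owner" path.
	delete(nodeClaim.Labels, v1.NodePoolLabelKey)
	ExpectApplied(ctx, env.Client, nodeClaim)
	// Reconcile should succeed without attempting a NodePool status update.
	ExpectObjectReconciled(ctx, env.Client, nodeClaimController, nodeClaim)
})
```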

"time"

"github.com/awslabs/operatorpkg/status"
operatorpkg "github.com/awslabs/operatorpkg/test/expectations"
. "github.com/onsi/ginkgo/v2"
Member:

We should validate this: Do we have a test that we succeed when the NodeClaim has no owning NodePool?

@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Feb 21, 2025
Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)

6 participants