
feat: add NodeRegistrationHealthy status condition to nodepool #1969

Open · jigisha620 wants to merge 1 commit into main from degraded-nodepool-implementation

Conversation

jigisha620 (Contributor)

Fixes #N/A

Description
This PR adds a NodeRegistrationHealthy status condition to NodePool, which indicates whether a misconfiguration exists that is preventing successful node launches/registrations and requires manual investigation.
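For context, a rough sketch of how a consumer could read the new condition once it lands (the condition type name comes from this PR; the helper function, logging, and exact predicate methods are illustrative only):

```go
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/log"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// checkRegistrationHealth is a hypothetical consumer of the new condition: it inspects
// NodeRegistrationHealthy on a NodePool and surfaces the failure reason for operators.
func checkRegistrationHealth(ctx context.Context, nodePool *v1.NodePool) {
	cond := nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy)
	if cond != nil && cond.IsFalse() {
		log.FromContext(ctx).Info("nodePool cannot register nodes; manual investigation required",
			"nodePool", nodePool.Name, "reason", cond.Reason, "message", cond.Message)
	}
}
```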

How was this change tested?
Added tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Feb 6, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Feb 6, 2025
@coveralls commented on Feb 6, 2025

Pull Request Test Coverage Report for Build 13425130967

Details

  • 109 of 139 (78.42%) changed or added relevant lines in 10 files are covered.
  • 13 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.2%) to 81.24%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/controllers/controllers.go | 0 | 1 | 0.0% |
| pkg/controllers/nodeclaim/lifecycle/liveness.go | 17 | 25 | 68.0% |
| pkg/controllers/nodeclaim/lifecycle/registration.go | 13 | 21 | 61.9% |
| pkg/controllers/nodepool/registrationhealth/controller.go | 30 | 43 | 69.77% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/test/expectations/expectations.go | 2 | 94.81% |
| pkg/controllers/disruption/consolidation.go | 4 | 88.55% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 86.52% |
Totals Coverage Status:
  • Change from base Build 13422676255: -0.2%
  • Covered Lines: 9293
  • Relevant Lines: 11439

💛 - Coveralls

@jigisha620 force-pushed the degraded-nodepool-implementation branch 4 times, most recently from d060e31 to 659e4dd on February 6, 2025 22:28
@jigisha620 force-pushed the degraded-nodepool-implementation branch 2 times, most recently from 03a7d60 to 3fd43cf on February 12, 2025 02:13
```go
// If the nodeClaim failed to launch/register during the TTL, set the NodeRegistrationHealthy status condition on
// the NodePool to False. If the launch failed, get the launch failure reason and message from the nodeClaim.
if nodeClaim.StatusConditions().IsTrue(v1.ConditionTypeLaunched) {
	nodePool.StatusConditions().SetFalse(v1.ConditionTypeNodeRegistrationHealthy, "Unhealthy", "Failed to register node")
```
Contributor:

I think the reason should be RegistrationFailed. I'm also not sure whether, instead of that message, we should try to recommend specific things to double-check.
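Something like the following, replacing the SetFalse call above; the reason string follows this review comment, while the message wording is hypothetical:

```go
// Sketch of the suggested change: use a specific reason and an actionable message.
nodePool.StatusConditions().SetFalse(
	v1.ConditionTypeNodeRegistrationHealthy,
	"RegistrationFailed",
	"Node launched but failed to register; double-check bootstrap/user data, networking, and cluster endpoint configuration",
)
```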

@jigisha620 force-pushed the degraded-nodepool-implementation branch from 3fd43cf to d202c66 on February 12, 2025 19:05
@jigisha620 changed the title from "chore: add NodeRegistrationHealthy status condition to nodepool" to "feat: add NodeRegistrationHealthy status condition to nodepool" on Feb 12, 2025
@jigisha620 force-pushed the degraded-nodepool-implementation branch 3 times, most recently from 5df485a to 67ea6d0 on February 12, 2025 22:09
@jigisha620 force-pushed the degraded-nodepool-implementation branch from bc920ec to 7f356d4 on February 13, 2025 21:52
@jigisha620 force-pushed the degraded-nodepool-implementation branch from 7f356d4 to a9d685a on February 17, 2025 20:48
@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 17, 2025
@jigisha620 force-pushed the degraded-nodepool-implementation branch from a9d685a to 810ad69 on February 20, 2025 00:28
```diff
@@ -252,7 +252,7 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
 	})
 	c.recordPodSchedulingUndecidedMetric(pod)
 	// Get the time for when Karpenter first thought the pod was schedulable. This should be zero if we didn't simulate for this pod.
-	schedulableTime := c.cluster.PodSchedulingSuccessTime(types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace})
+	schedulableTime := c.cluster.PodSchedulingSuccessTime(types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace}, false)
```
Member:

Can we just create a separate function to access the PodSchedulingNodeRegistrationHealthySuccessTime or something like this -- I think a boolean here is a bit hard to reason about
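A minimal sketch of that suggestion, assuming the cluster state keeps two acknowledgement-time maps; all field and method names here are hypothetical:

```go
import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// Cluster is a pared-down stand-in for the scheduler's cluster state.
type Cluster struct {
	mu                sync.RWMutex
	podAckTime        map[types.NamespacedName]time.Time
	podHealthyAckTime map[types.NamespacedName]time.Time
}

// PodSchedulingSuccessTime keeps its original single-purpose signature...
func (c *Cluster) PodSchedulingSuccessTime(key types.NamespacedName) time.Time {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.podAckTime[key]
}

// ...and the registration-health lookup gets its own accessor instead of a boolean flag,
// so each call site reads unambiguously.
func (c *Cluster) PodSchedulingRegistrationHealthySuccessTime(key types.NamespacedName) time.Time {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.podHealthyAckTime[key]
}
```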

```diff
@@ -302,6 +302,7 @@ func (s *Scheduler) add(ctx context.Context, pod *corev1.Pod) error {
 	// Pick existing node that we are about to create
 	for _, nodeClaim := range s.newNodeClaims {
 		if err := nodeClaim.Add(pod, s.cachedPodData[pod.UID]); err == nil {
+			s.cluster.MarkPodToNodePoolSchedulingDecision(pod, nodeClaim.Labels[v1.NodePoolLabelKey])
```
Member:

How expensive is it to do one iteration through all the pods at the end? You could just iterate through the results and mark pod scheduling decisions with the NodePool attached in the same place I think rather than having to create an internal store to capture this
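For illustration, the post-pass described here might look roughly like this; the results/Pods field names follow the scheduler's result shape as I understand it, so treat the whole loop as a sketch:

```go
// After scheduling completes, walk the proposed NodeClaims once and record each pod's
// NodePool decision, instead of tracking it incrementally inside add().
for _, nodeClaim := range results.NewNodeClaims {
	nodePoolName := nodeClaim.Labels[v1.NodePoolLabelKey]
	for _, pod := range nodeClaim.Pods {
		s.cluster.MarkPodToNodePoolSchedulingDecision(pod, nodePoolName)
	}
}
```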

```go
// on the NodePool if the nodeClaim fails to launch/register
func (l *Liveness) updateNodePoolRegistrationHealth(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
	if ok && len(nodePoolName) != 0 {
```
Member:

nit: Just check nodePoolName != "" -- you don't even have to check ok, since if the label doesn't exist then it will just return an empty string -- and "" is an invalid name anyways
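That is, the sketch below collapses the two checks into one:

```go
// A missing label yields the zero value "", which is never a valid NodePool name,
// so the ok result can be dropped entirely.
if nodePoolName := nodeClaim.Labels[v1.NodePoolLabelKey]; nodePoolName != "" {
	// ... fetch the NodePool and update its status condition
}
```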

```go
nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
if ok && len(nodePoolName) != 0 {
	nodePool := &v1.NodePool{}
	if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
```
Member:

Do you properly handle the NodePool NotFound error?
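One conventional way to tolerate a NodePool deleted mid-reconcile is controller-runtime's IgnoreNotFound helper; a sketch, not necessarily how the PR resolves this:

```go
nodePool := &v1.NodePool{}
if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
	// Swallow NotFound: if the NodePool is gone there is no status to update,
	// and requeueing would never succeed.
	return client.IgnoreNotFound(err)
}
```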

```go
// on the NodePool if the nodeClaim that registered is owned by a NodePool
func (r *Registration) updateNodePoolRegistrationHealth(ctx context.Context, nodeClaim *v1.NodeClaim) error {
	nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
	if ok && len(nodePoolName) != 0 {
```
Member:

Same comment as in the liveness controller -- I would just check if the value is not equal to ""

```go
func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.registrationhealth")

	nodeClass := nodepoolutils.GetNodeClassStatusObject(nodePool, c.cloudProvider)
```
Member:

Is it too much to have this helper actually retrieve the NodeClass for us that the NodePool is referencing rather than just the schema?


```go
// If NodeClass/NodePool have been updated then NodeRegistrationHealthy = Unknown
if (nodePool.Status.NodeClassObservedGeneration != nodeClass.GetGeneration()) ||
	(nodePool.Generation != nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).ObservedGeneration) {
```
Member:

Get can return nil if the condition isn't found -- how are you going to handle that
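A nil-safe variant of that check could treat a missing condition as stale, for example (one option only; the PR may handle it differently):

```go
cond := nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy)
if cond == nil ||
	nodePool.Status.NodeClassObservedGeneration != nodeClass.GetGeneration() ||
	nodePool.Generation != cond.ObservedGeneration {
	// Treat "condition not set yet" the same as "generations drifted": reset to Unknown.
	nodePool.StatusConditions().SetUnknown(v1.ConditionTypeNodeRegistrationHealthy)
}
```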

```go
}

func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.registrationhealth")
```
Member:

Should we check whether the NodePool is managed in this Reconcile to match our Predicate or are you wanting to handle that in the GetNodeClass() call

```go
	Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeTrue())
	Expect(nodePool.Status.NodeClassObservedGeneration).To(Equal(int64(1)))
})
It("should not set NodeRegistrationHealthy status condition on nodePool as Unknown if it is already set to true", func() {
```
Member:

Not sure that I get this test -- why would we set the status condition to Unknown here -- all of the generation details match so I don't see our controller doing anything

```go
	nodePool = ExpectExists(ctx, env.Client, nodePool)
	Expect(nodePool.StatusConditions().Get(v1.ConditionTypeNodeRegistrationHealthy).IsUnknown()).To(BeFalse())
})
It("should not set NodeRegistrationHealthy status condition on nodePool as Unknown if it is already set to false", func() {
```
Member:

I think (at least) one of these tests should validate that we are properly updating the NodeClassObservedGeneration since that's the responsibility of this controller
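For example, a test along these lines could pin down that responsibility (helper names follow the surrounding suite; the exact setup, including how the generation bump is induced, is illustrative):

```go
It("should update NodeClassObservedGeneration when the NodeClass generation changes", func() {
	// Bump the NodeClass generation; with the suite's fake client this can be set
	// directly, whereas a real API server would require a spec mutation.
	nodeClass.Generation = 2
	ExpectApplied(ctx, env.Client, nodeClass, nodePool)
	ExpectObjectReconciled(ctx, env.Client, controller, nodePool)
	nodePool = ExpectExists(ctx, env.Client, nodePool)
	Expect(nodePool.Status.NodeClassObservedGeneration).To(Equal(int64(2)))
})
```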

```diff
@@ -78,6 +81,12 @@ var _ = Describe("Liveness", func() {
 	ExpectFinalizersRemoved(ctx, env.Client, nodeClaim)
```
Member:

We should validate this: Do we have a test that we succeed when the NodeClaim has no owning NodePool?
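Something like the following could cover that gap (sketch only; helpers mirror the existing suite and the controller variable name is a placeholder):

```go
It("should not touch NodePool registration health when the nodeClaim has no owning nodePool", func() {
	// Drop the NodePool ownership label so the reconciler takes the "no owner" path.
	delete(nodeClaim.Labels, v1.NodePoolLabelKey)
	ExpectApplied(ctx, env.Client, nodeClaim)
	// Reconcile should succeed without attempting a NodePool status update.
	ExpectObjectReconciled(ctx, env.Client, nodeClaimController, nodeClaim)
})
```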

"time"

"github.com/awslabs/operatorpkg/status"
operatorpkg "github.com/awslabs/operatorpkg/test/expectations"
. "github.com/onsi/ginkgo/v2"
Member:

We should validate this: Do we have a test that we succeed when the NodeClaim has no owning NodePool?

@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Feb 21, 2025
Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)

6 participants