Fix Maintenance Creation Check for Control Plane Nodes #110

razo7 · 2024-01-02T07:22:11Z

Extend the etcd quorum check so that it looks not only at DisruptionsAllowed but also at the etcd guard pod of the control plane node. If there are no allowed disruptions and the nm CR is for a node that is not already disrupted, then we must not allow this CR creation, as it would violate etcd quorum. Otherwise, when the node's guard pod has failed (Ready status is False) or there is no guard pod for the node, we allow the CR creation, as it won't violate the etcd quorum any further.
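A minimal sketch of the rule described above, for illustration only. The names `guardPodState` and `isCreationAllowed` are hypothetical and not taken from the PR; the actual check lives in `etcd.IsEtcdDisruptionAllowed` and may handle more cases (for example, an unknown guard pod state).

```go
package main

import "fmt"

// guardPodState is a hypothetical summary of the etcd guard pod found for a node.
type guardPodState int

const (
	guardPodMissing  guardPodState = iota // no guard pod exists for the node
	guardPodNotReady                      // guard pod exists, Ready status is False
	guardPodReady                         // guard pod exists and is Ready
)

// isCreationAllowed follows the PR description: with no disruption budget left,
// only a node that is already disrupted (guard pod failed or missing) may get
// a maintenance CR, because that does not reduce the quorum any further.
func isCreationAllowed(disruptionsAllowed int32, guard guardPodState) bool {
	if disruptionsAllowed > 0 {
		return true
	}
	return guard != guardPodReady
}

func main() {
	fmt.Println(isCreationAllowed(1, guardPodReady))
	fmt.Println(isCreationAllowed(0, guardPodReady))
	fmt.Println(isCreationAllowed(0, guardPodNotReady))
	fmt.Println(isCreationAllowed(0, guardPodMissing))
}
```

Modelling the guard pod as a three-valued state keeps the "missing" and "not ready" cases explicit instead of folding them into a single boolean.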

Furthermore, this etcd quorum check is only valid on OCP / OKD, since only they have the etcd quorum PDB. Thus, we won't run this validation on other platforms.

Originally the PR intended to block CR creation for any node, including workers that we currently support. It was decided to allow it for any node as long as the (control plane) node won't violate etcd quorum.

ECOPROJECT-1811

openshift-ci · 2024-01-02T07:22:15Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2024-01-02T07:22:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [razo7]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

razo7 · 2024-01-02T07:23:43Z

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

razo7 · 2024-01-02T11:09:33Z

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

razo7 · 2024-01-03T10:45:26Z

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

clobrano · 2024-01-03T11:33:56Z

/lgtm
giving others a chance to review as well, feel free to unhold
/hold

razo7 · 2024-01-04T07:00:14Z

/retest

slintes · 2024-01-05T14:28:47Z

api/v1beta1/nodemaintenance_webhook.go

+	// find the status of node condition Ready
+	nodeConditionByType := make(map[corev1.NodeConditionType]corev1.NodeCondition)
+	for _, nc := range node.Status.Conditions {
+		nodeConditionByType[nc.Type] = nc


what's the value of using the nodeConditionByType map, and not directly checking for the Ready condition here? 🤔

I thought it looked nicer, but I see your point of being more explicit.

slintes · 2024-01-05T14:29:33Z

api/v1beta1/nodemaintenance_webhook.go

+	for _, nc := range node.Status.Conditions {
+		nodeConditionByType[nc.Type] = nc
+	}
+	nodeConditionReady := nodeConditionByType[corev1.NodeReady].Status


this will panic when there is no Ready condition (which can happen on new nodes AFAIK)

I am not aware of this scenario, but adding a check on whether the condition exists is a good point.
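For reference, a sketch of checking the Ready condition directly and reporting whether it exists. The types below are local stand-ins for the `corev1` ones so that the example runs without Kubernetes dependencies, and `readyStatus` is a hypothetical helper. As a side note, indexing a Go map with a missing key returns the zero value rather than panicking, so the map-based lookup would yield an empty status for a node without a Ready condition; the explicit check still makes that case visible to the caller.

```go
package main

import "fmt"

// Local stand-ins for corev1.NodeConditionType, corev1.ConditionStatus and
// corev1.NodeCondition (assumption: only the fields used here are modelled).
type NodeConditionType string
type ConditionStatus string

const NodeReady NodeConditionType = "Ready"

type NodeCondition struct {
	Type   NodeConditionType
	Status ConditionStatus
}

// readyStatus scans the conditions once and returns the status of the Ready
// condition, plus false when the node has no Ready condition at all.
func readyStatus(conditions []NodeCondition) (ConditionStatus, bool) {
	for _, c := range conditions {
		if c.Type == NodeReady {
			return c.Status, true
		}
	}
	return "", false
}

func main() {
	status, found := readyStatus(nil)
	fmt.Printf("new node: status=%q found=%v\n", status, found)

	status, found = readyStatus([]NodeCondition{{Type: NodeReady, Status: "True"}})
	fmt.Printf("ready node: status=%q found=%v\n", status, found)
}
```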

razo7 · 2024-01-07T06:23:59Z

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

razo7 · 2024-01-07T09:54:31Z

/retest

razo7 · 2024-01-07T13:03:42Z

/retest

razo7 · 2024-01-07T14:31:34Z

/test 4.14-openshift-e2e

Use IsEtcdDisruptionAllowed to check whether a node can be disrupted and it won't violate control plane etcd quorum prior to creating nm CR

The log messages from the webhook were not informative or specific enough. Adding the nodemaintenance CR and node name to the prints could improve that

Modify tests to search for new error and add cases for etcd guard pod

razo7 · 2024-01-16T14:04:25Z

Moved from blocking CR creation on unhealthy nodes to a better check of unhealthy nodes and CP guard pods, which runs prior to CR creation and before any etcd quorum violation: medik8s/common#17

razo7 · 2024-01-16T16:36:56Z

/retest

slintes · 2024-01-16T19:03:25Z

api/v1beta1/nodemaintenance_webhook.go

-		// TODO do we need a fallback for k8s clusters?
-		nodemaintenancelog.Info("etcd quorum guard PDB hasn't been found. Skipping master/control-plane quorum validation.")
-		return nil
+	canDisrupt, err := etcd.IsEtcdDisruptionAllowed(context.Background(), v.client, nodemaintenancelog, node)


You are inventing new names again, which are less clear than what we already have (in the function name) 🤷🏼‍♂️
What's the advantage of "can" instead of "isAllowed"?
"Can" not only means that something is allowed, but it can (pun intended) also mean that you are able to do something. We are always able to disrupt etcd. But we shouldn't always do that 😉

also: you need to check if we are on OCP / OKD. On k8s we don't have the etcd PDB / guards, and the check returns false in that case, which stops maintenance. In the old version maintenance is allowed.

Assert if cluster was installed on OpenShift, since only there a quorum PDB exists, otherwise we have nothing to assert
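A rough sketch of the gating requested in this review, under stated assumptions: `disruptionCheck` is a hypothetical stand-in for a call such as `etcd.IsEtcdDisruptionAllowed`, and `validateControlPlaneQuorum` and `errQuorumViolation` are illustrative names, not the ones used in the PR. It also uses the `isAllowed` naming suggested above.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuorumViolation is a hypothetical sentinel error for a rejected CR.
var errQuorumViolation = errors.New("maintenance would violate etcd quorum")

// disruptionCheck is a hypothetical stand-in for the real etcd disruption check.
type disruptionCheck func(nodeName string) (bool, error)

// validateControlPlaneQuorum skips the etcd check entirely when the cluster is
// not OpenShift, since there is no etcd quorum PDB or guard pod to inspect.
func validateControlPlaneQuorum(isOpenshift bool, nodeName string, check disruptionCheck) error {
	if !isOpenshift {
		return nil
	}
	isAllowed, err := check(nodeName)
	if err != nil {
		return fmt.Errorf("checking etcd disruption for node %s: %w", nodeName, err)
	}
	if !isAllowed {
		return errQuorumViolation
	}
	return nil
}

func main() {
	deny := func(string) (bool, error) { return false, nil }
	fmt.Println(validateControlPlaneQuorum(false, "master-0", deny))
	fmt.Println(validateControlPlaneQuorum(true, "master-0", deny))
}
```

With this shape, plain Kubernetes clusters keep the old behavior (maintenance is allowed), while OpenShift clusters get the stricter check.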

razo7 · 2024-01-17T14:06:49Z

/retest

slintes · 2024-01-19T09:47:10Z

pkg/utils/validation.go

+)
+
+// IsOpenshiftSupported will be set to true in case the operator was installed on OpenShift cluster
+var	IsOpenshiftSupported bool


I'm lazy and let Copilot do the review 😉 The issue is: do not use global variables!

I've reviewed the provided code and it appears to be correct. There are no syntax errors, the function ValidateIsOpenshift is properly checking if the cluster has the OpenShift config group, and the global variable IsOpenshiftSupported is being set correctly based on the presence of the OpenShift config group. However, there is a potential design issue that could be improved. The global variable IsOpenshiftSupported is being set within the ValidateIsOpenshift function. This could lead to unexpected behavior if this function is called multiple times, as the global variable would be overwritten each time. It would be better to return the result from the function and let the caller decide what to do with it. [...]

some more input from my side, skipping answers:

please provide an alternative to using the IsOpenshiftSupported global variable, without returning the result in ValidateIsOpenshift

make ValidateIsOpenshift private

no, isOpenshiftSupported still needs to be public, but not as global variable

now call validateIsOpenshift in NewOpenshiftValidator

final code:

package utils

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

type OpenshiftValidator struct {
	isOpenshiftSupported bool
}

func NewOpenshiftValidator(config *rest.Config) (*OpenshiftValidator, error) {
	v := &OpenshiftValidator{}
	err := v.validateIsOpenshift(config)
	if err != nil {
		return nil, err
	}
	return v, nil
}

func (v *OpenshiftValidator) IsOpenshiftSupported() bool {
	return v.isOpenshiftSupported
}

func (v *OpenshiftValidator) validateIsOpenshift(config *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return err
	}
	apiGroups, err := dc.ServerGroups()
	if err != nil {
		return err
	}
	kind := schema.GroupVersionKind{Group: "config.openshift.io", Version: "v1", Kind: "ClusterVersion"}
	for _, apiGroup := range apiGroups.Groups {
		for _, supportedVersion := range apiGroup.Versions {
			if supportedVersion.GroupVersion == kind.GroupVersion().String() {
				v.isOpenshiftSupported = true
				return nil
			}
		}
	}
	return nil
}

WDYT?

do not use global variables

but good idea to run the detection code once only 👍🏼

Global variables should be avoided, so the value is now encapsulated behind an initialization function, NewOpenshiftValidator, and the input of SetupWebhookWithManager is modified accordingly
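To illustrate the design point (detect once, then pass the result around instead of reading a global), here is a dependency-free sketch. It is not the PR's code: `groupVersionLister`, `newOpenshiftValidatorFromLister` and `webhookValidator` are hypothetical names, and the lister replaces the real discovery client so that the example runs standalone.

```go
package main

import "fmt"

// groupVersionLister is a hypothetical abstraction over the discovery client;
// it returns group/version strings such as "config.openshift.io/v1".
type groupVersionLister func() ([]string, error)

// OpenshiftValidator holds the detection result in an unexported field.
type OpenshiftValidator struct {
	isOpenshiftSupported bool
}

// newOpenshiftValidatorFromLister runs the detection exactly once, at construction.
func newOpenshiftValidatorFromLister(list groupVersionLister) (*OpenshiftValidator, error) {
	groupVersions, err := list()
	if err != nil {
		return nil, err
	}
	v := &OpenshiftValidator{}
	for _, gv := range groupVersions {
		if gv == "config.openshift.io/v1" {
			v.isOpenshiftSupported = true
			break
		}
	}
	return v, nil
}

// IsOpenshiftSupported exposes the cached result without re-running detection.
func (v *OpenshiftValidator) IsOpenshiftSupported() bool {
	return v.isOpenshiftSupported
}

// webhookValidator is a hypothetical consumer that receives the result as input.
type webhookValidator struct {
	isOpenshift bool
}

func main() {
	lister := func() ([]string, error) {
		return []string{"apps/v1", "config.openshift.io/v1"}, nil
	}
	v, err := newOpenshiftValidatorFromLister(lister)
	if err != nil {
		fmt.Println("detection failed:", err)
		return
	}
	w := webhookValidator{isOpenshift: v.IsOpenshiftSupported()}
	fmt.Println("webhook runs etcd quorum check:", w.isOpenshift)
}
```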

razo7 · 2024-01-21T08:40:33Z

/test 4.13-openshift-e2e

slintes

one test is at a wrong place, otherwise lgtm

slintes · 2024-01-22T08:31:56Z

api/v1beta1/webhook_suite_test.go

@@ -97,6 +101,10 @@ var _ = BeforeSuite(func() {
 	Expect(err).NotTo(HaveOccurred())
 	Expect(k8sClient).NotTo(BeNil())

+	openshiftCheck, err := utils.NewOpenshiftValidator(cfg)


This looks like a test of the utils package, no? It shouldn't be in BeforeSuite but in an actual test. And ideally not in this api package.

It shouldn't be in BeforeSuite but in an actual test. And ideally not in this api package.

Where would it make more sense to place it? I thought of placing it here as it would happen once per suite and Webhook unit tests and not in the BeforeEach of the unit tests.
But from your answer, I understand that api package does not seem right. Do you have any other place in mind?

every test should be in an It()

ideally unit test suites should be where the code is, so for this one in the utils package

so for this one in the utils package

Are you suggesting moving all the unit test logic and files for Webhook from api package to utils package?

no, just the validator test

slintes · 2024-01-24T14:07:50Z

pkg/utils/validation_test.go

+)
+
+var _ = Describe("Check OpenShift Existance Validation", func() {
+	testEnv := &envtest.Environment{}


Any reason to not put this into BeforeSuite, just like we always do?
Maybe also configure the logger?

slintes · 2024-01-24T14:08:02Z

pkg/utils/validation.go

+		}
+	}
+	return nil
+}


missing new line

Remove OpenShift validation from Webhook, and add a test for validating it in Utils package

slintes · 2024-01-24T14:38:44Z

/lgtm

razo7 · 2024-01-24T14:43:17Z

/unhold

razo7 · 2024-01-25T06:41:29Z

/retest

openshift-ci bot added the do-not-merge/work-in-progress label Jan 2, 2024

openshift-ci bot added the approved label Jan 2, 2024

razo7 changed the title ~~Better Webhook Print for nodemaintenance CR on Not Ready node~~ Block CR Creation for Not Ready Nodes Jan 3, 2024

razo7 force-pushed the better-webhook-cp-message branch from 6ebe40b to 3e0d303 Compare January 3, 2024 10:45

openshift-ci bot added the do-not-merge/hold label Jan 3, 2024

openshift-ci bot assigned clobrano Jan 3, 2024

openshift-ci bot added the lgtm label Jan 3, 2024

razo7 marked this pull request as ready for review January 3, 2024 11:36

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 3, 2024

openshift-ci bot requested review from beekhof and clobrano January 3, 2024 11:36

slintes requested changes Jan 5, 2024

openshift-ci bot assigned slintes Jan 5, 2024

openshift-ci bot removed the lgtm label Jan 5, 2024

razo7 force-pushed the better-webhook-cp-message branch 2 times, most recently from a2b2c11 to e3caa57 Compare January 16, 2024 13:32

openshift-merge-robot added the needs-rebase label Jan 16, 2024

razo7 added 2 commits January 16, 2024 15:34

Fetch etcd package and IsEtcdDisruptionAllowed

43dc47e

Use IsEtcdDisruptionAllowed to check whether a node can be disrupted and it won't violate control plane etcd quorum prior to creating nm CR

Print nodemaintenance CR and node name on failed ValidateCreate

0e3bfdc

The log messages from the webhook were not informative or specific enough. Adding the nodemaintenance CR and node name to the prints could improve that

razo7 added 2 commits January 16, 2024 15:42

Fix error checking in tests and guard pods

26d87d4

Modify tests to search for new error and add cases for etcd guard pod

Use corev1 import synonym instead of generic v1

01c8263

razo7 force-pushed the better-webhook-cp-message branch from e3caa57 to 01c8263 Compare January 16, 2024 13:46

openshift-merge-robot removed the needs-rebase label Jan 16, 2024

razo7 changed the title ~~Block CR Creation for Not Ready Nodes~~ Fix Maintenance Creation Check for Control Plane Nodes Jan 16, 2024

slintes requested changes Jan 16, 2024

razo7 force-pushed the better-webhook-cp-message branch from c13f37c to 316d043 Compare January 17, 2024 12:54

Validate control-plane quorum only on OpenShift

138830d

Assert if cluster was installed on OpenShift, since only there a quorum PDB exists, otherwise we have nothing to assert

razo7 force-pushed the better-webhook-cp-message branch from 316d043 to 138830d Compare January 17, 2024 13:01

slintes requested changes Jan 19, 2024

Pass IsOpenshiftSupported to Webhook instead of global var

472dae3

Global variables should be avoided, so the value is now encapsulated behind an initialization function, NewOpenshiftValidator, and the input of SetupWebhookWithManager is modified accordingly

slintes requested changes Jan 22, 2024

slintes mentioned this pull request Jan 23, 2024

Enable out-of-service taint in FAR medik8s/fence-agents-remediation#92

Merged

4 tasks

slintes reviewed Jan 24, 2024

Create utils testing for OpenShift validator

9178a33

Remove OpenShift validation from Webhook, and add a test for validating it in Utils package

razo7 force-pushed the better-webhook-cp-message branch from f6381c9 to 9178a33 Compare January 24, 2024 14:15

openshift-ci bot added the lgtm label Jan 24, 2024

openshift-ci bot removed the do-not-merge/hold label Jan 24, 2024

openshift-merge-bot bot merged commit 5cb0cee into medik8s:main Jan 25, 2024
14 checks passed

razo7 mentioned this pull request Feb 20, 2024

Fetch ETCD Quorum Check fix for Unknown Guard Pod State #119

Merged
