Fix Maintenance Creation Check for Control Plane Nodes #110

Merged
merged 7 commits into from
Jan 25, 2024

Member

@razo7 razo7 commented Jan 2, 2024

Change the etcd quorum check from looking only at DisruptionsAllowed to also looking for the control plane node's etcd guard pod. If no disruptions are allowed and the nm CR is for a node that is not already disrupted, then we must not allow the CR creation, as it would violate etcd quorum. Otherwise, when the node's guard pod has failed (Ready status is False) or the node has no guard pod, we allow the CR creation, as it won't further violate the etcd quorum.

Furthermore, this etcd quorum check is only valid on OCP / OKD, since they have the etcd quorum PDB. Thus, we won't run this validation on other platforms.

Originally the PR intended to block CR creation for any node, including workers that we currently support. It was decided to allow it for any node as long as the (control plane) node won't violate etcd quorum.

ECOPROJECT-1811
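
The decision rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `guardPodState` and the lower-case `isDisruptionAllowed` are hypothetical names, and the inputs are passed in directly instead of being read from the PDB and the guard pod in the cluster.

```go
package main

// guardPodState is a hypothetical, simplified view of a control plane
// node's etcd guard pod.
type guardPodState struct {
	Found bool // a guard pod exists for the node
	Ready bool // the guard pod's Ready condition is True
}

// isDisruptionAllowed reports whether creating the nm CR for a node keeps
// etcd quorum intact, following the rules in the PR description.
func isDisruptionAllowed(disruptionsAllowed int32, guard guardPodState) bool {
	// The PDB still has budget: one more disruption is fine.
	if disruptionsAllowed > 0 {
		return true
	}
	// No budget left. A missing or not-Ready guard pod means this node is
	// already disrupted, so maintenance does not make quorum any worse.
	if !guard.Found || !guard.Ready {
		return true
	}
	// No budget left and the node is healthy: block the CR creation.
	return false
}
```

With no disruptions allowed, `isDisruptionAllowed(0, guardPodState{Found: true, Ready: true})` is false, while the same call with `Ready: false` or with no guard pod found is true.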

Contributor

openshift-ci bot commented Jan 2, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

openshift-ci bot commented Jan 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jan 2, 2024
Member Author

razo7 commented Jan 2, 2024

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

1 similar comment
Member Author

razo7 commented Jan 2, 2024

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

@razo7 razo7 changed the title Better Webhook Print for nodemaintenance CR on Not Ready node Block CR Creation for Not Ready Nodes Jan 3, 2024
@razo7 razo7 force-pushed the better-webhook-cp-message branch from 6ebe40b to 3e0d303 Compare January 3, 2024 10:45
Member Author

razo7 commented Jan 3, 2024

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

Contributor

clobrano commented Jan 3, 2024

/lgtm
giving others a chance to review as well, feel free to unhold
/hold

@openshift-ci openshift-ci bot added the lgtm label Jan 3, 2024
@razo7 razo7 marked this pull request as ready for review January 3, 2024 11:36
@openshift-ci openshift-ci bot requested review from beekhof and clobrano January 3, 2024 11:36
Member Author

razo7 commented Jan 4, 2024

/retest

// find the status of node condition Ready
nodeConditionByType := make(map[corev1.NodeConditionType]corev1.NodeCondition)
for _, nc := range node.Status.Conditions {
nodeConditionByType[nc.Type] = nc
Member

what's the value of using the nodeConditionByType map, and not directly checking for the Ready condition here? 🤔

Member Author

@razo7 razo7 Jan 7, 2024

I thought it looked nicer, but I see your point of being more explicit.

for _, nc := range node.Status.Conditions {
nodeConditionByType[nc.Type] = nc
}
nodeConditionReady := nodeConditionByType[corev1.NodeReady].Status
Member

this will panic when there is no Ready condition (which can happen on new nodes AFAIK)

Member Author

I am not aware of this scenario, but adding a check on whether the condition exists is a good point.
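
A minimal sketch of the explicit existence check discussed here. The types below are local stand-ins for the corev1 ones so the snippet has no dependencies, and `readyStatus` is a hypothetical helper name. Note that indexing a Go map with a missing key returns the zero value rather than panicking, so in the map-based version a node without a Ready condition would show up as an empty status; returning an extra boolean makes that case explicit.

```go
package main

// Local stand-ins for corev1.NodeConditionType, corev1.ConditionStatus
// and corev1.NodeCondition (assumption: only these fields matter here).
type NodeConditionType string
type ConditionStatus string

const (
	NodeReady     NodeConditionType = "Ready"
	ConditionTrue ConditionStatus   = "True"
)

type NodeCondition struct {
	Type   NodeConditionType
	Status ConditionStatus
}

// readyStatus returns the status of the Ready condition and whether the
// condition was present at all.
func readyStatus(conditions []NodeCondition) (ConditionStatus, bool) {
	for _, c := range conditions {
		if c.Type == NodeReady {
			return c.Status, true
		}
	}
	// No Ready condition reported, for example on a node that has not
	// posted its status yet.
	return "", false
}
```

The caller can then decide how to treat a node without a Ready condition instead of silently comparing against an empty string.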

@openshift-ci openshift-ci bot removed the lgtm label Jan 5, 2024
Member Author

razo7 commented Jan 7, 2024

/test 4.14-openshift-e2e
/test 4.15-openshift-e2e

Member Author

razo7 commented Jan 7, 2024

/retest

1 similar comment
Member Author

razo7 commented Jan 7, 2024

/retest

Member Author

razo7 commented Jan 7, 2024

/test 4.14-openshift-e2e

@razo7 razo7 force-pushed the better-webhook-cp-message branch 2 times, most recently from a2b2c11 to e3caa57 Compare January 16, 2024 13:32
razo7 added 2 commits January 16, 2024 15:34
Use IsEtcdDisruptionAllowed to check whether a node can be disrupted without violating control plane etcd quorum prior to creating the nm CR
The log messages from the webhook were not informative or specific enough. Adding the nodemaintenance CR and node name to the prints improves that
razo7 added 2 commits January 16, 2024 15:42
Modify tests to search for new error and add cases for etcd guard pod
@razo7 razo7 force-pushed the better-webhook-cp-message branch from e3caa57 to 01c8263 Compare January 16, 2024 13:46
@razo7 razo7 changed the title Block CR Creation for Not Ready Nodes Fix Maintenance Creation Check for Control Plane Nodes Jan 16, 2024
Member Author

razo7 commented Jan 16, 2024

Moving from blocking CR creation on unhealthy nodes to better checking of unhealthy nodes, and CP guard pods prior to CR creation and any etcd quorum violation medik8s/common#17

Member Author

razo7 commented Jan 16, 2024

/retest

// TODO do we need a fallback for k8s clusters?
nodemaintenancelog.Info("etcd quorum guard PDB hasn't been found. Skipping master/control-plane quorum validation.")
return nil
canDisrupt, err := etcd.IsEtcdDisruptionAllowed(context.Background(), v.client, nodemaintenancelog, node)
Member

You are inventing new names again, which are less clear than what we already have (in the function name) 🤷🏼‍♂️
What's the advantage of "can" instead of "isAllowed"?
"Can" not only means that something is allowed, but it can (pun intended) also mean that you are able to do something. We are always able to disrupt etcd. But we shouldn't always do that 😉

Member

also: you need to check if we are on OCP / OKD. On k8s we don't have the etcd PDB / guards, and the check returns false in that case, which stops maintenance. In the old version maintenance is allowed.
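
One way to express this gating, as a sketch only: `validateQuorum` is a hypothetical wrapper, and the etcd check is passed in as a function so the example stays free of cluster dependencies. On a non-OpenShift cluster the check is skipped entirely, which preserves the old behavior of allowing maintenance.

```go
package main

import (
	"errors"
	"fmt"
)

// validateQuorum runs the etcd quorum check only on OpenShift (OCP / OKD).
// isDisruptionAllowed stands in for the real etcd check.
func validateQuorum(isOpenshift bool, isDisruptionAllowed func() (bool, error)) error {
	if !isOpenshift {
		// Plain Kubernetes has no etcd quorum PDB or guard pods, so there
		// is nothing to validate and maintenance stays allowed.
		return nil
	}
	isAllowed, err := isDisruptionAllowed()
	if err != nil {
		return fmt.Errorf("checking etcd quorum: %w", err)
	}
	if !isAllowed {
		return errors.New("maintenance would violate etcd quorum")
	}
	return nil
}
```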

@razo7 razo7 force-pushed the better-webhook-cp-message branch from c13f37c to 316d043 Compare January 17, 2024 12:54
Assert whether the cluster was installed on OpenShift, since a quorum PDB exists only there; otherwise we have nothing to assert
@razo7 razo7 force-pushed the better-webhook-cp-message branch from 316d043 to 138830d Compare January 17, 2024 13:01
Member Author

razo7 commented Jan 17, 2024

/retest

)

// IsOpenshiftSupported will be set to true in case the operator was installed on OpenShift cluster
var IsOpenshiftSupported bool
Member

@slintes slintes Jan 19, 2024

I'm lazy and let Copilot do the review 😉 The issue is: do not use global variables!

I've reviewed the provided code and it appears to be correct. There are no syntax errors, the function ValidateIsOpenshift is properly checking if the cluster has the OpenShift config group, and the global variable IsOpenshiftSupported is being set correctly based on the presence of the OpenShift config group. However, there is a potential design issue that could be improved. The global variable IsOpenshiftSupported is being set within the ValidateIsOpenshift function. This could lead to unexpected behavior if this function is called multiple times, as the global variable would be overwritten each time. It would be better to return the result from the function and let the caller decide what to do with it. [...]

some more input from my side, skipping answers:

  • please provide an alternative to using the IsOpenshiftSupported global variable, without returning the result in ValidateIsOpenshift
  • make ValidateIsOpenshift private
  • no, isOpenshiftSupported still needs to be public, but not as global variable
  • now call validateIsOpenshift in NewOpenshiftValidator

final code:

package utils

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

type OpenshiftValidator struct {
	isOpenshiftSupported bool
}

func NewOpenshiftValidator(config *rest.Config) (*OpenshiftValidator, error) {
	v := &OpenshiftValidator{}
	err := v.validateIsOpenshift(config)
	if err != nil {
		return nil, err
	}
	return v, nil
}

func (v *OpenshiftValidator) IsOpenshiftSupported() bool {
	return v.isOpenshiftSupported
}

func (v *OpenshiftValidator) validateIsOpenshift(config *rest.Config) error {
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return err
	}
	apiGroups, err := dc.ServerGroups()
	if err != nil {
		return err
	}

	kind := schema.GroupVersionKind{Group: "config.openshift.io", Version: "v1", Kind: "ClusterVersion"}
	for _, apiGroup := range apiGroups.Groups {
		for _, supportedVersion := range apiGroup.Versions {
			if supportedVersion.GroupVersion == kind.GroupVersion().String() {
				v.isOpenshiftSupported = true
				return nil
			}
		}
	}
	return nil
}

WDYT?
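
The matching loop in validateIsOpenshift compares each served group/version string against "config.openshift.io/v1" (what kind.GroupVersion().String() yields). As a sketch, that step could be isolated into a pure helper so it can be unit tested without a cluster; `hasGroupVersion` is a hypothetical name and takes plain strings instead of the discovery types.

```go
package main

// hasGroupVersion reports whether want (for example "config.openshift.io/v1")
// appears in the group/version strings served by the API server.
func hasGroupVersion(served []string, want string) bool {
	for _, gv := range served {
		if gv == want {
			return true
		}
	}
	return false
}
```

A caller would collect the GroupVersion field of every version in every returned API group and pass that slice in.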

Member

@slintes slintes Jan 19, 2024

do not use global variables

but good idea to run the detection code once only 👍🏼

Global variables should be avoided; encapsulate the value by using an initialization function, NewOpenshiftValidator, and modify the SetupWebhookWithManager input
Member Author

razo7 commented Jan 21, 2024

/test 4.13-openshift-e2e

Member

@slintes slintes left a comment

one test is at a wrong place, otherwise lgtm

@@ -97,6 +101,10 @@ var _ = BeforeSuite(func() {
Expect(err).NotTo(HaveOccurred())
Expect(k8sClient).NotTo(BeNil())

openshiftCheck, err := utils.NewOpenshiftValidator(cfg)
Member

This looks like a test of the utils package, no? It shouldn't be in BeforeSuite but in an actual test. And ideally not in this api package.

Member Author

> It shouldn't be in BeforeSuite but in an actual test. And ideally not in this api package.

Where would it make more sense to place it? I thought of placing it here as it would happen once per suite and Webhook unit tests and not in the BeforeEach of the unit tests.
But from your answer, I understand that api package does not seem right. Do you have any other place in mind?

Member

  • every test should be in an It()
  • ideally unit test suites should be where the code is, so for this one in the utils package

Member Author

> so for this one in the utils package

Are you suggesting moving all the unit test logic and files for Webhook from api package to utils package?

Member

no, just the validator test

)

var _ = Describe("Check OpenShift Existance Validation", func() {
testEnv := &envtest.Environment{}
Member

Any reason to not put this into BeforeSuite, just like we always do?
Maybe also configure the logger?

}
}
return nil
}
Member

missing new line

Remove OpenShift validation from Webhook, and add a test for validating it in Utils package
@razo7 razo7 force-pushed the better-webhook-cp-message branch from f6381c9 to 9178a33 Compare January 24, 2024 14:15
Member

slintes commented Jan 24, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 24, 2024
Member Author

razo7 commented Jan 24, 2024

/unhold

Member Author

razo7 commented Jan 25, 2024

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit 5cb0cee into medik8s:main Jan 25, 2024
14 checks passed
4 participants