-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Maintenance Creation Check for Control Plane Nodes #110
Fix Maintenance Creation Check for Control Plane Nodes #110
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: razo7 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test 4.14-openshift-e2e |
1 similar comment
/test 4.14-openshift-e2e |
6ebe40b
to
3e0d303
Compare
/test 4.14-openshift-e2e |
/lgtm |
/retest |
// find the status of node condition Ready | ||
nodeConditionByType := make(map[corev1.NodeConditionType]corev1.NodeCondition) | ||
for _, nc := range node.Status.Conditions { | ||
nodeConditionByType[nc.Type] = nc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the value of using the nodeConditionByType
map, and not directly checking for the Ready
condition here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it looked nicer, but I see your point of being more explicit.
for _, nc := range node.Status.Conditions { | ||
nodeConditionByType[nc.Type] = nc | ||
} | ||
nodeConditionReady := nodeConditionByType[corev1.NodeReady].Status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will panic when there is no Ready
condition (which can happen on new nodes AFAIK)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not aware of this scenario, but adding a check on whether the condition exists is a good point.
/test 4.14-openshift-e2e |
/retest |
1 similar comment
/retest |
/test 4.14-openshift-e2e |
a2b2c11
to
e3caa57
Compare
Use IsEtcdDisruptionAllowed to check whether a node can be disrupted and it won't violate control plane etcd quorum prior to creating nm CR
The log messages from webook were less informative and specific. Adding the nodemaintenance CR and node name to the prints could improve that
Modify tests to search for new error and add cases for etcd guard pod
e3caa57
to
01c8263
Compare
Moving from blocking CR creation on unhealthy nodes to better checking of unhealthy nodes, and CP guard pods prior to CR creation and any etcd quorum violation medik8s/common#17 |
/retest |
// TODO do we need a fallback for k8s clusters? | ||
nodemaintenancelog.Info("etcd quorum guard PDB hasn't been found. Skipping master/control-plane quorum validation.") | ||
return nil | ||
canDisrupt, err := etcd.IsEtcdDisruptionAllowed(context.Background(), v.client, nodemaintenancelog, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are inventing new names again, which are less clear that what we already have (in the function name) 🤷🏼♂️
What's the advantage of "can" instead of "isAllowed"?
"Can" not only means if something is allowed, but it can (pun intended) also mean if you are able to do something. We are always able to disrupt etcd. But we should't always do that 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also: you need to check if we are on OCP / OKD. On k8s we don't have the etcd PDB / guards, and the check returns false in that case, which stops maintenance. In the old version maintenance is allowed.
c13f37c
to
316d043
Compare
Assert if cluster was installed on OpenShift, since only there a quorum PDB exists, otherwise we have nothing to assert
316d043
to
138830d
Compare
/retest |
pkg/utils/validation.go
Outdated
) | ||
|
||
// IsOpenshiftSupported will be set to true in case the operator was installed on OpenShift cluster | ||
var IsOpenshiftSupported bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm lazy and let Copilot do the review 😉 The issue is: do not use global variables!
I've reviewed the provided code and it appears to be correct. There are no syntax errors, the function ValidateIsOpenshift is properly checking if the cluster has the OpenShift config group, and the global variable IsOpenshiftSupported is being set correctly based on the presence of the OpenShift config group. However, there is a potential design issue that could be improved. The global variable IsOpenshiftSupported is being set within the ValidateIsOpenshift function. This could lead to unexpected behavior if this function is called multiple times, as the global variable would be overwritten each time. It would be better to return the result from the function and let the caller decide what to do with it. [...]
some more input from my side, skipping answers:
- please provide an alternative to using the IsOpenshiftSupported global variable, without returning the result in ValidateIsOpenshift
- make ValidateIsOpenshift private
- no, isOpenshiftSupported still needs to be public, but not as global variable
- now call validateIsOpenshift in NewOpenshiftValidator
final code:
package utils
import (
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/client-go/discovery"
"k8s.io/client-go/rest"
)
type OpenshiftValidator struct {
isOpenshiftSupported bool
}
func NewOpenshiftValidator(config *rest.Config) (*OpenshiftValidator, error) {
v := &OpenshiftValidator{}
err := v.validateIsOpenshift(config)
if err != nil {
return nil, err
}
return v, nil
}
func (v *OpenshiftValidator) IsOpenshiftSupported() bool {
return v.isOpenshiftSupported
}
func (v *OpenshiftValidator) validateIsOpenshift(config *rest.Config) error {
dc, err := discovery.NewDiscoveryClientForConfig(config)
if err != nil {
return err
}
apiGroups, err := dc.ServerGroups()
if err != nil {
return err
}
kind := schema.GroupVersionKind{Group: "config.openshift.io", Version: "v1", Kind: "ClusterVersion"}
for _, apiGroup := range apiGroups.Groups {
for _, supportedVersion := range apiGroup.Versions {
if supportedVersion.GroupVersion == kind.GroupVersion().String() {
v.isOpenshiftSupported = true
return nil
}
}
}
return nil
}
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not use global variables
but good idea to run the detection code once only 👍🏼
Global variable should be avoided and encapsulating the value in a better way was needed by using an initalizaion function, NewOpenshiftValidator, and modify SetupWebhookWithManager input
/test 4.13-openshift-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one test is at a wrong place, otherwise lgtm
api/v1beta1/webhook_suite_test.go
Outdated
@@ -97,6 +101,10 @@ var _ = BeforeSuite(func() { | |||
Expect(err).NotTo(HaveOccurred()) | |||
Expect(k8sClient).NotTo(BeNil()) | |||
|
|||
openshiftCheck, err := utils.NewOpenshiftValidator(cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a test of the utils package, not? It shouldn't be in BeforeSuite but in an actual test. And ideally not in this api package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be in BeforeSuite but in an actual test. And ideally not in this api package.
Where would it make more sense to place it? I thought of placing it here as it would happen once per suite and Webhook unit tests and not in the BeforeEach of the unit tests.
But from your answer, I understand that api package does not seem right. Do you have any other place in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- every test should be in an
It()
- ideally unit test suites should be where the code is, so for this one in the
utils
package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so for this one in the utils package
Are you suggesting moving all the unit test logic and files for Webhook from api package to utils package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, just the validator test
pkg/utils/validation_test.go
Outdated
) | ||
|
||
var _ = Describe("Check OpenShift Existance Validation", func() { | ||
testEnv := &envtest.Environment{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not put this into BeforeSuite, just like we always do?
Maybe also configure the logger?
pkg/utils/validation.go
Outdated
} | ||
} | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing new line
Remove OpenShift validation from Webhook, and add a test for validating it in Utils package
f6381c9
to
9178a33
Compare
/lgtm |
/unhold |
/retest |
Fix etcd quorum check from looking only at DisruptionsAllowed to also looking for control plane node etcd guard pod. If there are no allowed disruptions and nm CR is for a node that is not disrupted, then we must not allow this CR creation as it would violate etcd quorum. Otherwise, when there is a failed guard pod (Ready status is False) or there is no guard pod for the node, then we allow the CR creation as it won't violate further the etcd quorum.
Furthermore, this etcd quorum check is only valid on OCP / OKD, since they have etcd quorum PDB. Thus, we won't run this validation on other platforms.
Originally the PR intended to block CR creation for any node, including workers that we currently support. It was decided to allow it for any node as long as the (control plane) node won't violate etcd quorum.
ECOPROJECT-1811