Surface service-controller LB provisioning failures through status #245

Merged: 1 commit into openshift:master on Jun 10, 2019

Conversation

@ironcladlou (Contributor) commented Jun 6, 2019:

When an LB is determined to be pending, try and surface more detail by analyzing
events in the operand namespace related to the LB service, and if an LB creation
failure event is detected, propagate the message into IngressController status
and provide a more specific reason.

Replace the use of indexers with simpler inline cache lookups now that the
manager cache is available.
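
A rough sketch of the idea described above (illustrative only, not the PR's exact code; pendingLBCondition is a hypothetical helper name, and the event-matching fields are assumptions based on the service-controller events discussed later in this thread):

// Hypothetical sketch: prefer the message from a service-controller
// "CreatingLoadBalancerFailed" event over the generic pending message.
func pendingLBCondition(service *corev1.Service, operandEvents []corev1.Event) operatorv1.OperatorCondition {
	reason, message := "LoadBalancerPending", "The LoadBalancer service is pending"
	for _, e := range operandEvents {
		if e.Source.Component == "service-controller" &&
			e.Reason == "CreatingLoadBalancerFailed" &&
			e.InvolvedObject.Name == service.Name {
			reason, message = "CreatingLoadBalancerFailed", e.Message
			break
		}
	}
	return operatorv1.OperatorCondition{
		Type:    operatorv1.LoadBalancerReadyIngressConditionType,
		Status:  operatorv1.ConditionFalse,
		Reason:  reason,
		Message: message,
	}
}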

TODO

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2019
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 6, 2019

// Try and find a more specific reason for the pending status.
createFailedReason := "CreatingLoadBalancerFailed"
failedLoadBalancerEvents := getEventsByReason(operandEvents, "service-controller", createFailedReason)
Contributor Author:

Perhaps these should be sorted by time, descending?
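
For illustration, a newest-first ordering of the matched events could be a one-liner with the standard sort package (a sketch, not part of this PR):

sort.Slice(failedLoadBalancerEvents, func(i, j int) bool {
	// Descending: event i sorts first when its LastTimestamp is later.
	return failedLoadBalancerEvents[i].LastTimestamp.Time.After(failedLoadBalancerEvents[j].LastTimestamp.Time)
})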

Contributor:

This would be useful once we check for more than 1 event type, but otherwise it is not strictly necessary since we ignore events once the LB is provisioned.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 6, 2019
@ironcladlou changed the title from "WIP: Surface service-controller LB provisioning failures through status" to "Surface service-controller LB provisioning failures through status" on Jun 6, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2019
@@ -249,33 +140,75 @@ func (c *ingressStatusCache) computeLoadBalancerStatus(ic *operatorv1.IngressCon
conditions = append(conditions, operatorv1.OperatorCondition{
Type: operatorv1.LoadBalancerManagedIngressConditionType,
Status: operatorv1.ConditionTrue,
Reason: "HasLoadBalancerEndpointPublishingStrategy",
Message: "IngressController has LoadBalancer endpoint publishing strategy",
Reason: "WantedByEndpointPublishingStrategy",
Contributor Author:

I threw this in for discussion. Reflecting on the k8s API guidelines I wonder if this condition type should have been "LoadBalancerUnmanaged=True" — too late!

Contributor Author:

Condition types should indicate state in the “abnormal-true” polarity. For example, if the condition indicates when a policy is invalid, the “is valid” case is probably the norm, so the condition should be called “Invalid”.

Contributor:

Condition types should indicate state in the “abnormal-true” polarity.

We already violate that guideline with Available.

Contributor Author:

Any thoughts on the Reason I'm proposing here? I can revert if the old one is better (and am open to new suggestions).

Contributor:

No preference. I'm not sure a reason is strictly necessary for the "normal" state, but what you have is fine.

@ironcladlou (Contributor, Author):

Given the difficulty of testing this, I propose we:

  1. Merge since it might work (the theory is based on actual event data), and introduces no regressions
  2. Right away, fix our stuff so IngressControllers appear in must-gather
  3. Monitor CI for evidence of the new condition details in failed clusters (and fix bugs if we find cases where the condition should be there but isn't)
  4. Separately, figure out how to automate testing

errs = append(errs, fmt.Errorf("failed to sync ingresscontroller status: %v", err))
operandEvents := &corev1.EventList{}
if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
Contributor:

"openshift-ingress"lbService.Namespace.

Contributor Author:

Okay, so this is really subtle and dangerous, but lbService can be nil. Would love to separately take some action on our prior discussions about:

  1. result types vs. nil for dealing with things that are not found
  2. coming up with a strategy for dealing with references that doesn't involve hard-coded namespaces or inferring from random resources

Basically I wonder if there's any local fix right here (where's the canonical place to discover the operand namespace in this context?).

Contributor:

Oh, snap, good point! Well, does it make sense to look for events if the service does not exist?

Contributor Author:

The lookup isn't dependent on the service, and the lookup is from a cache, so we could choose to provide whatever context we can (consistent with the other possibly-nil inputs).

Contributor:

If the service doesn't exist, then computeLoadBalancerStatus won't even look at the events, and logically, why would it? Do we anticipate that we might with future changes care about events that are not related to the service?

Contributor Author (@ironcladlou, Jun 7, 2019):

If the service doesn't exist, then computeLoadBalancerStatus won't even look at the events

computeLoadBalancerStatus won't, but this function doesn't know that. This function just knows computeLoadBalancerStatus wants events in a namespace. I propose this contract:

// syncIngressControllerStatus takes whatever you give it and updates status.
// Give it whatever you have and it will do its best.
func (r *reconciler) syncIngressControllerStatus(ic *operatorv1.IngressController, deployment *appsv1.Deployment, service *corev1.Service, operandEvents []corev1.Event) error {

😁
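
A hypothetical illustration of what that "best effort" contract's body could look like (assumes a made-up computeIngressStatusConditions helper and that the reconciler holds a controller-runtime client; this is not the code in this PR):

func (r *reconciler) syncIngressControllerStatus(ic *operatorv1.IngressController, deployment *appsv1.Deployment, service *corev1.Service, operandEvents []corev1.Event) error {
	updated := ic.DeepCopy()
	// Every argument other than ic may be nil or empty; compute whatever
	// status we can from what we were given.
	if deployment != nil {
		updated.Status.AvailableReplicas = deployment.Status.AvailableReplicas
	}
	// computeIngressStatusConditions (hypothetical) is likewise expected
	// to tolerate a nil service and an empty event slice.
	updated.Status.Conditions = computeIngressStatusConditions(ic, deployment, service, operandEvents)
	if reflect.DeepEqual(ic.Status, updated.Status) {
		return nil
	}
	return r.client.Status().Update(context.TODO(), updated)
}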

Contributor:

That seems reasonable... or would it be simpler for syncIngressControllerStatus to look up events itself?

Contributor Author:

could perhaps ask the same question regarding all the other arguments... for now, unless there's a logic error, maybe we can continue the style discussion on subsequent PRs?

Contributor:

could perhaps ask the same question regarding all the other arguments...

No: syncIngressControllerStatus needs to know the ingress controller for which it is updating status, and ensureIngressController just created or got the deployment and service, which are sound reasons for ensureIngressController to pass those values to syncIngressControllerStatus. Listing events in ensureIngressController instead of in syncIngressControllerStatus, by contrast, gratuitously separates the logic of listing events from the logic that determines whether the events need to be listed.

That is not to say that your current approach is unacceptable; the above is only responding to the above-quoted assertion.

That said, if ensureIngressController does handle listing events but gets an error, how about passing a nil slice to syncIngressControllerStatus?

for now, unless there's a logic error, maybe we can continue the style discussion on subsequent PRs?

That's fine.

if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
} else {
if err := r.syncIngressControllerStatus(ci, deployment, lbService, operandEvents.Items); err != nil {
Contributor:

Why should failure to list events inhibit status updating? We update status even if getting the service fails.

Contributor Author:

Seems related to the stuff we've talked about in the area of #245 (comment) — I do agree with you. The status function should be able to function without the event set in this case.

However, there is a subtle difference between an empty event list from a successful API call and one that's empty because the API call failed. In the latter case, the downstream consumer (status function) has a reduced trust level in the input. Like, if the status function relies on the absence of a value to decide there's a serious problem, should we choose to call the status function with an "undefined" input?

In this case the decision won't affect something serious like availability...

Contributor Author:

I fixed what I think is the core complaint (event lookup failure no longer prevents status sync).
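
Roughly, the shape of that fix (a sketch based on the diff context above; the exact code is in the updated commit): the listing error is recorded, but status is still synced with whatever was gathered.

operandEvents := &corev1.EventList{}
if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
	errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
	// Fall through: a failed event lookup only means fewer details in status.
}
if err := r.syncIngressControllerStatus(ci, deployment, lbService, operandEvents.Items); err != nil {
	errs = append(errs, fmt.Errorf("failed to sync ingresscontroller status: %v", err))
}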


conditions = append(conditions, operatorv1.OperatorCondition{
Type: operatorv1.LoadBalancerReadyIngressConditionType,
Status: operatorv1.ConditionFalse,
Reason: "LoadBalancerPending",
Message: "The LoadBalancer service is pending",
Reason: "LoadBalancerNotFound",
Contributor:

How about "LoadBalancer"→"Service" since "LoadBalancer" could refer to either the service or the cloud LB?

Contributor Author:

fixed

conditions = append(conditions, operatorv1.OperatorCondition{
Type: operatorv1.LoadBalancerReadyIngressConditionType,
Status: operatorv1.ConditionTrue,
Reason: "LoadBalancerProvisioned",
Message: "The LoadBalancer service is provisioned",
})
default:
case isPending(service):
Contributor:

Why not make this the default case?

Contributor:

@Miciah if "Pending" is the default case, does the current default ConditionUnknown get removed?

Contributor:

Yes. It is dead code anyway.

Contributor Author:

I went ahead and removed the default case as it's unreachable, but I left isProvisioned and isPending. I believe isProvisioned and default (replacing isPending) would be a functional alternative, but even so I guessed having explicitly named cases would aid readability. What do you think?
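
For context, the two predicates presumably key off the service's load-balancer ingress status; a minimal sketch (assumed implementations, not necessarily the PR's):

// A LoadBalancer service is provisioned once the cloud provider has
// reported at least one ingress IP or hostname for it.
func isProvisioned(service *corev1.Service) bool {
	return len(service.Status.LoadBalancer.Ingress) > 0
}

// Pending: the service exists but no ingress point has been assigned yet.
func isPending(service *corev1.Service) bool {
	return !isProvisioned(service)
}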

Contributor:

It's fine. I wonder whether the compiler is smart enough (and the language semantics are flexible enough) to optimize the isPending call out.

return corev1.Event{
Type: "Warning",
Reason: "CreatingLoadBalancerFailed",
Message: "failed to ensure load balancer for service openshift-ingress/router-default: TooManyLoadBalancers: Exceeded quotaof account",
Contributor:

Is "quotaof" a typo?

Contributor Author:

Yes, fixed

pendingLBService("default"),
clusterIPservice("default"),
},
name: "lb pending, no events",
Contributor:

Can this be "no events for current lb" with events: []corev1.Event{schedulerEvent(), failedCreateLBEvent("secondary")}?

Contributor Author:

Great improvement, fixed
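
The resulting test case presumably looks something like this (field and fixture names are inferred from the diff context above and may not match exactly):

{
	name: "lb pending, no events for current lb",
	services: []corev1.Service{
		pendingLBService("default"),
		clusterIPservice("default"),
	},
	events: []corev1.Event{
		schedulerEvent(),
		failedCreateLBEvent("secondary"),
	},
},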

@danehans (Contributor) left a comment:

I found a typo. Otherwise, lgtm.

return corev1.Event{
Type: "Warning",
Reason: "CreatingLoadBalancerFailed",
Message: "failed to ensure load balancer for service openshift-ingress/router-default: TooManyLoadBalancers: Exceeded quot aof account",
Contributor:

s/quot aof/quota of/

Contributor Author:

Fixed... again (my fix for the original had a typo 🤦‍♀️)

Squashed commit:

When an LB is determined to be pending, try and surface more detail by analyzing
events in the operand namespace related to the LB service, and if an LB creation
failure event is detected, propagate the message into IngressController status
and provide a more specific reason.

Replace the use of indexers with simpler inline cache lookups now that the
manager cache is available.
@ironcladlou (Contributor, Author):

Went ahead and squashed.

@Miciah (Contributor) commented Jun 10, 2019:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2019
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d935e48 into openshift:master Jun 10, 2019