Surface service-controller LB provisioning failures through status #245

Merged: 1 commit into openshift:master on Jun 10, 2019

Conversation

@ironcladlou (Contributor) commented Jun 6, 2019:

When an LB is determined to be pending, try and surface more detail by analyzing
events in the operand namespace related to the LB service, and if an LB creation
failure event is detected, propagate the message into IngressController status
and provide a more specific reason.

Replace the use of indexers with simpler inline cache lookups now that the
manager cache is available.
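
A rough sketch of the idea described above (illustrative only, not the PR's exact code; pendingLBCondition is a hypothetical helper name, and the event-matching fields are assumptions based on the service-controller events discussed later in this thread):

// Hypothetical sketch: prefer the message from a service-controller
// "CreatingLoadBalancerFailed" event over the generic pending message.
func pendingLBCondition(service *corev1.Service, operandEvents []corev1.Event) operatorv1.OperatorCondition {
	reason, message := "LoadBalancerPending", "The LoadBalancer service is pending"
	for _, e := range operandEvents {
		if e.Source.Component == "service-controller" &&
			e.Reason == "CreatingLoadBalancerFailed" &&
			e.InvolvedObject.Name == service.Name {
			reason, message = "CreatingLoadBalancerFailed", e.Message
			break
		}
	}
	return operatorv1.OperatorCondition{
		Type:    operatorv1.LoadBalancerReadyIngressConditionType,
		Status:  operatorv1.ConditionFalse,
		Reason:  reason,
		Message: message,
	}
}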

TODO

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2019
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 6, 2019

// Try and find a more specific reason for the pending status.
createFailedReason := "CreatingLoadBalancerFailed"
failedLoadBalancerEvents := getEventsByReason(operandEvents, "service-controller", createFailedReason)
Contributor Author:

Perhaps these should be sorted by time, descending?
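
For illustration, a newest-first ordering of the matched events could be a one-liner with the standard sort package (a sketch, not part of this PR):

sort.Slice(failedLoadBalancerEvents, func(i, j int) bool {
	// Descending: event i sorts first when its LastTimestamp is later.
	return failedLoadBalancerEvents[i].LastTimestamp.Time.After(failedLoadBalancerEvents[j].LastTimestamp.Time)
})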

Contributor:

This would be useful once we check for more than 1 event type, but otherwise it is not strictly necessary since we ignore events once the LB is provisioned.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 6, 2019
@ironcladlou changed the title from "WIP: Surface service-controller LB provisioning failures through status" to "Surface service-controller LB provisioning failures through status" on Jun 6, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2019
@@ -249,33 +140,75 @@ func (c *ingressStatusCache) computeLoadBalancerStatus(ic *operatorv1.IngressCon
conditions = append(conditions, operatorv1.OperatorCondition{
Type: operatorv1.LoadBalancerManagedIngressConditionType,
Status: operatorv1.ConditionTrue,
Reason: "HasLoadBalancerEndpointPublishingStrategy",
Message: "IngressController has LoadBalancer endpoint publishing strategy",
Reason: "WantedByEndpointPublishingStrategy",
Contributor Author:

I threw this in for discussion. Reflecting on the k8s API guidelines I wonder if this condition type should have been "LoadBalancerUnmanaged=True" — too late!

Contributor Author:

Condition types should indicate state in the “abnormal-true” polarity. For example, if the condition indicates when a policy is invalid, the “is valid” case is probably the norm, so the condition should be called “Invalid”.

Contributor:

Condition types should indicate state in the “abnormal-true” polarity.

We already violate that guideline with Available.

Contributor Author:

Any thoughts on the Reason I'm proposing here? I can revert if the old one is better (and am open to new suggestions).

Contributor:

No preference. I'm not sure a reason is strictly necessary for the "normal" state, but what you have is fine.

@ironcladlou (Contributor, Author):

Given the difficulty of testing this, I propose we:

  1. Merge since it might work (the theory is based on actual event data), and introduces no regressions
  2. Right away, fix our stuff so IngressControllers appear in must-gather
  3. Monitor CI for evidence of the new condition details in failed clusters (and fix bugs if we find cases where the condition should be there but isn't)
  4. Separately, figure out how to automate testing

errs = append(errs, fmt.Errorf("failed to sync ingresscontroller status: %v", err))
operandEvents := &corev1.EventList{}
if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
Contributor:

"openshift-ingress"lbService.Namespace.

Contributor Author:

Okay, so this is really subtle and dangerous, but lbService can be nil. Would love to separately take some action on our prior discussions about:

  1. result types vs. nil for dealing with things that are not found
  2. coming up with a strategy for dealing with references that doesn't involve hard-coded namespaces or inferring from random resources

Basically I wonder if there's any local fix right here (where's the canonical place to discover the operand namespace in this context?).

Contributor:

Oh, snap, good point! Well, does it make sense to look for events if the service does not exist?

Contributor Author:

The lookup isn't dependent on the service, and the lookup is from a cache, so we could choose to provide whatever context we can (consistent with the other possibly-nil inputs).

Contributor:

If the service doesn't exist, then computeLoadBalancerStatus won't even look at the events, and logically, why would it? Do we anticipate that we might with future changes care about events that are not related to the service?

Contributor Author (@ironcladlou, Jun 7, 2019):

If the service doesn't exist, then computeLoadBalancerStatus won't even look at the events

computeLoadBalancerStatus won't, but this function doesn't know that. This function just knows computeLoadBalancerStatus wants events in a namespace. I propose this contract:

// syncIngressControllerStatus takes whatever you give it and updates status.
// Give it whatever you have and it will do its best.
func (r *reconciler) syncIngressControllerStatus(ic *operatorv1.IngressController, deployment *appsv1.Deployment, service *corev1.Service, operandEvents []corev1.Event) error {

😁
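
A hypothetical illustration of what that "best effort" contract's body could look like (assumes a made-up computeIngressStatusConditions helper and that the reconciler holds a controller-runtime client; this is not the code in this PR):

func (r *reconciler) syncIngressControllerStatus(ic *operatorv1.IngressController, deployment *appsv1.Deployment, service *corev1.Service, operandEvents []corev1.Event) error {
	updated := ic.DeepCopy()
	// Every argument other than ic may be nil or empty; compute whatever
	// status we can from what we were given.
	if deployment != nil {
		updated.Status.AvailableReplicas = deployment.Status.AvailableReplicas
	}
	// computeIngressStatusConditions (hypothetical) is likewise expected
	// to tolerate a nil service and an empty event slice.
	updated.Status.Conditions = computeIngressStatusConditions(ic, deployment, service, operandEvents)
	if reflect.DeepEqual(ic.Status, updated.Status) {
		return nil
	}
	return r.client.Status().Update(context.TODO(), updated)
}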

Contributor:

That seems reasonable... or would it be simpler for syncIngressControllerStatus to look up events itself?

Contributor Author:

could perhaps ask the same question regarding all the other arguments... for now, unless there's a logic error, maybe we can continue the style discussion on subsequent PRs?

Contributor:

could perhaps ask the same question regarding all the other arguments...

No: syncIngressControllerStatus needs to know the ingress controller for which it is updating status, and ensureIngressController just created or got the deployment and service, which are sound reasons for ensureIngressController to pass those values to syncIngressControllerStatus. Listing events in ensureIngressController instead of in syncIngressControllerStatus, by contrast, gratuitously separates the logic of listing events from the logic that determines whether the events need to be listed.

That is not to say that your current approach is unacceptable; the above is only responding to the above-quoted assertion.

That said, if ensureIngressController does handle listing events but gets an error, how about passing a nil slice to syncIngressControllerStatus?

for now, unless there's a logic error, maybe we can continue the style discussion on subsequent PRs?

That's fine.

if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
} else {
if err := r.syncIngressControllerStatus(ci, deployment, lbService, operandEvents.Items); err != nil {
Contributor:

Why should failure to list events inhibit status updating? We update status even if getting the service fails.

Contributor Author:

Seems related to the stuff we've talked about in the area of #245 (comment) — I do agree with you. The status function should be able to function without the event set in this case.

However, there is a subtle difference between an empty event list from a successful API call and one that's empty because the API call failed. In the latter case, the downstream consumer (status function) has a reduced trust level in the input. Like, if the status function relies on the absence of a value to decide there's a serious problem, should we choose to call the status function with an "undefined" input?

In this case the decision won't affect something serious like availability...

Contributor Author:

I fixed what I think is the core complaint (event lookup failure no longer prevents status sync).
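
Roughly, the shape of that fix (a sketch based on the diff context above; the exact code is in the updated commit): the listing error is recorded, but status is still synced with whatever was gathered.

operandEvents := &corev1.EventList{}
if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
	errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
	// Fall through: a failed event lookup only means fewer details in status.
}
if err := r.syncIngressControllerStatus(ci, deployment, lbService, operandEvents.Items); err != nil {
	errs = append(errs, fmt.Errorf("failed to sync ingresscontroller status: %v", err))
}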


conditions = append(conditions, operatorv1.OperatorCondition{
Type: operatorv1.LoadBalancerReadyIngressConditionType,
Status: operatorv1.ConditionFalse,
Reason: "LoadBalancerPending",
Message: "The LoadBalancer service is pending",
Reason: "LoadBalancerNotFound",
Contributor:

How about "LoadBalancer"→"Service" since "LoadBalancer" could refer to either the service or the cloud LB?

Contributor Author:

fixed

conditions = append(conditions, operatorv1.OperatorCondition{
Type: operatorv1.LoadBalancerReadyIngressConditionType,
Status: operatorv1.ConditionTrue,
Reason: "LoadBalancerProvisioned",
Message: "The LoadBalancer service is provisioned",
})
default:
case isPending(service):
Contributor:

Why not make this the default case?

Contributor:

@Miciah if "Pending" is the default case, does the current default ConditionUnknown get removed?

Contributor:

Yes. It is dead code anyway.

Contributor Author:

I went ahead and removed the default case as it's unreachable, but I left isProvisioned and isPending. I believe isProvisioned and default (replacing isPending) would be a functional alternative, but even so I guessed having explicitly named cases would aid readability. What do you think?
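
For context, the two predicates presumably key off the service's load-balancer ingress status; a minimal sketch (assumed implementations, not necessarily the PR's):

// A LoadBalancer service is provisioned once the cloud provider has
// reported at least one ingress IP or hostname for it.
func isProvisioned(service *corev1.Service) bool {
	return len(service.Status.LoadBalancer.Ingress) > 0
}

// Pending: the service exists but no ingress point has been assigned yet.
func isPending(service *corev1.Service) bool {
	return !isProvisioned(service)
}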

Contributor:

It's fine. I wonder whether the compiler is smart enough (and the language semantics are flexible enough) to optimize the isPending call out.

return corev1.Event{
Type: "Warning",
Reason: "CreatingLoadBalancerFailed",
Message: "failed to ensure load balancer for service openshift-ingress/router-default: TooManyLoadBalancers: Exceeded quotaof account",
Contributor:

Is "quotaof" a typo?

Contributor Author:

Yes, fixed

pendingLBService("default"),
clusterIPservice("default"),
},
name: "lb pending, no events",
Contributor:

Can this be "no events for current lb" with events: []corev1.Event{schedulerEvent(), failedCreateLBEvent("secondary")}?

Contributor Author:

Great improvement, fixed
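
The resulting test case presumably looks something like this (field and fixture names are inferred from the diff context above and may not match exactly):

{
	name: "lb pending, no events for current lb",
	services: []corev1.Service{
		pendingLBService("default"),
		clusterIPservice("default"),
	},
	events: []corev1.Event{
		schedulerEvent(),
		failedCreateLBEvent("secondary"),
	},
},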

@danehans (Contributor) left a comment:

I found a typo. Otherwise, lgtm.

return corev1.Event{
Type: "Warning",
Reason: "CreatingLoadBalancerFailed",
Message: "failed to ensure load balancer for service openshift-ingress/router-default: TooManyLoadBalancers: Exceeded quot aof account",
Contributor:

s/quot aof/quota of/

Contributor Author:

Fixed... again (my fix for the original had a typo 🤦‍♀️)

Squashed commit:

When an LB is determined to be pending, try and surface more detail by analyzing
events in the operand namespace related to the LB service, and if an LB creation
failure event is detected, propagate the message into IngressController status
and provide a more specific reason.

Replace the use of indexers with simpler inline cache lookups now that the
manager cache is available.
@ironcladlou (Contributor, Author):

Went ahead and squashed.

@Miciah (Contributor) commented Jun 10, 2019:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2019
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d935e48 into openshift:master Jun 10, 2019