Surface service-controller LB provisioning failures through status #245
// Try and find a more specific reason for the pending status.
createFailedReason := "CreatingLoadBalancerFailed"
failedLoadBalancerEvents := getEventsByReason(operandEvents, "service-controller", createFailedReason)
Perhaps these should be sorted by time, descending?
This would be useful once we check for more than 1 event type, but otherwise it is not strictly necessary since we ignore events once the LB is provisioned.
abc99f9 to 1a84149
@@ -249,33 +140,75 @@ func (c *ingressStatusCache) computeLoadBalancerStatus(ic *operatorv1.IngressCon
conditions = append(conditions, operatorv1.OperatorCondition{
	Type: operatorv1.LoadBalancerManagedIngressConditionType,
	Status: operatorv1.ConditionTrue,
	Reason: "HasLoadBalancerEndpointPublishingStrategy",
	Message: "IngressController has LoadBalancer endpoint publishing strategy",
	Reason: "WantedByEndpointPublishingStrategy",
I threw this in for discussion. Reflecting on the k8s API guidelines I wonder if this condition type should have been "LoadBalancerUnmanaged=True" — too late!
Condition types should indicate state in the “abnormal-true” polarity. For example, if the condition indicates when a policy is invalid, the “is valid” case is probably the norm, so the condition should be called “Invalid”.
> Condition types should indicate state in the “abnormal-true” polarity.
We already violate that guideline with Available.
Any thoughts on the Reason I'm proposing here? I can revert if the old one is better (and am open to new suggestions).
No preference. I'm not sure a reason is strictly necessary for the "normal" state, but what you have is fine.
Given the difficulty of testing this, I propose we:
errs = append(errs, fmt.Errorf("failed to sync ingresscontroller status: %v", err))
operandEvents := &corev1.EventList{}
if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
	errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
"openshift-ingress" → lbService.Namespace.
Okay, so this is really subtle and dangerous, but lbService can be nil. Would love to separately take some action on our prior discussions about:
- result types vs. nil for dealing with things that are not found
- coming up with a strategy for dealing with references that doesn't involve hard-coded namespaces or inferring from random resources
Basically I wonder if there's any local fix right here (where's the canonical place to discover the operand namespace in this context?).
Oh, snap, good point! Well, does it make sense to look for events if the service does not exist?
The lookup isn't dependent on the service, and the lookup is from a cache, so we could choose to provide whatever context we can (consistent with the other possibly-nil inputs).
If the service doesn't exist, then computeLoadBalancerStatus won't even look at the events, and logically, why would it? Do we anticipate that, with future changes, we might care about events that are not related to the service?
> If the service doesn't exist, then computeLoadBalancerStatus won't even look at the events
computeLoadBalancerStatus won't, but this function doesn't know that. This function just knows computeLoadBalancerStatus wants events in a namespace. I propose this contract:
// syncIngressControllerStatus takes whatever you give it and updates status.
// Give it whatever you have and it will do its best.
func (r *reconciler) syncIngressControllerStatus(ic *operatorv1.IngressController, deployment *appsv1.Deployment, service *corev1.Service, operandEvents []corev1.Event) error {
😁
That seems reasonable... or would it be simpler for syncIngressControllerStatus to look up events itself?
could perhaps ask the same question regarding all the other arguments... for now, unless there's a logic error, maybe we can continue the style discussion on subsequent PRs?
> could perhaps ask the same question regarding all the other arguments...
No, syncIngressControllerStatus needs to know the ingress controller for which it is updating status, and ensureIngressController just created or got the deployment and service, which are sound reasons for ensureIngressController to pass those values to syncIngressControllerStatus, whereas listing events in ensureIngressController instead of in syncIngressControllerStatus gratuitously separates the logic of listing events from the logic that determines whether the events need to be listed.
That is not to say that your current approach is unacceptable; the above is only responding to the above-quoted assertion.
That said, if ensureIngressController does handle listing events but gets an error, how about passing a nil slice to syncIngressControllerStatus?
> for now, unless there's a logic error, maybe we can continue the style discussion on subsequent PRs?
That's fine.
if err := r.cache.List(context.TODO(), operandEvents, client.InNamespace("openshift-ingress")); err != nil {
	errs = append(errs, fmt.Errorf("failed to list events in namespace %q: %v", "openshift-ingress", err))
} else {
	if err := r.syncIngressControllerStatus(ci, deployment, lbService, operandEvents.Items); err != nil {
Why should failure to list events inhibit status updating? We update status even if getting the service fails.
Seems related to the stuff we've talked about in the area of #245 (comment) — I do agree with you. The status function should be able to function without the event set in this case.
However, there is a subtle difference between an empty event list from a successful API call and one that's empty because the API call failed. In the latter case, the downstream consumer (status function) has a reduced trust level in the input. Like, if the status function relies on the absence of a value to decide there's a serious problem, should we choose to call the status function with an "undefined" input?
In this case the decision won't affect something serious like availability...
I fixed what I think is the core complaint (event lookup failure no longer prevents status sync).
conditions = append(conditions, operatorv1.OperatorCondition{
	Type: operatorv1.LoadBalancerReadyIngressConditionType,
	Status: operatorv1.ConditionFalse,
	Reason: "LoadBalancerPending",
	Message: "The LoadBalancer service is pending",
	Reason: "LoadBalancerNotFound",
How about "LoadBalancer"→"Service" since "LoadBalancer" could refer to either the service or the cloud LB?
fixed
conditions = append(conditions, operatorv1.OperatorCondition{
	Type: operatorv1.LoadBalancerReadyIngressConditionType,
	Status: operatorv1.ConditionTrue,
	Reason: "LoadBalancerProvisioned",
	Message: "The LoadBalancer service is provisioned",
})
default:
case isPending(service):
Why not make this the default case?
@Miciah if "Pending" is the default case, does the current default ConditionUnknown get removed?
Yes. It is dead code anyway.
I went ahead and removed the default case as it's unreachable, but I left isProvisioned and isPending. I believe isProvisioned and default (replacing isPending) would be a functional alternative, but even so I guessed having explicitly named cases would aid readability. What do you think?
It's fine. I wonder whether the compiler is smart enough (and the language semantics are flexible enough) to optimize the isPending call out.
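For reference, the explicitly named cases discussed here can be sketched as a tagless switch. In Go, the cases of such a switch are evaluated top to bottom and evaluation stops at the first case that is true, so `isPending` is only called when `isProvisioned` returned false. The `Service` struct, the helper bodies, and the "ServiceNotFound" reason string below are assumptions made for illustration; the real helpers inspect a `*corev1.Service`.

```go
package main

// Service is a hypothetical stand-in for corev1.Service; IngressPoints stands
// in for the number of entries in the service's load balancer status.
type Service struct {
	IngressPoints int
}

// isProvisioned and isPending are illustrative versions of the helpers named
// in the review. In this sketch they are exact complements of each other.
func isProvisioned(s *Service) bool { return s.IngressPoints > 0 }
func isPending(s *Service) bool     { return s.IngressPoints == 0 }

// loadBalancerReadyReason picks a condition reason using explicitly named
// cases instead of a default case.
func loadBalancerReadyReason(s *Service) string {
	switch {
	case s == nil:
		return "ServiceNotFound" // assumed spelling of the renamed reason
	case isProvisioned(s):
		return "LoadBalancerProvisioned"
	case isPending(s):
		return "LoadBalancerPending"
	}
	// Unreachable in this sketch because isPending is the negation of
	// isProvisioned, but Go still requires a terminating return here.
	return ""
}
```

Replacing the last case with `default` would remove both the extra call and the trailing return, at the cost of the self-describing case name.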
return corev1.Event{
	Type: "Warning",
	Reason: "CreatingLoadBalancerFailed",
	Message: "failed to ensure load balancer for service openshift-ingress/router-default: TooManyLoadBalancers: Exceeded quotaof account",
Is "quotaof" a typo?
Yes, fixed
	pendingLBService("default"),
	clusterIPservice("default"),
},
name: "lb pending, no events",
Can this be "no events for current lb" with events: []corev1.Event{schedulerEvent(), failedCreateLBEvent("secondary")}?
Great improvement, fixed
I found a typo. Otherwise, lgtm.
return corev1.Event{
	Type: "Warning",
	Reason: "CreatingLoadBalancerFailed",
	Message: "failed to ensure load balancer for service openshift-ingress/router-default: TooManyLoadBalancers: Exceeded quot aof account",
s/quot aof/quota of/
Fixed... again (my fix for the original had a typo 🤦‍♀️)
When an LB is determined to be pending, try and surface more detail by analyzing events in the operand namespace related to the LB service, and if an LB creation failure event is detected, propagate the message into IngressController status and provide a more specific reason. Replace the use of indexers with simpler inline cache lookups now that the manager cache is available.
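The behavior described in this commit message can be sketched roughly as follows: start from a generic pending condition and, if a matching creation-failure event exists for the LB service, replace the reason and message with the more specific ones. All types here are hypothetical stand-ins for `corev1.Event` and `operatorv1.OperatorCondition`, and reusing the event reason as the condition reason is an assumption, not something the conversation confirms.

```go
package main

// Event is a hypothetical stand-in for corev1.Event.
type Event struct {
	SourceComponent string // stands in for Event.Source.Component
	Reason          string
	Message         string
	InvolvedKind    string // stands in for Event.InvolvedObject.Kind
	InvolvedName    string // stands in for Event.InvolvedObject.Name
}

// Condition is a hypothetical stand-in for operatorv1.OperatorCondition.
type Condition struct {
	Type, Status, Reason, Message string
}

const createFailedReason = "CreatingLoadBalancerFailed"

// pendingLoadBalancerCondition builds the condition for a pending LB service.
// It only considers failure events from service-controller that refer to the
// named service, so events for other load balancers are ignored.
func pendingLoadBalancerCondition(serviceName string, events []Event) Condition {
	cond := Condition{
		Type:    "LoadBalancerReady", // illustrative string value
		Status:  "False",
		Reason:  "LoadBalancerPending",
		Message: "The LoadBalancer service is pending",
	}
	for _, e := range events {
		if e.SourceComponent != "service-controller" || e.Reason != createFailedReason {
			continue
		}
		if e.InvolvedKind != "Service" || e.InvolvedName != serviceName {
			continue
		}
		// Assumption: the event reason doubles as the condition reason.
		cond.Reason = createFailedReason
		cond.Message = e.Message
		break
	}
	return cond
}
```

The "no events for current lb" test case suggested in review corresponds to calling this with only a scheduler event and a failure event for a different service, which should leave the generic pending reason in place.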
d765799
to
9813929
Compare
Went ahead and squashed.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ironcladlou, Miciah
TODO