Controllers should retry and re-queue based on status codes #283

alex-bezek · 2023-07-14T05:53:35Z

What happened

Today, when trying to make API calls to ngrok in some of our controllers, we receive a generic ngrok api error and thus re-que the event to be reconciled again immediately.

What you think should happen instead

We should instead see the ngrok api error and re-queue events appropriately.

If the error is >=500 we should re-queue as usual. This should mean there is some connectivity issue or a problem in the moment with our API that should be retried
if the error is specifically a 429, we should re-queue with an extended backoff time
for other 400 bad request errors, we should not re-queue the event. This is the case where you try to reserve an already reserved name, or something in general is not being accepted by our API and changes need to happen.

CK-Ward · 2023-07-24T18:18:14Z

@Wonderful-Space and @nikolay-ngrok both interested, will prioritize accordingly.

alex-bezek · 2023-07-26T00:20:34Z

In the issue description I had noted

for other 400 bad request errors, we should not re-queue the event

However, we recently found an issue in the httpsEdge controller where it currently is doing this (not retrying for 400's) and it's causing an issue.

To recap how the Ingress controller works overall a bit, there are multiple controllers all working asynchronously in concert to drive the system to an eventually consistent state. Specifically, when a user creates an ingress object, the ingress_controller converts this information into Domain, HttpsEdge, and Tunnel CRD's in the k8s api. Then each of these respective controllers work in parallel to reconcile these CRDs by reserving domains, creating edges (and the other related configurations), and spinning up ngrok-go tunnels.

This parallel and async nature means that the https_edge_controller could try to create the edge before the domain controller has reserved the domain. By default, if a controller returns an error during its reconcile loop, it will re-queue the event to retry later. This could be seen in this issue where we consistently saw a slew of errors and retries https://github.com/ngrok/kubernetes-ingress-controller/issues/195

Recently we updated the edge controller to not retry on certain ngrok api errors with the aim of not continually retrying bad route-module configurations. We now take the error and check if its "retryable" based on its >500
https://github.com/ngrok/kubernetes-ingress-controller/blob/41940b5d11532eb6393e3396ed95fb62a0aa68dd/internal/controllers/httpsedge_controller.go#L129-L137
https://github.com/ngrok/kubernetes-ingress-controller/blob/b603c501e74f0b394e8ee1c3e0b9cd8ea3a6716e/internal/errors/errors.go#L157-L172

We realized though that when creating an edge, during this async controller race, the edge controller can fail to create the edge if the domain isn't registered which is a 400 error and should be retried soon

"HTTP 400: The domain 'my-domain.ngrok.app:443' is not reserved.

This may be worth breaking out into its own PR to resolve or its own issue, but I thought it might be worth leaving a note here to take into account for any other possible API error codes that should be retried.

@nikolay-ngrok also brought up the concern that if we just retry errors like this all the time, we could put unnecessary pressure on the API. With the controller-runtime library, we can control if the event should requeue and how long it should wait based on https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Result . I think internally it also does some backoff, but we'd have to look into that. It may be worth looking into having a more robust exponential backoff for particular errors as talked about in some of this guides
https://stuartleeks.com/posts/error-back-off-with-controller-runtime/
https://danielmangum.com/posts/controller-runtime-client-go-rate-limiting/

alex-bezek added bug Something isn't working area/controller Issues dealing with the controller needs-triage Issues that need triage labels Jul 14, 2023

CK-Ward assigned nikolay-ngrok and CarlAmko and unassigned CarlAmko and nikolay-ngrok Jul 24, 2023

nikolay-ngrok self-assigned this Jul 26, 2023

nikolay-ngrok mentioned this issue Jul 31, 2023

unified reconciling and error handling #288

Merged

nikolay-ngrok closed this as completed in #288 Aug 1, 2023

jonstacks removed the needs-triage Issues that need triage label Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Controllers should retry and re-queue based on status codes #283

Controllers should retry and re-queue based on status codes #283

alex-bezek commented Jul 14, 2023 •

edited

Loading

CK-Ward commented Jul 24, 2023

alex-bezek commented Jul 26, 2023

Controllers should retry and re-queue based on status codes #283

Controllers should retry and re-queue based on status codes #283

Comments

alex-bezek commented Jul 14, 2023 • edited Loading

What happened

What you think should happen instead

CK-Ward commented Jul 24, 2023

alex-bezek commented Jul 26, 2023

alex-bezek commented Jul 14, 2023 •

edited

Loading