Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controllers should retry and re-queue based on status codes #283

Closed
alex-bezek opened this issue Jul 14, 2023 · 2 comments · Fixed by #288
Closed

Controllers should retry and re-queue based on status codes #283

alex-bezek opened this issue Jul 14, 2023 · 2 comments · Fixed by #288
Assignees
Labels
area/controller Issues dealing with the controller bug Something isn't working

Comments

@alex-bezek
Copy link
Collaborator

alex-bezek commented Jul 14, 2023

What happened

Today, when trying to make API calls to ngrok in some of our controllers, we receive a generic ngrok api error and thus re-que the event to be reconciled again immediately.

What you think should happen instead

We should instead see the ngrok api error and re-queue events appropriately.

  • If the error is >=500 we should re-queue as usual. This should mean there is some connectivity issue or a problem in the moment with our API that should be retried
  • if the error is specifically a 429, we should re-queue with an extended backoff time
  • for other 400 bad request errors, we should not re-queue the event. This is the case where you try to reserve an already reserved name, or something in general is not being accepted by our API and changes need to happen.
@alex-bezek alex-bezek added bug Something isn't working area/controller Issues dealing with the controller needs-triage Issues that need triage labels Jul 14, 2023
@CK-Ward
Copy link
Contributor

CK-Ward commented Jul 24, 2023

@Wonderful-Space and @nikolay-ngrok both interested, will prioritize accordingly.

@alex-bezek
Copy link
Collaborator Author

In the issue description I had noted

for other 400 bad request errors, we should not re-queue the event

However, we recently found an issue in the httpsEdge controller where it currently is doing this (not retrying for 400's) and it's causing an issue.

To recap how the Ingress controller works overall a bit, there are multiple controllers all working asynchronously in concert to drive the system to an eventually consistent state. Specifically, when a user creates an ingress object, the ingress_controller converts this information into Domain, HttpsEdge, and Tunnel CRD's in the k8s api. Then each of these respective controllers work in parallel to reconcile these CRDs by reserving domains, creating edges (and the other related configurations), and spinning up ngrok-go tunnels.

This parallel and async nature means that the https_edge_controller could try to create the edge before the domain controller has reserved the domain. By default, if a controller returns an error during its reconcile loop, it will re-queue the event to retry later. This could be seen in this issue where we consistently saw a slew of errors and retries https://github.com/ngrok/kubernetes-ingress-controller/issues/195

Recently we updated the edge controller to not retry on certain ngrok api errors with the aim of not continually retrying bad route-module configurations. We now take the error and check if its "retryable" based on its >500
https://github.com/ngrok/kubernetes-ingress-controller/blob/41940b5d11532eb6393e3396ed95fb62a0aa68dd/internal/controllers/httpsedge_controller.go#L129-L137
https://github.com/ngrok/kubernetes-ingress-controller/blob/b603c501e74f0b394e8ee1c3e0b9cd8ea3a6716e/internal/errors/errors.go#L157-L172

We realized though that when creating an edge, during this async controller race, the edge controller can fail to create the edge if the domain isn't registered which is a 400 error and should be retried soon

"HTTP 400: The domain 'my-domain.ngrok.app:443' is not reserved.

This may be worth breaking out into its own PR to resolve or its own issue, but I thought it might be worth leaving a note here to take into account for any other possible API error codes that should be retried.

@nikolay-ngrok also brought up the concern that if we just retry errors like this all the time, we could put unnecessary pressure on the API. With the controller-runtime library, we can control if the event should requeue and how long it should wait based on https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Result . I think internally it also does some backoff, but we'd have to look into that. It may be worth looking into having a more robust exponential backoff for particular errors as talked about in some of this guides
https://stuartleeks.com/posts/error-back-off-with-controller-runtime/
https://danielmangum.com/posts/controller-runtime-client-go-rate-limiting/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Issues dealing with the controller bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants