-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Controllers should retry and re-queue based on status codes #283
Comments
@Wonderful-Space and @nikolay-ngrok both interested, will prioritize accordingly. |
In the issue description I had noted
However, we recently found an issue in the httpsEdge controller where it currently is doing this (not retrying for 400's) and it's causing an issue. To recap how the Ingress controller works overall a bit, there are multiple controllers all working asynchronously in concert to drive the system to an eventually consistent state. Specifically, when a user creates an ingress object, the ingress_controller converts this information into Domain, HttpsEdge, and Tunnel CRD's in the k8s api. Then each of these respective controllers work in parallel to reconcile these CRDs by reserving domains, creating edges (and the other related configurations), and spinning up ngrok-go tunnels. This parallel and async nature means that the https_edge_controller could try to create the edge before the domain controller has reserved the domain. By default, if a controller returns an error during its reconcile loop, it will re-queue the event to retry later. This could be seen in this issue where we consistently saw a slew of errors and retries https://github.com/ngrok/kubernetes-ingress-controller/issues/195 Recently we updated the edge controller to not retry on certain ngrok api errors with the aim of not continually retrying bad route-module configurations. We now take the error and check if its "retryable" based on its >500 We realized though that when creating an edge, during this async controller race, the edge controller can fail to create the edge if the domain isn't registered which is a 400 error and should be retried soon
This may be worth breaking out into its own PR to resolve or its own issue, but I thought it might be worth leaving a note here to take into account for any other possible API error codes that should be retried. @nikolay-ngrok also brought up the concern that if we just retry errors like this all the time, we could put unnecessary pressure on the API. With the controller-runtime library, we can control if the event should requeue and how long it should wait based on https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Result . I think internally it also does some backoff, but we'd have to look into that. It may be worth looking into having a more robust exponential backoff for particular errors as talked about in some of this guides |
What happened
Today, when trying to make API calls to ngrok in some of our controllers, we receive a generic ngrok api error and thus re-que the event to be reconciled again immediately.
What you think should happen instead
We should instead see the ngrok api error and re-queue events appropriately.
The text was updated successfully, but these errors were encountered: