
Unhandled Exception on Error 429 - Too Many Requests #1011

Open
jkupferer opened this issue Mar 6, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@jkupferer
Contributor

jkupferer commented Mar 6, 2023

Long story short

When a watch receives a 429 error (Too Many Requests), the watch fails outright rather than backing off and retrying.

Kopf version

1.36.0

Kubernetes version

1.23.12.12

Python version

No response

Code

# Unable to reliably reproduce.

Logs

{"message": "Watcher for oauthaccesstokens.v1.oauth.openshift.io@none has failed: (None, None)", "exc_info": "... see below ...", "timestamp": "2023-02-27T15:47:27.234080+00:00", "severity": "error"}

# exc_info reformatted with line breaks for readability:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/opt/app-root/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://172.30.0.1:443/apis/oauth.openshift.io/v1/oauthaccesstokens?watch=true&resourceVersion=799875280')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 82, in infinite_watch
    async for raw_event in stream:
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 186, in continuous_watch
    async for raw_input in stream:
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 251, in watch_objs
    async for raw_input in api.stream(
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
    raise cls(payload, status=response.status) from e
kopf._cogs.clients.errors.APIClientError: (None, None)

Additional information

I believe a 429 Too Many Requests response should be handled with error backoff similar to an APIServerError?

@jkupferer jkupferer added the bug Something isn't working label Mar 6, 2023
@kriscode1

I believe a 429 Too Many Requests response should be handled with error backoff similar to an APIServerError?

We have a similar issue that this solution would also assist with. We need our operator to manage a lot of pre-existing resources in the cluster. Since there's no limit to the number of requests made, the first run of the operator causes 429s because it tries to annotate many objects in a very short time.

@psontag
Contributor

psontag commented Mar 28, 2023

This seems to be a duplicate of #962 which should be resolved with #963. The changes are not yet released though.

@a-m-w

a-m-w commented Mar 31, 2023

@nolar Would it be possible to consider cutting a new release? Thanks in advance!

@dheeg

dheeg commented Oct 26, 2023

Was somebody able to fix this error? Our operator (kopf 1.36.2) has to deal with several thousand resources as well on startup. From time to time the following lethal error occurs.

After it happens, the finalizers are stuck and the operator basically can't manage the resource type anymore.

Watcher for grafanaalerts.v1.example.com@none has failed: (None, None)

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://10.32.0.1:443/apis/example.com/v1/grafanaalerts?watch=true&resourceVersion=1771761318')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/usr/local/lib/python3.11/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
               ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
    raise cls(payload, status=response.status) from e
kopf._cogs.clients.errors.APIClientError: (None, None)

@dheeg

dheeg commented Oct 26, 2023

@nolar Do you have an idea about this by chance? It looks like there are types of requests which will not be retried. Maybe this is also the reason behind #1019

@wochinge

wochinge commented Jan 7, 2025

@nolar What do you think of exposing #1011 (comment) as a variable?

Seems quite simple and it should be an easy fix to this issue

@nolar
Owner

nolar commented Jan 7, 2025

@nolar What do you think of exposing #1011 (comment) as a variable?

To expose as a variable that is used where and how? Sorry, I didn't fully get the proposed solution. Can you please clarify?

Having a limit variable (e.g. in settings) would require a centralized synchronization mechanism somewhere in memory, some kind of a semaphore with locks & timers. Kopf currently does not have anything of that kind, and I am not sure it is a good idea to add that level of complexity (though it is doable — it is a relatively well-established pattern of a server-side rate limiter).


The original problem with many objects in the cluster causing too many requests can be indirectly addressed by using the custom storages (there are 2 of them in settings: diffbase & progress storage — see docs). This can be anything from a key-value store (Redis) to a relational database (Postgres), or even a shared filesystem — anything persistent enough to outlive the operator's pod.

Out of the box, it is the objects' annotations — for simplicity — but annotations are not a requirement.


UPD: I am talking about patching requests. The rate limiting on watchers cannot be solved with a storage, of course. But that is already fixed with backoff intervals in settings.
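For context, the backoff intervals mentioned here are configurable via `settings.networking.error_backoffs` in a startup handler. A minimal sketch — the delay values are arbitrary examples, not recommendations:

```python
import random

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_) -> None:
    # Delays (in seconds) between retries of failed API requests, such as
    # the 429s above, instead of letting the watcher die. A bit of random
    # jitter keeps many watchers from retrying in lockstep.
    settings.networking.error_backoffs = [b + random.random() for b in (10, 20, 30)]
```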

@wochinge

wochinge commented Jan 8, 2025

Thanks for the rapid answer and quick support!

The original problem with many objects in the cluster causing too many requests can be indirectly addressed by using the custom storages (there are 2 of them in settings: diffbase & progress storage — see docs).

Thanks for the pointer! If I understand the code correctly, then Kopf does the following:

  1. List all objects via Kubernetes API (here)
  2. Fetch the body of each found object to calculate the diff and, with it, the implications for the handlers

That means upon startup we have number_of_namespaces x number_resources requests to the API and then number_found_resources requests to the storage (by default also the API). Is that correct?


Kopf currently does not have anything of that kind, and I am not sure if this is a good idea to add such level of complexity (though doable — it is a relatively well-established pattern of a server-side rate limiter).

Fully agree. Maybe my understanding has a flaw, let me try to explain:

  1. As far as I understand, Kopf uses one shared APIContext. This shared APIContext has one configured aiohttp.ClientSession (done here)
  2. aiohttp.TCPConnector allows setting the limit parameter (documentation here). With that, we can restrict the number of simultaneous requests that aiohttp makes.

That for me means:

  • We can do client-side throttling via aiohttp by setting the limit param (which, by the way, defaults to 100 to avoid problems such as the one we have currently). This would avoid the burst of requests towards the K8s API and hence avoid running into any limits there
  • Apart from exposing the limit setting in Kopf for configuration, we don't need to do anything else in Kopf. aiohttp handles everything for us.
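As a minimal sketch of the limit idea with a standalone aiohttp session (the value 20 is purely illustrative; Kopf constructs its own session internally, so this shows the concept rather than a Kopf integration):

```python
import asyncio

import aiohttp

async def main() -> None:
    # `limit` caps the number of simultaneous connections; aiohttp's
    # default is 100. The value 20 here is an arbitrary example.
    connector = aiohttp.TCPConnector(limit=20)
    async with aiohttp.ClientSession(connector=connector) as session:
        # All requests made through this session now share the limit.
        pass

asyncio.run(main())
```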

Taking all of this together, the solution for the above problem seems to be (again, please correct my minimal understanding):

  1. Limit concurrent aiohttp requests to avoid burst requests to K8s API
  2. Configure network backoff with Jitter (as in this example) to avoid that every declined request is retried at the same time
  3. Use a separate AnnotationsDiffBaseStorage to avoid calling the K8s API for every handled object
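Step 2 could be sketched in plain Python — `jittered` is a hypothetical helper and the base delays are placeholders; the resulting list could then be assigned to `settings.networking.error_backoffs`:

```python
import random

def jittered(delays: list[float], jitter: float = 1.0) -> list[float]:
    # Add up to `jitter` seconds of randomness to each base delay so that
    # declined requests are not all retried at the same moment.
    return [d + random.uniform(0.0, jitter) for d in delays]

backoffs = jittered([10, 20, 30])
```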

@nolar
Owner

nolar commented Jan 8, 2025

That means upon startup we have number_of_namespaces x number_resources requests to the API and then number_found_resources requests to the storage (by default also the API). Is that correct?

Kopf does one API request per resource kind: the "watching" request (GET …?watch=true), which streams the objects in it. All resource bodies are taken from that stream, not from individual GET requests (Kopf does not do GETs at all). The request is cluster-wide if the operator is started cluster-wide, and per-namespace if the operator is namespace-scoped (as defined by the CLI options).

Kopf can also make a few initial requests — also via streaming — to scan the available namespaces and CRDs, in order to resolve the "globs" (masks) of namespaces & resources if the masks are uncertain, e.g. kopf run -n myapp-* or similar. If there is no uncertainty, this is not performed.

All in all, the number of "reading" requests is very limited.

The number of requests to storage is indeed high, but in the case of the AnnotationsStorage(s), each is an in-memory lookup of a value in the huge dict of the resource body; it does not go to the API.

The number of "writing" API requests usually explodes when those objects are stored back to Kubernetes — via PATCH requests, after their annotations were modified. This is why I propose an external storage: once Kopf has no need to patch the annotations, the "writing" requests will be gone too.


2. aiohttp.TCPConnector allows setting the limit parameter (documentation here). With that we can restrict the number of simultaneous requests with aiohttp makes.

Oh, I see now. Thanks for clarifying.

Yes, that can be made configurable. But keep in mind that it is a limit on the number of concurrent requests at any moment of time. It does not provide any timing limitations or priorities. As such, it can lead to 2 issues:

  1. Some watching streams will not start quickly, as they will be blocked by other requests or watching streams, so Kopf will remain blind to some resources.
  2. It does not protect against the server-side API rate limiting (429 errors), as all these concurrent requests can be fast, going one after another and still hitting the server limits.

Maybe separating the APIContext into 2 contexts (one for watch-streams, another for patches) would be a good idea — this would solve the 1st problem. But the solution is nevertheless partial.

What can be done is exposing APIContext subclassing to the developers somehow via settings, so that they can do whatever they want with the connection management.

The only objection in the past was that the API client is a hidden internal of Kopf; there were already 2 major changes of the API client in the past, and there could be more in the future (in theory). Exposing this class and binding it to aiohttp would make such settings forward-incompatible. But since active development of Kopf is not happening now (I have no time, and no ideas of what to do), this might be an option.

Here are my thoughts on this ;-) What do you think? Given a custom APIContext sub-class, will this help to implement API throttling?

@wochinge

wochinge commented Jan 8, 2025

The number of requests to storage is indeed high, but in the case of the AnnotationsStorage(s), each is an in-memory lookup of a value in the huge dict of the resource body; it does not go to the API.

Ah, I see. So once we create the watcher, we initially list all resources of the kind and then take their data to fill the in-memory annotation storage.


Some watching streams will not start quickly, as they will be blocked by other requests or watching streams, so Kopf will remain blind to some resources.

I think that's fine. It will catch up eventually.

It does not protect against the server-side API rate limiting (429 errors), as all these concurrent requests can be fast, going one after another and still hitting the server limits.

Yep, for that we (as in, the kopf users) should implement backoff with jitter.

What can be done is exposing APIContext subclassing to the developers somehow via settings, so that they can do whatever they want with the connection management.

That's even more than we need, but definitely the most powerful solution 👍🏻 The only ask I'd have: can we extract the aiohttp.ClientSession creation into a separate method, so that we can override just that specific method and don't need to reimplement the entire __init__?

@nolar
Owner

nolar commented Jan 8, 2025

Ah, I see. So once we create the watcher, we initially list all resources of the kind and then take their data to fill the in-memory annotation storage.

Correct.

That's even more than we need, but definitely the most powerful solution

The rationale is this: Kopf is a Kubernetes-specialized tool, not an HTTP client tool. Adding too many HTTP-specific nuances is undesired. The use-case above is very specific and, I'd say, rare (opinionated). So, instead of adding this functionality to Kopf, some hook should be exposed, allowing the users of Kopf to do all the sophisticated trickery specific to their cases & environments. And this is somewhere around APIContext, plus-minus.

So (thinking out loud), a few changes are needed:

  • Redefine APIContext so that its __init__() calls self._make_aiohttp_client() instead of self.session = aiohttp.ClientSession(…). Maybe also move some of the surroundings (headers, auth, etc.) into this new method.
  • Expose the context class to be overridden. Conceptually, there are 2 ways: as on-login handlers returning a credentials object (more dynamic); or as a factory/class in settings (more simple).
  • Use that class/factory in @authenticated(), in the line async for key, info, context in vault.extended(APIContext, 'contexts'): … — which would require the settings to be available there (via context vars, probably).
  • Add the setting itself to NetworkingSettings._api_context_factory or ._api_context_cls.
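The first bullet could look roughly like this — a hypothetical sketch of the proposed seam, not Kopf's actual code; the limit value is an arbitrary example:

```python
import asyncio

import aiohttp

class APIContext:
    # Hypothetical simplification of Kopf's internal APIContext.
    def __init__(self) -> None:
        self.session = self._make_aiohttp_client()

    def _make_aiohttp_client(self) -> aiohttp.ClientSession:
        # The seam to override: connectors, headers, auth would live here.
        return aiohttp.ClientSession()

class ThrottledAPIContext(APIContext):
    # A user-provided subclass capping concurrent connections.
    def _make_aiohttp_client(self) -> aiohttp.ClientSession:
        return aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=20))
```

A settings entry (e.g. the suggested `_api_context_cls`) would then point at `ThrottledAPIContext` so that the authentication machinery instantiates it instead of the default.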

Maybe the whole Vault can be redesigned in favour of simplicity: instead of providing a dynamic factory, just embed the factory as a constructor argument. Essentially, the "vault" is exactly a "connection pool" of APIContexts/sessions keyed by credentials, with some invalidation & re-authentication logic in it.

It seems somewhat complex, but not complicated. I might take a look in the following weeks, but no promise (as I said, no time).
