
Unhandled Exception on Error 429 - Too Many Requests #1011

Open
jkupferer opened this issue Mar 6, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@jkupferer
Contributor

jkupferer commented Mar 6, 2023

Long story short

When a watch receives a 429 error (Too Many Requests), the watch fails outright rather than backing off and retrying.

Kopf version

1.36.0

Kubernetes version

1.23.12.12

Python version

No response

Code

# Unable to reliably reproduce.

Logs

{"message": "Watcher for oauthaccesstokens.v1.oauth.openshift.io@none has failed: (None, None)", "exc_info": "... see below ...", "timestamp": "2023-02-27T15:47:27.234080+00:00", "severity": "error"}

# exc_info reformatted with line breaks for readability:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/opt/app-root/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://172.30.0.1:443/apis/oauth.openshift.io/v1/oauthaccesstokens?watch=true&resourceVersion=799875280')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 82, in infinite_watch
    async for raw_event in stream:
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 186, in continuous_watch
    async for raw_input in stream:
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 251, in watch_objs
    async for raw_input in api.stream(
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
  File "/opt/app-root/lib64/python3.9/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
    raise cls(payload, status=response.status) from e
kopf._cogs.clients.errors.APIClientError: (None, None)

Additional information

I believe a 429 Too Many Requests response should be handled with error backoff similar to an APIServerError?

@jkupferer jkupferer added the bug Something isn't working label Mar 6, 2023
@kriscode1

I believe a 429 Too Many Requests response should be handled with error backoff similar to an APIServerError?

We have a similar issue that this solution would also assist with. We need our operator to manage a lot of pre-existing resources in the cluster. Since there's no limit to the number of requests made, the first run of the operator causes 429s because it tries to annotate many objects in a very short time.

@psontag
Contributor

psontag commented Mar 28, 2023

This seems to be a duplicate of #962 which should be resolved with #963. The changes are not yet released though.

@a-m-w

a-m-w commented Mar 31, 2023

@nolar Would it be possible to consider cutting a new release? Thanks in advance!

@dheeg

dheeg commented Oct 26, 2023

Was somebody able to fix this error? Our operator (kopf 1.36.2) has to deal with several thousand resources as well on startup. From time to time the following lethal error occurs.

After it happens, the finalizers are stuck and the operator basically can't manage the resource type anymore.

Watcher for grafanaalerts.v1.example.com@none has failed: (None, None)

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
    response.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://10.32.0.1:443/apis/example.com/v1/grafanaalerts?watch=true&resourceVersion=1771761318')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/usr/local/lib/python3.11/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
               ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/api.py", line 85, in request
    await errors.check_response(response)  # but do not parse it!
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
    raise cls(payload, status=response.status) from e
kopf._cogs.clients.errors.APIClientError: (None, None)

@dheeg

dheeg commented Oct 26, 2023

@nolar Do you have an idea about this by chance? It looks like there are types of requests which will not be retried. Maybe this is also the reason behind #1019

@wochinge

wochinge commented Jan 7, 2025

@nolar What do you think of exposing #1011 (comment) as a variable?

Seems quite simple and it should be an easy fix to this issue

@nolar
Owner

nolar commented Jan 7, 2025

@nolar What do you think of exposing #1011 (comment) as a variable?

To expose as a variable that is used where and how? Sorry, I didn't fully get the proposed solution. Can you please clarify?

Having a limit variable (e.g. in settings) would require a centralized synchronization mechanism somewhere in memory, some kind of a semaphore with locks & timers. Kopf currently does not have anything of that kind, and I am not sure it is a good idea to add that level of complexity (though it is doable — it is a relatively well-established pattern of a server-side rate limiter).


The original problem with many objects in the cluster causing too many requests can be indirectly addressed by using the custom storages (there are 2 of them in settings: diffbase & progress storage — see docs). This can be anything from a key-value store (Redis) to a relational database (Postgres), or even a shared filesystem — anything persistent enough to outlive the operator's pod.

Out of the box, it is the objects' annotations — for simplicity — but annotations are not a requirement.


UPD: I am talking about patching requests. The rate limiting on watchers cannot be solved with a storage, of course. But that is already fixed with backoff intervals in settings.
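For context, the backoff intervals mentioned here are configurable via `settings.networking.error_backoffs` in a startup handler. A minimal sketch — the delay values are arbitrary examples, not recommendations:

```python
import random

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_) -> None:
    # Delays (in seconds) between retries of failed API requests, such as
    # the 429s above, instead of letting the watcher die. A bit of random
    # jitter keeps many watchers from retrying in lockstep.
    settings.networking.error_backoffs = [b + random.random() for b in (10, 20, 30)]
```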

@wochinge

wochinge commented Jan 8, 2025

Thanks for the rapid answer and quick support!

The original problem with many objects in the cluster causing too many requests can be indirectly addressed by using the custom storages (there are 2 of them in settings: diffbase & progress storage — see docs).

Thanks for the pointer! If I understand the code correctly, then Kopf does the following:

  1. List all objects via Kubernetes API (here)
  2. Fetch the body of each found object to calculate the diff and, with it, the implications for the handlers

That means upon startup we have number_of_namespaces x number_resources requests to the API and then number_found_resources requests to the storage (by default also the API). Is that correct?


Kopf currently does not have anything of that kind, and I am not sure if this is a good idea to add such level of complexity (though doable — it is a relatively well-established pattern of a server-side rate limiter).

Fully agree. Maybe my understanding has a flaw, let me try to explain:

  1. As far as I understand, Kopf uses one shared APIContext. This shared APIContext has one configured aiohttp.ClientSession (done here)
  2. aiohttp.TCPConnector allows setting the limit parameter (documentation here). With that, we can restrict the number of simultaneous requests that aiohttp makes.

That for me means:

  • We can do client-side throttling via aiohttp by setting the limit param (which, by the way, defaults to 100 to avoid problems such as the one we have currently). This would avoid the burst of requests towards the K8s API and hence avoid running into any limits there
  • Apart from exposing the limit setting in Kopf for configuration, we don't need to do anything else in Kopf. aiohttp handles everything for us.
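As a minimal sketch of the limit idea with a standalone aiohttp session (the value 20 is purely illustrative; Kopf constructs its own session internally, so this shows the concept rather than a Kopf integration):

```python
import asyncio

import aiohttp

async def main() -> None:
    # `limit` caps the number of simultaneous connections; aiohttp's
    # default is 100. The value 20 here is an arbitrary example.
    connector = aiohttp.TCPConnector(limit=20)
    async with aiohttp.ClientSession(connector=connector) as session:
        # All requests made through this session now share the limit.
        pass

asyncio.run(main())
```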

Taking all of this together, the solution for the above problem seems to be (again, please correct my minimal understanding):

  1. Limit concurrent aiohttp requests to avoid burst requests to K8s API
  2. Configure network backoff with Jitter (as in this example) to avoid that every declined request is retried at the same time
  3. Use a separate AnnotationsDiffBaseStorage to avoid calling the K8s API for every handled object
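Step 2 could be sketched in plain Python — `jittered` is a hypothetical helper and the base delays are placeholders; the resulting list could then be assigned to `settings.networking.error_backoffs`:

```python
import random

def jittered(delays: list[float], jitter: float = 1.0) -> list[float]:
    # Add up to `jitter` seconds of randomness to each base delay so that
    # declined requests are not all retried at the same moment.
    return [d + random.uniform(0.0, jitter) for d in delays]

backoffs = jittered([10, 20, 30])
```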

@nolar
Owner

nolar commented Jan 8, 2025

That means upon startup we have number_of_namespaces x number_resources requests to the API and then number_found_resources requests to the storage (by default also the API). Is that correct?

Kopf does one API request per resource kind: the "watching" request (GET …?watch=true), which streams the objects in it. All resource bodies are taken from that stream, not from individual GET requests (Kopf does not do GETs at all). The request is cluster-wide if the operator is started cluster-wide, and per-namespace if the operator is namespace-scoped (as defined by the CLI options).

Kopf can also make a few initial requests — also via streaming — to scan the available namespaces and CRDs, in order to resolve the "globs" (masks) of namespaces & resources if the masks are uncertain, e.g. kopf run -n myapp-* or similar. If there is no uncertainty, this is not performed.

All in all, the number of "reading" requests is very limited.

The number of requests to storage is indeed high, but in the case of the AnnotationsStorage(s), each is an in-memory lookup of a value in the huge dict of the resource body; it does not go to the API.

The number of "writing" API requests usually explodes when those objects are stored back to Kubernetes — via PATCH requests, after their annotations were modified. This is why I propose an external storage: once Kopf has no need to patch the annotations, the "writing" requests will be gone too.


2. aiohttp.TCPConnector allows setting the limit parameter (documentation here). With that we can restrict the number of simultaneous requests with aiohttp makes.

Oh, I see now. Thanks for clarifying.

Yes, that can be made configurable. But keep in mind that it is a limit on the number of concurrent requests at any moment of time. It does not provide any timing limitations or priorities. As such, it can lead to 2 issues:

  1. Some watching streams will not start quickly, as they will be blocked by other requests or watching streams, so Kopf will remain blind to some resources.
  2. It does not protect against the server-side API rate limiting (429 errors), as all these concurrent requests can be fast, going one after another and still hitting the server limits.

Maybe separating the APIContext into 2 contexts (one for watch-streams, another for patches) would be a good idea — this would solve the 1st problem. But the solution is nevertheless partial.

What can be done is exposing APIContext subclassing to the developers somehow via settings, so that they can do whatever they want with the connection management.

The only objection in the past was that the API client is a hidden internal of Kopf; there were already 2 major changes of the API client in the past, and there could be more in the future (in theory). Exposing this class and binding it to aiohttp would make such settings forward-incompatible. But since active development of Kopf is not happening now (I have no time, and no ideas of what to do), this might be an option.

Here are my thoughts on this ;-) What do you think? Given a custom APIContext sub-class, will this help to implement API throttling?

@wochinge

wochinge commented Jan 8, 2025

The number of requests to storage is indeed high, but in the case of the AnnotationsStorage(s), each is an in-memory lookup of a value in the huge dict of the resource body; it does not go to the API.

Ah, I see. So once we create the watcher, we initially list all resources of the kind and then take their data to fill the in-memory annotation storage.


Some watching streams will not start quickly, as they will be blocked by other requests or watching streams, so Kopf will remain blind to some resources.

I think that's fine. It will catch up eventually.

It does not protect against the server-side API rate limiting (429 errors), as all these concurrent requests can be fast, going one after another and still hitting the server limits.

Yep, for that we (as in, the kopf users) should implement backoff with jitter.

What can be done is exposing APIContext subclassing to the developers somehow via settings, so that they can do whatever they want with the connection management.

That's even more than we need, but definitely the most powerful solution 👍🏻 The only ask I'd have: can we extract the aiohttp.ClientSession creation into a separate method, so that we can override just that specific method and don't need to reimplement the entire __init__?

@nolar
Owner

nolar commented Jan 8, 2025

Ah, I see. So once we create the watcher, we initially list all resources of the kind and then take their data to fill the in-memory annotation storage.

Correct.

That's even more than we need, but definitely the most powerful solution

The rationale is this: Kopf is a Kubernetes-specialized tool, not an HTTP client tool. Adding too many HTTP-specific nuances is undesired. The use-case above is very specific and, I'd say, rare (opinionated). So, instead of adding this functionality to Kopf, some hook should be exposed, allowing the users of Kopf to do all the sophisticated trickery specific to their cases & environments. And this is somewhere around APIContext, plus-minus.

So (thinking out loud), a few changes are needed:

  • Redefine APIContext so that its __init__() calls self._make_aiohttp_client() instead of self.session = aiohttp.ClientSession(…). Maybe also move some of the surroundings (headers, auth, etc.) into this new method.
  • Expose the context class to be overridden. Conceptually, there are 2 ways: as on-login handlers returning a credentials object (more dynamic); or as a factory/class in settings (more simple).
  • Use that class/factory in @authenticated(), in the line async for key, info, context in vault.extended(APIContext, 'contexts'): … — which would require the settings to be available there (via context vars, probably).
  • Add the setting itself to NetworkingSettings._api_context_factory or ._api_context_cls.
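The first bullet could look roughly like this — a hypothetical sketch of the proposed seam, not Kopf's actual code; the limit value is an arbitrary example:

```python
import asyncio

import aiohttp

class APIContext:
    # Hypothetical simplification of Kopf's internal APIContext.
    def __init__(self) -> None:
        self.session = self._make_aiohttp_client()

    def _make_aiohttp_client(self) -> aiohttp.ClientSession:
        # The seam to override: connectors, headers, auth would live here.
        return aiohttp.ClientSession()

class ThrottledAPIContext(APIContext):
    # A user-provided subclass capping concurrent connections.
    def _make_aiohttp_client(self) -> aiohttp.ClientSession:
        return aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=20))
```

A settings entry (e.g. the suggested `_api_context_cls`) would then point at `ThrottledAPIContext` so that the authentication machinery instantiates it instead of the default.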

Maybe the whole Vault can be redesigned in favour of simplicity: instead of providing a dynamic factory, just embed the factory as a constructor argument. Essentially, the "vault" is exactly a "connection pool" of APIContexts/sessions keyed by credentials, with some invalidation & re-authentication logic in it.

It seems somewhat complex, but not complicated. I might take a look in the following weeks, but no promise (as I said, no time).
