
Sentry background worker is chronically blocking async event loop when many exceptions are raised #2824

Open
cpvandehey opened this issue Mar 14, 2024 · 13 comments
Labels
Better async support Component: SDK Core Dealing with the core of the SDK Component: Transport Dealing with the transport

Comments

@cpvandehey

How do you use Sentry?

Self-hosted/on-premise

Version

1.40.6

Steps to Reproduce

Hello! And thanks for reading my ticket :)

The Python Sentry client is a synchronous client library retrofitted to fit the async model (by spinning off separate threads to avoid disrupting the event-loop thread -- see the background worker (1) for thread usage).

Under healthy conditions, the Sentry client doesn't need to make many web requests. However, if conditions become rocky and exceptions are frequently raised (caught or uncaught), the Sentry client can become an extreme drag on the app's event loop (assuming a high sample rate). This is due to the OS thread context switching that effectively pauses/blocks the event loop while other threads (i.e. the background worker (1)) run. This is not a recommended pattern (obviously) due to the cost of switching threads, but it can be useful for quickly/lazily retrofitting sync code.

Relevant flow - in short:
Every time an exception is raised (caught or uncaught) in my code and sampled, a web request is immediately made to dump the data to Sentry. Since Sentry's background worker is thread-based (1), this triggers a thread context switch and then a synchronous web request to dump the data to Sentry. When an application receives many exceptions in a short period of time, this becomes a context-switching nightmare.
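The flow above can be sketched roughly like this (a simplified illustration of the thread-based worker pattern, not the SDK's actual internals; the class and method names are mine):

```python
import queue
import threading

class ThreadedWorker:
    """Simplified stand-in for a thread-based background worker:
    each submission wakes a separate OS thread, so under a flood of
    events the event-loop thread keeps getting context-switched away."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, callback):
        # Called on the event-loop thread; wakes the worker thread.
        self._queue.put(callback)

    def _run(self):
        while True:
            callback = self._queue.get()
            if callback is None:  # shutdown sentinel
                break
            callback()  # e.g. a synchronous HTTP POST to Sentry

    def close(self):
        self._queue.put(None)
        self._thread.join()
```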

Suggestion:
In an ideal world, Sentry would asyncify its background worker to use a task (1), and its transport layer (2) would use aiohttp. I don't think this would be especially complex, but I could be wrong.

An immediate workaround could be more control over the background worker. If Sentry's background worker made web requests to dump data at configurable intervals, it would behave far more efficiently for event-loop apps. At the moment, the background worker always dumps exception data immediately. In my opinion, as long as Sentry flushes data at app exit, a 60-second timer for dumping data would alleviate most of the symptoms described above without ever losing data (albeit up to 60 seconds slower).
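A rough sketch of that workaround, assuming an asyncio app; `IntervalWorker`, `send_batch`, and the 60-second default are all hypothetical names and values for illustration, not SDK API:

```python
import asyncio

class IntervalWorker:
    """Buffers events and sends them in one batch every `interval`
    seconds, instead of making one request per event. `send_batch`
    is a stand-in for the real transport call."""

    def __init__(self, send_batch, interval=60.0):
        self._send_batch = send_batch
        self._interval = interval
        self._buffer = []
        self._task = None

    def start(self):
        self._task = asyncio.ensure_future(self._run())

    def submit(self, event):
        self._buffer.append(event)  # no I/O, no thread wake-up

    async def _run(self):
        while True:
            await asyncio.sleep(self._interval)
            await self._flush()

    async def _flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            await self._send_batch(batch)

    async def close(self):
        # Flush remaining events at shutdown so nothing is lost.
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        await self._flush()
```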

(1) -

class BackgroundWorker(object):

(2) -

response = self._pool.request(

Expected Result

I expect to have less thread context switching when using sentry.

Actual Result

I see a lot of thread context switching when there are high exception rates.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 2 Mar 14, 2024
@antonpirker
Member

Hey @cpvandehey ! Thanks for the great ticket!

@sentrivana
Contributor

Hey @cpvandehey, thanks for writing in. Definitely agree with you that our async support could use some improvements (see e.g. #1735, #2007, #2184, and multiple other issues).

Using an aiohttp client and an asyncio task both sound doable and would go a long way toward making the SDK more async friendly.

@sentrivana sentrivana added the Component: Transport Dealing with the transport label Mar 18, 2024
@antonpirker
Member

We could detect whether aiohttp is in the project and, based on that, enable the new async support automatically. (I haven't thought much about whether this could lead to problems, though.)

@sentrivana sentrivana added this to the Better async support milestone Mar 18, 2024
@cpvandehey
Author

cpvandehey commented Apr 29, 2024

Hey @sentrivana / @antonpirker , any update on the progress for this? Happy to help

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 2 Apr 29, 2024
@sentrivana
Contributor

Hey @cpvandehey, no news on this, but PRs are always welcome if you feel like giving this a shot.

@antonpirker antonpirker removed this from the Better async support milestone Jun 20, 2024
@cpvandehey
Author

I see the milestone for this task was removed. @antonpirker, should we still consider writing our own attempt?

@sentrivana
Contributor

Hey @cpvandehey, sorry for the confusion regarding the milestone. Previously we were (mis)using milestones to group issues together, but we have now decided to abandon that system. Nothing has changed priority-wise.

@cpvandehey
Author

Alright, I think I'm going to start implementing this. Stay tuned.

@cpvandehey
Author

cpvandehey commented Jul 31, 2024

Coming up for air after a few hours of tinkering. I realized a few things that I should discuss before proceeding:

  • BackgroundWorker is the most foundational class needing an async equivalent. Making this async friendly was fairly easy and I have this code ready -- naming it BackgroundWorkerTask for the time being. It will rely on the built-in asyncio.Queue instead of sentry_sdk._queue. I also threw away all the unnecessary locking logic, since async use cases are single-threaded.
  • Next level up the stack is the HttpTransport class. This layer references the BackgroundWorker object and stores it as an attribute called _worker. This is, for the most part, fairly straightforward to give an async equivalent. Only 4 methods access the _worker, so it is easy to create a BaseHttpTransport with all common functionality and two child classes, HttpTransport and HttpTransportAsync, that inherit from it. Each child would have specific methods that interact with the aforementioned worker in sync/async fashion.
  • The next level up is where the complexity rises. The Client class makes a call to make_transport and then holds a reference to it as self.transport, which is used to make all the underlying requests. Although this code is dense, it's fairly understandable and could be split into a parent class (BaseClient) for common functionality and 2 child classes (Client and AsyncClient) for the transport actions that would need to be awaited. Separately, but just as important, is a class named Monitor that uses threads to check in on a running BackgroundWorker. There would be a need for a MonitorAsync as well to check the health of BackgroundWorkerTask.
  • Lastly, Scope and Hub both seem to be the top level for configuring Client. Hub has a direct reference to the flush method that uses the client's flush method (which would need to become an async method). Scope has no direct way to flush; it only adds to the queue. I certainly could use some pointers on how we would configure 2 separate clients here.
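A minimal sketch of what the BackgroundWorkerTask idea from the first bullet could look like: an asyncio.Queue drained by a single consumer task, with no locks, since everything stays on the event-loop thread (illustrative only, not shipped SDK code):

```python
import asyncio

class BackgroundWorkerTask:
    """Async analogue of BackgroundWorker: an asyncio.Queue drained
    by one consumer task. No locking needed -- everything runs on
    the event-loop thread."""

    def __init__(self):
        self._queue = asyncio.Queue()
        self._task = asyncio.ensure_future(self._run())

    def submit(self, coro_func):
        # Cheap and non-blocking; never wakes another OS thread.
        self._queue.put_nowait(coro_func)

    async def _run(self):
        while True:
            coro_func = await self._queue.get()
            if coro_func is None:  # shutdown sentinel
                break
            try:
                await coro_func()  # e.g. an aiohttp POST to Sentry
            finally:
                self._queue.task_done()

    async def flush(self):
        # Wait until every submitted item has been processed.
        await self._queue.join()

    async def kill(self):
        self._queue.put_nowait(None)
        await self._task
```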

Exhales

Like most async integrations, this seems easy on the surface but ends up touching a lot of the code. I'm wondering whether I'm on the right track with what the Python Sentry folks want for this design. I would love for this to be collaborative and iterative. Let me know your thoughts on the approach above :)

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Jul 31, 2024
@antonpirker
Member

Hey @cpvandehey !

Thanks for this great issue and your motivation! You are right, our async game is currently not the best, and we should, and will, improve on it.

To your deep dive:

  • IMO everything you write about BackgroundWorker and HttpTransport makes sense.
  • The Client and the Scope will probably be a bit trickier, as you also noticed.
  • The Hub is deprecated and we will remove it in the next major, so we will not touch it ever again :-)

Currently we are in the middle of a big refactor, where we try to use OpenTelemetry (OTel) under the hood for our performance instrumentation.

We should not do the OTel and the async refactoring at the same time; that would lead to a lot of complexity and headaches.

So my proposal is that we first finish the OTel refactor and then tackle the async refactor.
The OTel refactor will probably still take a couple of months (like 2-3, not 10-12).
Do you think you can wait a while until we can get started on this?

As this is a huge task we should then create a milestone and split the task up in smaller chunks, that can be tackled by multiple people at the same time.

@cpvandehey
Author

Do you think you can wait a while until we get started with this?

yes

As this is a huge task we should then create a milestone and split the task up in smaller chunks, that can be tackled by multiple people at the same time.

sounds good!

@cpvandehey
Author

Hey Sentry folks.

Currently we are in the middle of doing a big refactor, where we try to use OpenTelementry (OTel) under the hood for our performance instrumentation.

Just bumping this ticket again. I assume the repo is in a better state to start this effort?

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Sep 23, 2024
@BYK
Member

BYK commented Sep 23, 2024

Hi @cpvandehey, thanks for the bump! I do agree the repo is in better shape. Moreover, I've started working on an experimental HTTP/2 transport using httpcore, which looks like it has native async support for that part.

Since I'm lending some of my time to the Python SDK nowadays and working in a similar area, I think we can work together on the async support too.

I don't think I'm as well-versed as you are when it comes to async in Python, so I could use some of the code you say you've already written, such as the async background worker (though I wonder whether we actually need one with async: all we need is a queue, and the event loop should handle the worker logic, right?).
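That "no dedicated worker" idea might look something like this sketch, where each captured event becomes its own task and a semaphore caps concurrent sends, so the event loop itself plays the worker's role (all names here are hypothetical, not SDK API; `send` stands in for an async transport call):

```python
import asyncio

class TaskTransport:
    """Schedules each captured event as an asyncio task instead of
    queueing it for a dedicated worker; a semaphore limits how many
    sends run concurrently."""

    def __init__(self, send, max_concurrency=10):
        self._send = send
        self._semaphore = asyncio.Semaphore(max_concurrency)
        self._pending = set()

    def capture(self, event):
        task = asyncio.ensure_future(self._guarded_send(event))
        self._pending.add(task)
        # Drop finished tasks so the set doesn't grow unboundedly.
        task.add_done_callback(self._pending.discard)

    async def _guarded_send(self, event):
        async with self._semaphore:
            await self._send(event)

    async def flush(self):
        # Wait for every in-flight send to complete.
        if self._pending:
            await asyncio.gather(*self._pending)
```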

Anyway, this is hopefully coming and your involvement is much appreciated!

sentrivana added a commit that referenced this issue Oct 4, 2024
All our ingest endpoints support HTTP/2, and some even HTTP/3, which are significantly more efficient than HTTP/1.1 thanks to multiplexing, header compression, connection reuse, and 0-RTT TLS.

This patch adds an experimental HTTP2Transport with the help of the httpcore library. It makes minimal changes to the original HTTPTransport. That said, with httpcore we should be able to implement asyncio support easily and remove the worker logic (see #2824).

This should also open the door for future HTTP/3 support (see encode/httpx#275).


---------

Co-authored-by: Ivana Kellyer <[email protected]>