SentryEvent memory leak with large exception object graphs #2516

Closed
danports opened this issue Jul 27, 2023 · 16 comments · Fixed by #3355
Labels
Bug Something isn't working

Comments

@danports

Package

Sentry

.NET Flavor

.NET Core

.NET Version

7.0

OS

Linux

SDK Version

3.34.0

Self-Hosted Sentry Version

No response

Steps to Reproduce

I'm having an issue where a process in which I've enabled the Sentry Microsoft.Extensions.Logging integration is consistently leaking memory. The issue seems to be most obvious in cases where the SentryEvent exception instance references a large object graph - say, a Microsoft.EntityFrameworkCore.DbUpdateConcurrencyException that references a DbContext with a large number of entities loaded.

Expected Result

I would expect the SentryEvent instances to be processed/sent/etc. and eventually freed.

Actual Result

Based on the process memory consumption (and memory dumps) it seems like the SentryEvent instances are never freed:
[screenshot omitted]

I'm wondering if the Sentry SDK is trying to serialize and send the DbUpdateConcurrencyException along with everything in the associated DbContext, fails due to limitations on event size, and ends up retrying forever.

A typical gcroot from dotnet-dump analyze looks like this:

          -> 7fa86fe3bc78     System.Threading.Thread
          -> 7fa8421108a0     System.Threading.ExecutionContext
          -> 7fa842110870     System.Threading.AsyncLocalValueMap+TwoElementAsyncLocalValueMap
          -> 7fa8420ab168     System.Collections.Generic.KeyValuePair<Sentry.Scope, Sentry.ISentryClient>[]
          -> 7fa84209c4e0     Sentry.SentryClient
          -> 7fa84209c5f0     Sentry.Internal.BackgroundWorker
          -> 7fa8420a72f8     System.Collections.Concurrent.ConcurrentQueue<Sentry.Protocol.Envelopes.Envelope>
          -> 7fa8420a7338     System.Collections.Concurrent.ConcurrentQueueSegment<Sentry.Protocol.Envelopes.Envelope>
          -> 7fa8420a7420     System.Collections.Concurrent.ConcurrentQueueSegment<Sentry.Protocol.Envelopes.Envelope>+Slot[]
          -> 7fa8663d47d0     Sentry.Protocol.Envelopes.Envelope
          -> 7fa8663d4668     System.Collections.Generic.List<Sentry.Protocol.Envelopes.EnvelopeItem>
          -> 7fa8663d4798     Sentry.Protocol.Envelopes.EnvelopeItem[]
          -> 7fa8663d4778     Sentry.Protocol.Envelopes.EnvelopeItem
          -> 7fa8663d4760     Sentry.Protocol.Envelopes.JsonSerializable
          -> 7fa866bde340     Sentry.SentryEvent
          -> 7fa866bdc618     Microsoft.EntityFrameworkCore.DbUpdateConcurrencyException
          -> 7fa866bdc8f0     System.Collections.Generic.List<Microsoft.EntityFrameworkCore.ChangeTracking.EntityEntry>
          -> 7fa866bdc930     Microsoft.EntityFrameworkCore.ChangeTracking.EntityEntry[]
          -> 7fa866bdc910     Microsoft.EntityFrameworkCore.ChangeTracking.EntityEntry
          -> 7fa87a64b3a8     Microsoft.EntityFrameworkCore.ChangeTracking.Internal.InternalEntityEntry
          -> 7fa87967c120     Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager
          -> 7fa8796723d0     <MyApplicationDbContext>
@danports danports added Bug Something isn't working Platform: .NET labels Jul 27, 2023
@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 Jul 27, 2023
@bitsandfoxes
Contributor

Hey @danports and thanks for raising this!
Could you help me out by creating a repro so I can test this locally?

@getsantry getsantry bot moved this from Waiting for: Product Owner to Waiting for: Community in GitHub Issues with 👀 Jul 27, 2023
@danports
Author

Sure thing: https://github.com/danports/MemoryLeakTest

To run the repro project, you'll need to supply a valid DSN and MySQL connection string in Program.cs. (If the latter requirement is onerous, I could set up a Docker Compose file or some such you could run to launch a MySQL instance before launching the repro project.)

If you run the repro project as is, you'll see memory usage increase almost monotonically and indefinitely (though it will increase slowly for the first minute or so while it is creating all of the initial database entries); however, if you comment out the AddSentry registration, you'll see memory usage remain stable around 1-1.5GB, which suggests to me that the Sentry SDK is indeed leaking memory somehow.
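
For reference, the Sentry wiring in a setup like the repro's looks roughly like this (a sketch only; the DSN, host shape, and comments are placeholders rather than the actual Program.cs):

    using Microsoft.Extensions.Hosting;
    using Microsoft.Extensions.Logging;

    var builder = Host.CreateApplicationBuilder(args);

    // This is the registration referred to above; commenting it out is the toggle
    // that makes memory usage flatten out in the repro.
    builder.Logging.AddSentry(options =>
    {
        options.Dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"; // placeholder DSN
        options.Debug = true; // surfaces the SDK's own diagnostic logs
    });

    // Quartz, EF Core/MySQL, and the failing job would be registered here.

    await builder.Build().RunAsync();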

I'm wondering if the Sentry SDK is trying to serialize and send the DbUpdateConcurrencyException along with everything in the associated DbContext, fails due to limitations on event size, and ends up retrying forever.

I think my initial theory is incorrect, since based on the Sentry debug logs, the posted event isn't all that large, and it is ingested correctly by the Sentry backend. So I'm not sure what the issue is - perhaps Sentry is hanging on to posted events for some reason?

@bitsandfoxes
Contributor

Huge thanks for the repro! We'll look into this.

@danports
Author

It looks like this may be a memory leak internal to the ConcurrentQueue used by Sentry's BackgroundWorker - as you can see in the screenshot below, the queue is empty, yet there are slots in its internals still holding references to the envelopes that were already successfully submitted:
[screenshot omitted]
So it seems like this may actually be a runtime issue. What I don't yet fully understand is why the combination of Sentry and Quartz.NET is necessary to reproduce this issue (see: quartznet/quartznet#2099 (comment)) - my guess is that the combination of the two results in a specific series of operations that triggers the leak in ConcurrentQueue whereas without Quartz.NET that doesn't happen.

It looks like the runtime issue is mono/mono#19665 with a potential fix that was rejected as causing other problems: dotnet/runtime#48296. I'm not sure if there is a way for Sentry to work around this issue, or if we'd have to wait for a solution to the runtime problem.
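
If that runtime behavior is indeed the culprit, it can be demonstrated in isolation with a small console program like the one below (illustrative only - byte arrays stand in for envelopes; the assumption is that TryPeek marks the queue's current segment as "preserved for observation", after which dequeued slots are no longer cleared):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;

    var queue = new ConcurrentQueue<byte[]>();
    var weakRefs = new List<WeakReference>();

    for (var i = 0; i < 16; i++)
    {
        var payload = new byte[1024 * 1024]; // stand-in for a large envelope
        weakRefs.Add(new WeakReference(payload));
        queue.Enqueue(payload);
    }

    queue.TryPeek(out _); // comment this line out and the payloads become collectible

    while (queue.TryDequeue(out _)) { } // the queue is now logically empty

    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    // With the TryPeek in place, the dequeued arrays typically remain reachable
    // via the queue's internal segment slots.
    Console.WriteLine($"Payloads still reachable: {weakRefs.Count(w => w.IsAlive)} of {weakRefs.Count}");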

@danports
Author

It looks like I may be able to work around this problem to some degree on my end. When the TestJob throws a DbUpdateConcurrencyException, EF Core logs it, then Quartz.NET catches it and throws a JobExecutionException, which also gets logged. I believe the fact that both exceptions are logged back-to-back may be what triggers the memory leak in the runtime. When I modify TestJob to catch the DbUpdateConcurrencyException, wrap it in a JobExecutionException, and throw that, the memory leak seems to go away. It's possible that wrapping the exception changes the code path in a way that avoids triggering the ConcurrentQueue memory leak. I have to admit I'm not completely satisfied with that answer (how exactly is that code path different??), but it is nice to have a workaround for now.
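
In code, the workaround amounts to something like this (a sketch only - TestJob and MyApplicationDbContext are the repro's placeholder names, and SaveChangesAsync stands in for whatever database work triggers the concurrency conflict):

    using System.Threading.Tasks;
    using Microsoft.EntityFrameworkCore;
    using Quartz;

    // Placeholder context from the repro; the real one has entities, MySQL config, etc.
    public class MyApplicationDbContext : DbContext
    {
        public MyApplicationDbContext(DbContextOptions<MyApplicationDbContext> options)
            : base(options) { }
    }

    public class TestJob : IJob
    {
        private readonly MyApplicationDbContext _db;

        public TestJob(MyApplicationDbContext db) => _db = db;

        public async Task Execute(IJobExecutionContext context)
        {
            try
            {
                await _db.SaveChangesAsync(context.CancellationToken);
            }
            catch (DbUpdateConcurrencyException ex)
            {
                // Catch and wrap here, rather than letting EF Core and Quartz each
                // log the raw exception back-to-back; this avoided the leak for me.
                throw new JobExecutionException(ex);
            }
        }
    }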

@danports
Author

danports commented Aug 4, 2023

The workaround I described earlier has been effective in my use case, so I'll leave it to you guys to decide whether it's worth working around this in the Sentry codebase (e.g. by avoiding use of the leaky ConcurrentQueue structure in the first place). I'm guessing you don't have many customers running into this problem, or perhaps others have run into it but haven't traced it back to the root cause.

@jamescrosswell
Collaborator

I'll leave it to you guys to decide whether it's worth working around this in the Sentry codebase (e.g. by avoiding use of the leaky ConcurrentQueue structure in the first place)

Depending on what functionality we're using from ConcurrentQueue, it might be easy to avoid using it.

@cliedeman

I think I may have encountered a similar, if not the same, issue.

This is from an Azure heap dump (generation 2 heap):
[screenshot omitted]

The values are in bytes, so that's approximately 31 MB of Sentry spans.

@Dominent

Dominent commented May 3, 2024

Sentry, this is not OK - the bug is causing us to disable Sentry. Is there any news on when this will be fixed?

And it's mainly related to:

[screenshot omitted]

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 2 May 3, 2024
@jamescrosswell
Collaborator

@Dominent are you disabling Sentry or just the profiling integration? Are you also seeing a memory leak and, if so, are you also using Quartz and MySQL?

@jamescrosswell
Collaborator

jamescrosswell commented May 8, 2024

The workaround I described earlier has been effective in my use case, so I'll leave it to you guys to decide whether it's worth working around this in the Sentry codebase (e.g. by avoiding use of the leaky ConcurrentQueue structure in the first place).

Looking into this further, I can reproduce the problem and, even though in theory it should only exist on Mono, we're seeing it in net7.0. To double check, I hacked together a ConcurrentQueueLite which, despite all its performance shortcomings, does not have the same memory leak problem.

Options at this stage:

  1. Follow up on the issue in the dotnet/runtime repository (see comment in that thread)
  2. Possibly as a (hopefully temporary) workaround, we could build a ConcurrentQueueLite implementation that we would be happy with in production. Given how we're using this in the BackgroundWorker, we might get away with a fairly simple implementation that would be fit for our purposes (see the sketch below).
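
For illustration, the "fairly simple implementation" mentioned in option 2 could be little more than a Queue<T> behind a lock - the sketch below is illustrative only, not the actual Sentry implementation:

    using System.Collections.Generic;
    using System.Diagnostics.CodeAnalysis;

    // Minimal lock-based stand-in for ConcurrentQueue<T>: once an item is dequeued,
    // the collection no longer references it, which is the property we care about here.
    internal sealed class ConcurrentQueueLite<T>
    {
        private readonly Queue<T> _queue = new();

        public int Count
        {
            get { lock (_queue) return _queue.Count; }
        }

        public void Enqueue(T item)
        {
            lock (_queue) _queue.Enqueue(item);
        }

        public bool TryDequeue([MaybeNullWhen(false)] out T item)
        {
            lock (_queue) return _queue.TryDequeue(out item);
        }

        public bool TryPeek([MaybeNullWhen(false)] out T item)
        {
            lock (_queue) return _queue.TryPeek(out item);
        }
    }

This trades some throughput for the guarantee that dequeued envelopes are no longer rooted by the queue, which seems acceptable for a queue capped at a few dozen items.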

@jamescrosswell
Collaborator

jamescrosswell commented May 8, 2024

On another (perhaps unrelated) note: I can also create a memory leak if I change the implementation of the ConcurrentQueueLite.TryPeek to be:

    public bool TryPeek([NotNullWhen(true)] out T? item)
    {
        throw new NotImplementedException();
    }

Initially this made me think the memory leak might occur any time an exception was thrown and caught here. However, if I put a breakpoint there it never gets triggered when using ConcurrentQueue<T> (rather than the intentionally broken ConcurrentQueueLite<T>), so I don't think this is directly related.

What was interesting, though, is that I can "fix" this memory leak if, when a fatal exception does occur, I empty the queue and refuse to accept any new items on the queue... which is kind of suspicious. It implies the memory leak may be related to references that are held by envelopes in the queue.

It's not immediately apparent how or why that would work, since we have a maximum of 30 items in the queue and they should all be immutable... excepting perhaps the original Exceptions that we're wrapping. The items in the queue are all envelopes containing references to a Microsoft.EntityFrameworkCore.DbUpdateConcurrencyException, which holds all kinds of stuff (change tracking objects and the like). So maybe that's where the memory leak is coming from in both scenarios.
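
For what it's worth, the "drain on fatal error" experiment described above amounts to something like the following (purely illustrative - FaultingWorker and its members are made-up names, not the real BackgroundWorker):

    using System.Collections.Concurrent;

    // Hypothetical shape of the experiment: once a fatal error occurs, stop accepting
    // envelopes and drop any that are still queued, so the queue no longer roots
    // their exception object graphs.
    internal sealed class FaultingWorker<TEnvelope>
    {
        private readonly ConcurrentQueue<TEnvelope> _queue = new();
        private volatile bool _faulted;

        public bool EnqueueEnvelope(TEnvelope envelope)
        {
            if (_faulted)
            {
                return false; // drop instead of queueing
            }

            _queue.Enqueue(envelope);
            return true;
        }

        public void OnFatalError()
        {
            _faulted = true;
            while (_queue.TryDequeue(out _)) { } // release references to queued envelopes
        }
    }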

@github-project-automation github-project-automation bot moved this to Done in GDX May 9, 2024
@mrhocuong

This issue was reported as fixed in version 4.6.0, but I still see the same problem after upgrading the packages.
I removed Sentry from my application because it was eating all of the memory.
I'm waiting for a proper fix before bringing it back.

@bitsandfoxes
Contributor

Hey @mrhocuong, could you possibly give us some more information about how you use Sentry and what issues you see with memory consumption?
It's a bit hard to act on "it was eating all of the memory" alone.

@ampossardt

I can also confirm that this issue is still happening for me: memory rises continuously with profiling enabled in my project.

I'm using version 4.8.0 of both the SDK and the profiling package. Removing the profiling integration seems to clear up the memory issues immediately.

o.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));

@bitsandfoxes bitsandfoxes reopened this Jul 1, 2024
@bitsandfoxes
Contributor

Thanks for reaching out and letting us know!
This issue is about something else and unrelated to profiling. Please see #3199 for additional context.

@getsentry getsentry locked as resolved and limited conversation to collaborators Jul 1, 2024
Labels
Bug Something isn't working
Projects
Status: Done
Archived in project
Development

Successfully merging a pull request may close this issue.

8 participants