
Increased CPU and memory overhead while profiling #3199

Closed
InterstellarStella opened this issue Mar 6, 2024 · 36 comments

@InterstellarStella

Package

Sentry.AspNetCore

.NET Flavor

.NET

.NET Version

.NET 7 / .NET 8

OS

Any (not platform specific)

SDK Version

4.1.2

Self-Hosted Sentry Version

No response

Steps to Reproduce

Enable profiling through the following setup:

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    options.TracesSampleRate = 0.1;
    options.ProfilesSampleRate = 0.1;
    options.Environment = sentryConfig.Environment;
    options.DiagnosticLevel = diagnosticLevel;
    options.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));
});

Expected Result

The CPU overhead is about +5 to 10%.

Actual Result

The CPU overhead is around +70%, as observed through the Google Cloud monitoring system.
The following Sentry packages are used:
Sentry 4.1.2
Sentry.AspNetCore 4.1.2
Sentry.Profiling 4.1.2

Slack conversation
Zendesk ticket

@bruno-garcia
Member

cc @vaind

Could we have some details about what the app is doing? With a repro we can look at exactly what's going on and make improvements, but with the currently provided information it's hard for us to figure out where the problem is.

Is this a web server? How many requests are running in parallel?

If this has high throughput, try changing the sample rates to:

    options.TracesSampleRate = 0.001;
    options.ProfilesSampleRate = 0.001;

Does that reduce the CPU overhead considerably?

@saschash

saschash commented Mar 7, 2024

@bruno-garcia I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU).
In fact, it was about the same as with a sampling rate of 1.0, which is weird...
The behaviour is that it first jumps to 60% and then stabilises at 40%.
It is a web server, but in a sandbox environment, and we are running 20 workers in parallel.

I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed calls like Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.
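
If the Hangfire background jobs turn out to be feeding transactions (and therefore profiles) into the SDK, one way to narrow this down is to decide sampling per transaction instead of using a flat rate. A minimal sketch, assuming the Sentry .NET TracesSampler callback and assuming the ASP.NET Core middleware marks server transactions with the "http.server" operation:

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    // Sample only transactions that come from real HTTP requests; anything else
    // (e.g. work started from a background job) is dropped. The "http.server"
    // operation name is an assumption here.
    options.TracesSampler = samplingContext =>
        samplingContext.TransactionContext.Operation == "http.server" ? 0.1 : 0.0;
    options.ProfilesSampleRate = 0.1;
});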

@bruno-garcia
Member

I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU).
In fact, it was about the same as with a sampling rate of 1.0, which is weird...

This didn't come up during our tests, so I'm a bit surprised, but I wonder if it might be the overhead of the .NET profiler itself, which unfortunately samples at a high frequency; we downsample after the fact.

I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed calls like Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.

I wonder if this is to get the transaction name automatically.

Do you get the overhead without profiling turned on? Just keep transactions (TracesSampleRate set to some value).

By any chance, in the profiles in Sentry, do you see any frames from Sentry that seem to be taking up a significant part of the time?
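
For that isolation test, a minimal sketch based on the options from the original report: keep tracing on and leave every profiling-related line commented out.

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    // Tracing stays enabled so the comparison is "profiling vs. no profiling",
    // not "tracing vs. no tracing".
    options.TracesSampleRate = 0.1;
    // Profiling disabled for this run:
    // options.ProfilesSampleRate = 0.1;
    // options.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));
});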

@saschash

Thanks @bruno-garcia. Without profiling it works like a charm, without any noticeable overhead.
I could not find anything from Sentry taking a lot of time; in fact, I couldn't see any Sentry frames at all, which is weird, right? I should be able to see a Sentry-related process there...

@bitsandfoxes changed the title from "[.NET Profiling] Increased CPU and Memory overhead" to "Increased CPU and memory overhead while profiling" Mar 11, 2024
@bruno-garcia
Member

Sorry for the delay here. Any chance you can get a repro to help us debug this?
@vaind any ideas to further investigate?

@vaind
Collaborator

vaind commented Mar 19, 2024

I've tried reproducing this locally with our aspnetcore sample and ab but haven't seen any significant overhead from enabling profiling.

I've launched the example with dotnet run -c Release on this branch and then tested various ab configurations. E.g. with ab -c 10 -n 1000 (10 concurrent requests, 1000 total requests), the CPU usage I'm seeing on my PC, with or without profiling, is about 60%, and the average response time is 80 ms without profiling and about 82 ms when profiling 100% of transactions.

A reproducible example would be of great help here, or any more details on what is actually happening. Maybe you can run dotnet-trace manually (with Sentry profiling disabled) and see if that has any effect on the CPU usage?

@saschash

saschash commented Mar 24, 2024

Thanks @vaind. I will try to reproduce it on a minimal-functionality branch; maybe I will be able to catch it...

@haneytron

haneytron commented May 23, 2024

I just ran into this issue as well. Last night I installed the Sentry profiler. With profiling enabled at 0.2 and tracing at 0.2, my app, which normally runs at 5-10% CPU, was redlining at 100%, to the point where HTTP requests were timing out and I was getting 502 gateway errors and health check alerts.

I disabled the profiler (keeping only tracing) and things returned to normal. There's definitely something up with the .NET Core profiler.

My app is a web app, .NET Core 8, hosted on Azure, uses standard MVC + some SignalR for live data feeds. Backed by MSSQL.

[image]

In the above image you can pinpoint where I deployed the profiler, and also where I just turned it off.

Here's my Sentry config:

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    sentryConfig.Dsn = "<REDACTED>";
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    //sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    //sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        //TimeSpan.FromMilliseconds(500)
    //));
});

@haneytron

No change from disabling all HostedService instances either. I'm going to disable profiling and see what the baseline CPU is.

@bitsandfoxes
Contributor

Thanks for the continued testing and updates, @haneytron.

@haneytron

So removing the profiling drops CPU back to normal. I think the best thing I can try next is just to deploy a very basic app (literally nothing but a hello world in MVC) to the same slot and see what happens.

[image]

@haneytron

I've hacked my app down to literally the bare bones, a single route that just shows a 404 page, and the issues persist. Here are some interesting graphs from today:

Graphs

CPU - you can see pretty clearly where I first fired it up, and also where I disabled profiling, then later re-enabled it.

[image]

Memory - about double the usual working set when profiling is active.

[image]

Gen 0, 1, 2 GC are not exciting or out of the ordinary.

[image]

Server Errors: Ignore the initial spike, I had things configured incorrectly. 0 errors all day.

[image]

Total (HTTP) requests - again not exciting. Not a prod site instance, only traffic is me randomly playing.

[image]

Thread Count: not exciting, fairly stable.

[image]

Average Response Time - VERY exciting!

Consistently low with profiling off, 30+ second bursts when on.

[image]

What's Next

Two things next.

1. I'm gonna enable it on my prod site again, right now.

I'll leave it until it explodes or 24 hours go by, and we'll see whether things are working properly OR the issue actually remains and these graphs from the Azure Portal are garbage. Honestly, they might be; perf counters are funky at best in my experience.

I'll report back tomorrow with the results!

2. You can try replicating my setup

You could also try to replicate my app and hosting env:

.NET Core 8 (MVC Web App)
Azure hosted, app service, Linux, B1 Basic plan (~$13 USD/month), US East 2 region
For this demo I manually published, no CI/CD

Here is the entirety of my current Program.cs:

using CodeSession;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Sentry.Profiling;
using System;

var builder = WebApplication.CreateBuilder(args);

var isDev = builder.Environment.IsDevelopment();

builder.Host.ConfigureServices((ctx, services) =>
{
    // Enable MVC
    services.AddControllersWithViews(options =>
    {
        options.SuppressAsyncSuffixInActionNames = true;
        options.EnableEndpointRouting = true;
    });
    if (!isDev)
    {
        services.AddHttpsRedirection(options =>
        {
            // Prevents an error that occurs in the TRACE log and reports to Azure infrequently, seemingly at app startup
            options.HttpsPort = 443;
        });
    }
});

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    // For Sentry config later on
    var sentryOptions = new SentryOptions();
    ctx.Configuration.GetSection("SentrySettings").Bind(sentryOptions);

    sentryConfig.Dsn = sentryOptions.Dsn;
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        TimeSpan.FromMilliseconds(500)
    ));
});

// Build it
var app = builder.Build();

// Redirect from HTTP to HTTPS
app.UseHttpsRedirection();

app.UseRouting();

// Default to 404
app.MapFallbackToController(action: "NotFound", controller: "Error");

// Run it
await app.RunAsync();

@haneytron

Welp, bad news this AM. Multiple 502 and 503 errors (with NOTHING logged to console, app logs, or HTTP logs in Azure in terms of details), and multiple app restarts (as a result of my health checks I think) as well.

500 errors, starting when I deployed last night just after my last update:

[image]

Here are Google Sheets of the Console and HTTP logs for the last 24 hours, limited to the last 1000 logs, with all PII scrubbed:

https://docs.google.com/spreadsheets/d/1dqrHKF8DrBACtgoaQYxb7H59-8FO7SQhdYfVHogaZNI/edit#gid=940056293
https://docs.google.com/spreadsheets/d/1Wb9ZyWwiMUUplf6vOyNb37s7V6W5Q-Lvcj1YIErP2ek/edit#gid=869357884

Note that all of the 503, 502, and 499 errors happen on the SignalR endpoints (/session/live/* routes), so my best guess is that SignalR / WebSocket usage is an issue with the profiler.

Let me know how I can be of further help. Disabling profiling for now.

@vaind
Collaborator

vaind commented Jun 5, 2024

Thanks for all the updates. Any chance you could try running the dotnet-trace profiler (with Sentry's built-in profiling disabled at the time) on your app and see how that influences CPU usage? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace#collect-a-trace-with-dotnet-trace

@haneytron

@vaind ran a trace for almost 3 hours, ending just now:

root@b4ee333c5f1c:~/site/wwwroot# dotnet-trace collect --profile cpu-sampling -p 106 -o /dev/null

Provider Name                           Keywords            Level               Enabled By
Microsoft-DotNETCore-SampleProfiler     0x0000F00000000000  Informational(4)    --profile 
Microsoft-Windows-DotNETRuntime         0x00000014C14FCCBD  Informational(4)    --profile 

Process        : /usr/share/dotnet/dotnet
Output File    : /dev/null
[00:02:50:37]   Recording trace 0.00     (B)
Press <Enter> or <Ctrl+C> to exit...     (B)
Stopping the trace. This may take several minutes depending on the application being traced.

Trace completed.

(output to /dev/null to avoid file storage issues and I/O penalties)

The resulting CPU was high, but this run was also profiling at 100%:

[image]

Response times remained good (the longer ones are incorrect measurements on long-lived WebSocket SignalR connections):

[image]

No 5xx or 499 errors were thrown during this period:

[image]

The average memory working set was constant with the profiler active (no 2x jump like with the Sentry profiler):

[image]

Console / App / HTTP logs are clean. No app service restarts since I turned off the Sentry profiler yesterday afternoon.

I suspect the SignalR / WebSocket issues with the Sentry profiler remain.

@vaind
Collaborator

vaind commented Jun 7, 2024

I've identified at least one memory leak in PerfView's EventPipe processing, where it calls GC.KeepAlive() on objects so they never get freed. However, I can still see the memory growing, although not on the managed side, so there may be something else going on as well.
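
For illustration only, not PerfView's actual code: the general shape of the kind of leak described here is a pinning handle allocated per buffer and never released, so every buffer stays alive and pinned and the working set grows with event volume.

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class PinningLeakSketch
{
    // Handles stashed "for later" and never freed -- the hypothetical bug.
    static readonly List<GCHandle> Handles = new();

    static void OnEvent(byte[] eventBuffer)
    {
        // Pin the buffer so raw pointers into it remain valid...
        var handle = GCHandle.Alloc(eventBuffer, GCHandleType.Pinned);
        Handles.Add(handle);
        // Missing: handle.Free() once the event has been fully consumed.
    }

    static void Main()
    {
        for (var i = 0; i < 100_000; i++)
            OnEvent(new byte[1024]); // ~100 MB of buffers that can never be collected or compacted
        Console.WriteLine($"{Handles.Count} buffers still pinned");
    }
}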

@vaind
Collaborator

vaind commented Jun 27, 2024

TLDR: not resolved yet

I'll post a PR decreasing the circular buffer we use when setting up a diagnostic session, but other than that, the main problem seems to be object pinning in PerfView, and changing that would require changing quite a few things. In short, EventPipeSource doesn't copy the incoming buffer but instead maps structs (EventPipeEventHeader) directly onto the buffer (unsafe, but OK if pinned). To amplify the problem, parts of these structs are then referenced directly (as pointers), and it's not clear to me whether objects containing these copied pointers live beyond the event handlers (it seems like they would). I'm not sure about the scope of the changes needed to make this work without pinning (or without leaking pinning handles while still staying safe), but I have a feeling it would require a lot of effort.
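
A rough sketch of the pattern being described, for illustration only: the header layout below is made up (it is not the real EventPipeEventHeader), and the point is simply that overlaying a struct on the raw buffer only stays safe while the buffer is pinned, so any pointer that escapes the handler forces the buffer to remain pinned. Compile with AllowUnsafeBlocks.

using System;
using System.Runtime.InteropServices;

// Hypothetical header layout, for illustration only.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FakeEventHeader
{
    public int EventId;     // offset 0
    public long Timestamp;  // offset 4
    public int PayloadSize; // offset 12
}

static class StructOverBufferSketch
{
    // Maps the struct directly onto the raw buffer instead of copying it out.
    static unsafe void Parse(byte[] buffer)
    {
        fixed (byte* p = buffer)
        {
            var header = (FakeEventHeader*)p;
            Console.WriteLine($"event {header->EventId}, payload {header->PayloadSize} bytes");
            // Dereferencing (copying the values) here is fine; storing `p` or
            // `header` for use after this fixed block is not, unless the buffer
            // stays pinned for the lifetime of that pointer.
        }
    }

    static void Main()
    {
        var buffer = new byte[64];
        BitConverter.GetBytes(42).CopyTo(buffer, 0);          // EventId
        BitConverter.GetBytes(1234567890L).CopyTo(buffer, 4); // Timestamp
        BitConverter.GetBytes(16).CopyTo(buffer, 12);         // PayloadSize
        Parse(buffer);
    }
}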

@vaind
Collaborator

vaind commented Jul 24, 2024

FYI, I've raised an issue over at dotnet/runtime#105132 for the unmanaged memory growth

@bruno-garcia
Member

bruno-garcia commented Oct 1, 2024

@filipnavara has made a fix related to a leak.

I've updated our fork to include the fix. We'll bump the submodule here and make a release.

@vaind
Collaborator

vaind commented Oct 2, 2024

According to my manual testing (running a sample app from #3645 for hours), the memory leak seems to be resolved by #3492.

@bruno-garcia
Member

Woo! Thank you @vaind and @filipnavara

I'm closing this issue as resolved.

If you have any issues with our profiling offering, please raise a ticket. We'd like to make sure this is working properly for everyone!

@getsentry getsentry locked as resolved and limited conversation to collaborators Oct 3, 2024