
Increased CPU and memory overhead while profiling #3199

Closed
InterstellarStella opened this issue Mar 6, 2024 · 36 comments

@InterstellarStella

Package

Sentry.AspNetCore

.NET Flavor

.NET

.NET Version

.NET 7 / .NET 8

OS

Any (not platform specific)

SDK Version

4.1.2

Self-Hosted Sentry Version

No response

Steps to Reproduce

Enable profiling through the following setup:

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    options.TracesSampleRate = 0.1;
    options.ProfilesSampleRate = 0.1;
    options.Environment = sentryConfig.Environment;
    options.DiagnosticLevel = diagnosticLevel;
    options.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));
});

Expected Result

The CPU overhead is about +5 to 10%.

Actual Result

The CPU overhead is around +70%, as observed through the Google Cloud monitoring system.
The following Sentry packages are used:
Sentry 4.1.2
Sentry.AspNetCore 4.1.2
Sentry.Profiling 4.1.2

Slack conversation
Zendesk ticket

@bruno-garcia
Member

cc @vaind

Could we have some details about what the app is doing? With a repro we can look at exactly what's going on and make improvements, but with the currently provided information it's hard for us to figure out where the problem is.

Is this a web server? How many requests are running in parallel?

If this has high throughput, try changing the sample rates to:

    options.TracesSampleRate = 0.001;
    options.ProfilesSampleRate = 0.001;

Does that reduce the CPU overhead considerably?

@saschash

saschash commented Mar 7, 2024

@bruno-garcia I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU).
In fact, it was about the same as with a sampling rate of 1.0, which is weird...
The behaviour is that it first jumps to 60% and then stabilises at 40%.
It is a web server, but in a sandbox environment, and we are running 20 workers in parallel.

I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed calls like Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.
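
If the Hangfire background jobs turn out to be feeding transactions (and therefore profiles) into the SDK, one way to narrow this down is to decide sampling per transaction instead of using a flat rate. A minimal sketch, assuming the Sentry .NET TracesSampler callback and assuming the ASP.NET Core middleware marks server transactions with the "http.server" operation:

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    // Sample only transactions that come from real HTTP requests; anything else
    // (e.g. work started from a background job) is dropped. The "http.server"
    // operation name is an assumption here.
    options.TracesSampler = samplingContext =>
        samplingContext.TransactionContext.Operation == "http.server" ? 0.1 : 0.0;
    options.ProfilesSampleRate = 0.1;
});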

@bruno-garcia
Member

I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU).
In fact, it was about the same as with a sampling rate of 1.0, which is weird...

This didn't come up during our tests, so I'm a bit surprised, but I wonder if it might be the overhead of the .NET profiler itself, which unfortunately samples at a high frequency; we downsample after the fact.

I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed calls like Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.

I wonder if this is to get the transaction name automatically.

Do you get the overhead without profiling turned on? Just keep transactions (TracesSampleRate set to some value).

By any chance, in the profiles in Sentry, do you see any frames from Sentry that seem to be taking up a significant part of the time?
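
For that isolation test, a minimal sketch based on the options from the original report: keep tracing on and leave every profiling-related line commented out.

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    // Tracing stays enabled so the comparison is "profiling vs. no profiling",
    // not "tracing vs. no tracing".
    options.TracesSampleRate = 0.1;
    // Profiling disabled for this run:
    // options.ProfilesSampleRate = 0.1;
    // options.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));
});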

@saschash

Thanks @bruno-garcia. Without profiling it works like a charm, without any noticeable overhead.
I could not find anything from Sentry taking a lot of time; in fact, I couldn't see any Sentry frames at all, which is weird, right? I should be able to see a Sentry-related process there...

@bitsandfoxes changed the title from "[.NET Profiling] Increased CPU and Memory overhead" to "Increased CPU and memory overhead while profiling" Mar 11, 2024
@bruno-garcia
Member

Sorry for the delay here. Any chance you can get a repro to help us debug this?
@vaind any ideas to further investigate?

@vaind
Collaborator

vaind commented Mar 19, 2024

I've tried reproducing this locally with our aspnetcore sample and ab but haven't seen any significant overhead from enabling profiling.

I've launched the example with dotnet run -c Release on this branch and then tested various ab configurations. E.g. with ab -c 10 -n 1000 (10 concurrent requests, 1000 total requests), the CPU usage I'm seeing on my PC, with or without profiling, is about 60%, and the average response time is 80 ms without profiling and about 82 ms when profiling 100% of transactions.

A reproducible example would be of great help here, or any more details on what is actually happening. Maybe you can run dotnet-trace manually (with Sentry profiling disabled) and see if that has any effect on the CPU usage?

@saschash

saschash commented Mar 24, 2024

Thanks @vaind. I will try to reproduce it on a minimal-functionality branch; maybe I will be able to catch it...

@haneytron

haneytron commented May 23, 2024

I just ran into this issue as well. Last night I installed the Sentry profiler. With profiling enabled at 0.2 and tracing at 0.2, my app, which normally runs at 5-10% CPU, was redlining at 100%, to the point where HTTP requests were timing out and I was getting 502 gateway errors and health check alerts.

I disabled the profiler (keeping only tracing) and things returned to normal. There's definitely something up with the .NET Core profiler.

My app is a web app, .NET Core 8, hosted on Azure, uses standard MVC + some SignalR for live data feeds. Backed by MSSQL.

[image]

In the above image you can pinpoint where I deployed the profiler, and also where I just turned it off.

Here's my Sentry config:

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    sentryConfig.Dsn = "<REDACTED>";
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    //sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    //sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        //TimeSpan.FromMilliseconds(500)
    //));
});

@haneytron

No change from disabling all HostedService instances either. I'm going to disable profiling and see what the baseline CPU is.

@bitsandfoxes
Contributor

Thanks for the continued testing and updates, @haneytron.

@haneytron

So removing the profiling drops CPU back to normal. I think the best thing I can try next is just to deploy a very basic app (literally nothing but a hello world in MVC) to the same slot and see what happens.

[image]

@haneytron

I've hacked my app down to literally the bare bones, a single route that just shows a 404 page, and the issues persist. Here are some interesting graphs from today:

Graphs

CPU - you can see pretty clearly where I first fired it up, and also where I disabled profiling, then later re-enabled it.

[image]

Memory - about double the usual working set when profiling is active.

[image]

Gen 0, 1, 2 GC are not exciting or out of the ordinary.

[image]

Server Errors: Ignore the initial spike, I had things configured incorrectly. 0 errors all day.

[image]

Total (HTTP) requests - again not exciting. Not a prod site instance, only traffic is me randomly playing.

[image]

Thread Count: not exciting, fairly stable.

[image]

Average Response Time - VERY exciting!

Consistently low with profiling off, 30+ second bursts when on.

[image]

What's Next

Two things next.

1. I'm gonna enable it on my prod site again, right now.

I'll leave it until it explodes or 24 hours go by, and we'll see whether things are working properly OR the issue actually remains and these graphs from the Azure Portal are garbage. Honestly, they might be; perf counters are funky at best in my experience.

I'll report back tomorrow with the results!

2. You can try replicating my setup

You could also try to replicate my app and hosting env:

.NET Core 8 (MVC Web App)
Azure hosted, app service, Linux, B1 Basic plan (~$13 USD/month), US East 2 region
For this demo I manually published, no CI/CD

Here is the entirety of my current Program.cs:

using CodeSession;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Sentry.Profiling;
using System;

var builder = WebApplication.CreateBuilder(args);

var isDev = builder.Environment.IsDevelopment();

builder.Host.ConfigureServices((ctx, services) =>
{
    // Enable MVC
    services.AddControllersWithViews(options =>
    {
        options.SuppressAsyncSuffixInActionNames = true;
        options.EnableEndpointRouting = true;
    });
    if (!isDev)
    {
        services.AddHttpsRedirection(options =>
        {
            // Prevents an error that occurs in the TRACE log and reports to Azure infrequently, seemingly at app startup
            options.HttpsPort = 443;
        });
    }
});

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    // For Sentry config later on
    var sentryOptions = new SentryOptions();
    ctx.Configuration.GetSection("SentrySettings").Bind(sentryOptions);

    sentryConfig.Dsn = sentryOptions.Dsn;
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        TimeSpan.FromMilliseconds(500)
    ));
});

// Build it
var app = builder.Build();

// Redirect from HTTP to HTTPS
app.UseHttpsRedirection();

app.UseRouting();

// Default to 404
app.MapFallbackToController(action: "NotFound", controller: "Error");

// Run it
await app.RunAsync();

@haneytron

Welp, bad news this AM. Multiple 502 and 503 errors (with NOTHING logged to console, app logs, or HTTP logs in Azure in terms of details), and multiple app restarts (as a result of my health checks I think) as well.

500 errors, starting when I deployed last night just after my last update:

[image]

Here are Google Sheets of the Console and HTTP logs for the last 24 hours, limited to the last 1000 logs, with all PII scrubbed:

https://docs.google.com/spreadsheets/d/1dqrHKF8DrBACtgoaQYxb7H59-8FO7SQhdYfVHogaZNI/edit#gid=940056293
https://docs.google.com/spreadsheets/d/1Wb9ZyWwiMUUplf6vOyNb37s7V6W5Q-Lvcj1YIErP2ek/edit#gid=869357884

Note that all of the 503, 502, and 499 errors happen on the SignalR endpoints (/session/live/* routes), so my best guess is that SignalR / WebSocket usage is an issue with the profiler.

Let me know how I can be of further help. Disabling profiling for now.

@vaind
Collaborator

vaind commented Jun 5, 2024

Thanks for all the updates. Any chance you could try running the dotnet-trace profiler (with Sentry's built-in profiling disabled at the time) on your app and see how that influences CPU usage? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace#collect-a-trace-with-dotnet-trace

@haneytron

@vaind ran a trace for almost 3 hours, ending just now:

root@b4ee333c5f1c:~/site/wwwroot# dotnet-trace collect --profile cpu-sampling -p 106 -o /dev/null

Provider Name                           Keywords            Level               Enabled By
Microsoft-DotNETCore-SampleProfiler     0x0000F00000000000  Informational(4)    --profile 
Microsoft-Windows-DotNETRuntime         0x00000014C14FCCBD  Informational(4)    --profile 

Process        : /usr/share/dotnet/dotnet
Output File    : /dev/null
[00:02:50:37]   Recording trace 0.00     (B)
Press <Enter> or <Ctrl+C> to exit...     (B)
Stopping the trace. This may take several minutes depending on the application being traced.

Trace completed.

(output to /dev/null to avoid file storage issues and I/O penalties)

The resulting CPU was high, but this run was also profiling at 100%:

[image]

Response times remained good (the longer ones are incorrect measurements on long-lived WebSocket SignalR connections):

[image]

No 5xx or 499 errors were thrown during this period:

[image]

The average memory working set was constant with the profiler active (no 2x jump like with the Sentry profiler):

[image]

Console / App / HTTP logs are clean. No app service restarts since I turned off the Sentry profiler yesterday afternoon.

I suspect the SignalR / WebSocket issues with the Sentry profiler remain.

@vaind
Collaborator

vaind commented Jun 7, 2024

I've identified at least one memory leak in PerfView's EventPipe processing, where it calls GC.KeepAlive() on objects so they never get freed. However, I can still see the memory growing, although not on the managed side, so there may be something else going on as well.
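
For illustration only, not PerfView's actual code: the general shape of the kind of leak described here is a pinning handle allocated per buffer and never released, so every buffer stays alive and pinned and the working set grows with event volume.

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class PinningLeakSketch
{
    // Handles stashed "for later" and never freed -- the hypothetical bug.
    static readonly List<GCHandle> Handles = new();

    static void OnEvent(byte[] eventBuffer)
    {
        // Pin the buffer so raw pointers into it remain valid...
        var handle = GCHandle.Alloc(eventBuffer, GCHandleType.Pinned);
        Handles.Add(handle);
        // Missing: handle.Free() once the event has been fully consumed.
    }

    static void Main()
    {
        for (var i = 0; i < 100_000; i++)
            OnEvent(new byte[1024]); // ~100 MB of buffers that can never be collected or compacted
        Console.WriteLine($"{Handles.Count} buffers still pinned");
    }
}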

@vaind
Collaborator

vaind commented Jun 27, 2024

TLDR: not resolved yet

I'll post a PR decreasing the circular buffer we use when setting up a diagnostic session, but other than that, the main problem seems to be object pinning in PerfView, and changing that would require changing quite a few things. In short, EventPipeSource doesn't copy the incoming buffer but instead maps structs (EventPipeEventHeader) directly onto the buffer (unsafe, but OK if pinned). To amplify the problem, parts of these structs are then referenced directly (as pointers), and it's not clear to me whether objects containing these copied pointers live beyond the event handlers (it seems like they would). I'm not sure about the scope of the changes needed to make this work without pinning (or without leaking pinning handles while still staying safe), but I have a feeling it would require a lot of effort.
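
A rough sketch of the pattern being described, for illustration only: the header layout below is made up (it is not the real EventPipeEventHeader), and the point is simply that overlaying a struct on the raw buffer only stays safe while the buffer is pinned, so any pointer that escapes the handler forces the buffer to remain pinned. Compile with AllowUnsafeBlocks.

using System;
using System.Runtime.InteropServices;

// Hypothetical header layout, for illustration only.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FakeEventHeader
{
    public int EventId;     // offset 0
    public long Timestamp;  // offset 4
    public int PayloadSize; // offset 12
}

static class StructOverBufferSketch
{
    // Maps the struct directly onto the raw buffer instead of copying it out.
    static unsafe void Parse(byte[] buffer)
    {
        fixed (byte* p = buffer)
        {
            var header = (FakeEventHeader*)p;
            Console.WriteLine($"event {header->EventId}, payload {header->PayloadSize} bytes");
            // Dereferencing (copying the values) here is fine; storing `p` or
            // `header` for use after this fixed block is not, unless the buffer
            // stays pinned for the lifetime of that pointer.
        }
    }

    static void Main()
    {
        var buffer = new byte[64];
        BitConverter.GetBytes(42).CopyTo(buffer, 0);          // EventId
        BitConverter.GetBytes(1234567890L).CopyTo(buffer, 4); // Timestamp
        BitConverter.GetBytes(16).CopyTo(buffer, 12);         // PayloadSize
        Parse(buffer);
    }
}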

@vaind
Collaborator

vaind commented Jul 24, 2024

FYI, I've raised an issue over at dotnet/runtime#105132 for the unmanaged memory growth

@bruno-garcia
Member

bruno-garcia commented Oct 1, 2024

@filipnavara has made a fix related to a leak.

I've updated our fork to include the fix. We'll bump the submodule here and make a release.

@vaind
Collaborator

vaind commented Oct 2, 2024

According to my manual testing (running a sample app from #3645 for hours), the memory leak seems to be resolved by #3492.

@bruno-garcia
Member

Woo! Thank you @vaind and @filipnavara

I'm closing this issue as resolved.

If you have any issues with our profiling offering, please raise a ticket. We'd like to make sure this is working properly for everyone!

@getsentry getsentry locked as resolved and limited conversation to collaborators Oct 3, 2024