Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very high CPU usage during NullReferenceException translation phase #61583

Closed
pongba opened this issue Nov 15, 2021 · 4 comments
Closed

Very high CPU usage during NullReferenceException translation phase #61583

pongba opened this issue Nov 15, 2021 · 4 comments

Comments

@pongba
Copy link

pongba commented Nov 15, 2021

Hi CoreCLR team,

We recently ran into an issue where try-catching NullReferenceException in a tight for-loop, on all CPU cores, led to severe contention in a spinlock usage in g_SavedExceptionInfo.Enter().

CLR uses windows vectored exception handling to translate access violation to ~0x0000000000000000 into managed exception (NullReferenceException) which then becomes catchable by managed code. During the translation phase, an auxiliary global object g_SavedExceptionInfo is used to pass the exception record from vectored exception handler to FixContextForFaultingExceptionFrame which eventually releases the lock.

When many threads are frequently triggering NullReferenceException, g_SavedExceptionInfo.Enter() will have severe contention, burning an extraordinary amount of CPU time.

Two questions here:

  1. Why a global object g_SavedExceptionInfo is used, instead of a thread_local object, which won't need lock to protect, hence contention-free.
  2. Why a critical section (that will spin first) is used in g_SavedExceptionInfo.Enter()? If a spin-less lock mechanism is used, the CPU churn won't happen. I understand spin lock is faster in most circumstances, but it's extraordinarily slow & expensive in some corner cases, which, in backend world, sort of happens as a routine. Not using spinlock here makes the code slower, but since we're already throwing exception (therefore not exactly on the happy path), being a tiny bit slower shouldn't be a big deal?
@dotnet-issue-labeler dotnet-issue-labeler bot added area-ExceptionHandling-coreclr untriaged New issue has not been triaged by the area owner labels Nov 15, 2021
@Clockwork-Muse
Copy link
Contributor

Your premise is misguided, because:

  1. In the general case you shouldn't be throwing exceptions often in the first place.
  2. NullReferenceException indicates a bug in code that needs to be fixed (Especially in a hot loop). Either find and fix whatever is causing the NRE to be thrown, or bug whoever wrote it to fix it.

Given that:
2. You're only going to have lock contention in extremely rare cases, normally - the spinlock is there for safety concerns, normally the lock won't be awaited at all.

@pongba
Copy link
Author

pongba commented Nov 16, 2021

Your premise is misguided, because:

  1. In the general case you shouldn't be throwing exceptions often in the first place.

Yes. But there's a difference between "when it does happen, cpu cores are killed" and "when it does happen, the system still survives". The difference here is important for online services.

  1. NullReferenceException indicates a bug in code that needs to be fixed (Especially in a hot loop). Either find and fix whatever is causing the NRE to be thrown, or bug whoever wrote it to fix it.

Agreed. But there's still a difference of what consequences a bug can cause. Current CoreCLR implementation of NullReferenceException can lead to huge CPU churn, while conceivable alternative implementation, while not changing the essence of the bug, does not.

Given that: 2. You're only going to have lock contention in extremely rare cases, normally - the spinlock is there for safety concerns, normally the lock won't be awaited at all.

extremely rare cases are what backend engineers deals with routinely, unfortunately. So anything that can help with the bottomline when extreme cases happen, is very meaningful.

@Clockwork-Muse
Copy link
Contributor

extremely rare cases are what backend engineers deals with routinely, unfortunately. So anything that can help with the bottomline when extreme cases happen, is very meaningful.

... yeah, and normally they deal with it by figuring out what is causing the exceptions in the first place, because that usually indicates a bug (or that the code should be switched to a TryXXX version). Your ask here is to treat the symptom of the problem (a potential slowdown), but the actual problem is that you're throwing so many concurrent exceptions. It's like that old joke;

A man goes to a doctor. "Doctor, my finger hurts when I slam it in the door." "Well, stop doing that".

Not that improvements aren't made. But due to a number of design decisions exceptions in C# are relatively "expensive" - even if this one area is changed (and it yields a benefit), you may still have performance issues until you stop throwing so many exceptions.

Yes. But there's a difference between "when it does happen, cpu cores are killed" and "when it does happen, the system still survives". The difference here is important for online services.

CPU churn is the least of your worries. NPE (along with several other exceptions) are "halt-and-catch-fire" errors - there's no reasonable way to automatically recover, and depending on where it's being thrown from your application could be in a generally unsafe state. Not in the "The VM thinks memory is corrupt" sense, but in the "this object is missing a required field" sense, which could be worse. So stop throwing them.

@janvorli janvorli added area-ExceptionHandling-coreclr and removed area-ExceptionHandling-coreclr untriaged New issue has not been triaged by the area owner labels Jun 14, 2022
@janvorli janvorli added this to the 7.0.0 milestone Jun 14, 2022
@janvorli
Copy link
Member

This should be fixed by #70428. The g_savedExceptionInfo and the spinlock is gone.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants