Very high CPU usage during NullReferenceException translation phase #61583

pongba · 2021-11-15T02:49:14Z

Hi CoreCLR team,

We recently ran into an issue where try-catching NullReferenceException in a tight for-loop, on all CPU cores, led to severe contention in a spinlock usage in g_SavedExceptionInfo.Enter().

CLR uses windows vectored exception handling to translate access violation to ~0x0000000000000000 into managed exception (NullReferenceException) which then becomes catchable by managed code. During the translation phase, an auxiliary global object g_SavedExceptionInfo is used to pass the exception record from vectored exception handler to FixContextForFaultingExceptionFrame which eventually releases the lock.

When many threads are frequently triggering NullReferenceException, g_SavedExceptionInfo.Enter() will have severe contention, burning an extraordinary amount of CPU time.

Two questions here:

Why a global object g_SavedExceptionInfo is used, instead of a thread_local object, which won't need lock to protect, hence contention-free.
Why a critical section (that will spin first) is used in g_SavedExceptionInfo.Enter()? If a spin-less lock mechanism is used, the CPU churn won't happen. I understand spin lock is faster in most circumstances, but it's extraordinarily slow & expensive in some corner cases, which, in backend world, sort of happens as a routine. Not using spinlock here makes the code slower, but since we're already throwing exception (therefore not exactly on the happy path), being a tiny bit slower shouldn't be a big deal?

Clockwork-Muse · 2021-11-15T03:22:58Z

Your premise is misguided, because:

In the general case you shouldn't be throwing exceptions often in the first place.
NullReferenceException indicates a bug in code that needs to be fixed (Especially in a hot loop). Either find and fix whatever is causing the NRE to be thrown, or bug whoever wrote it to fix it.

Given that:
2. You're only going to have lock contention in extremely rare cases, normally - the spinlock is there for safety concerns, normally the lock won't be awaited at all.

pongba · 2021-11-16T22:59:03Z

Your premise is misguided, because:

In the general case you shouldn't be throwing exceptions often in the first place.

Yes. But there's a difference between "when it does happen, cpu cores are killed" and "when it does happen, the system still survives". The difference here is important for online services.

NullReferenceException indicates a bug in code that needs to be fixed (Especially in a hot loop). Either find and fix whatever is causing the NRE to be thrown, or bug whoever wrote it to fix it.

Agreed. But there's still a difference of what consequences a bug can cause. Current CoreCLR implementation of NullReferenceException can lead to huge CPU churn, while conceivable alternative implementation, while not changing the essence of the bug, does not.

Given that: 2. You're only going to have lock contention in extremely rare cases, normally - the spinlock is there for safety concerns, normally the lock won't be awaited at all.

extremely rare cases are what backend engineers deals with routinely, unfortunately. So anything that can help with the bottomline when extreme cases happen, is very meaningful.

Clockwork-Muse · 2021-11-17T06:14:51Z

extremely rare cases are what backend engineers deals with routinely, unfortunately. So anything that can help with the bottomline when extreme cases happen, is very meaningful.

... yeah, and normally they deal with it by figuring out what is causing the exceptions in the first place, because that usually indicates a bug (or that the code should be switched to a TryXXX version). Your ask here is to treat the symptom of the problem (a potential slowdown), but the actual problem is that you're throwing so many concurrent exceptions. It's like that old joke;

A man goes to a doctor. "Doctor, my finger hurts when I slam it in the door." "Well, stop doing that".

Not that improvements aren't made. But due to a number of design decisions exceptions in C# are relatively "expensive" - even if this one area is changed (and it yields a benefit), you may still have performance issues until you stop throwing so many exceptions.

Yes. But there's a difference between "when it does happen, cpu cores are killed" and "when it does happen, the system still survives". The difference here is important for online services.

CPU churn is the least of your worries. NPE (along with several other exceptions) are "halt-and-catch-fire" errors - there's no reasonable way to automatically recover, and depending on where it's being thrown from your application could be in a generally unsafe state. Not in the "The VM thinks memory is corrupt" sense, but in the "this object is missing a required field" sense, which could be worse. So stop throwing them.

janvorli · 2022-07-11T14:40:29Z

This should be fixed by #70428. The g_savedExceptionInfo and the spinlock is gone.

dotnet-issue-labeler bot added area-ExceptionHandling-coreclr untriaged New issue has not been triaged by the area owner labels Nov 15, 2021

janvorli added area-ExceptionHandling-coreclr and removed area-ExceptionHandling-coreclr untriaged New issue has not been triaged by the area owner labels Jun 14, 2022

janvorli added this to the 7.0.0 milestone Jun 14, 2022

janvorli closed this as completed Jul 11, 2022

ghost locked as resolved and limited conversation to collaborators Aug 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very high CPU usage during NullReferenceException translation phase #61583

Very high CPU usage during NullReferenceException translation phase #61583

pongba commented Nov 15, 2021 •

edited

Loading

Clockwork-Muse commented Nov 15, 2021

pongba commented Nov 16, 2021

Clockwork-Muse commented Nov 17, 2021

janvorli commented Jul 11, 2022

Very high CPU usage during NullReferenceException translation phase #61583

Very high CPU usage during NullReferenceException translation phase #61583

Comments

pongba commented Nov 15, 2021 • edited Loading

Clockwork-Muse commented Nov 15, 2021

pongba commented Nov 16, 2021

Clockwork-Muse commented Nov 17, 2021

janvorli commented Jul 11, 2022

pongba commented Nov 15, 2021 •

edited

Loading