Guard members of MonoType union & fix related bugs #111645

Open
wants to merge 34 commits into
base: main

Member

@kg kg commented Jan 20, 2025

There is a class of issue where we fail to properly check the type of a MonoType before accessing the members of its data union, which can lead to type confusion or null pointer dereference bugs.

This PR adds dedicated access helpers for every member of the union that do a type check, renames the union to make direct accesses obvious, and removes almost all direct uses of the union members.

The process of adding the access helpers and auditing related code revealed multiple bugs; I've either fixed those bugs in this PR or added FIXME comments describing the bug.

For accesses that were obviously correct (I could see a type check right above the accessor call) I used the _unchecked variant to optimize out the type check.
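As a rough illustration of the checked/unchecked split described above, here is a minimal sketch. `DemoType`, `DemoClass`, the `data__` member name, and every `demo_*` function are hypothetical stand-ins invented for this example; they are not the real `MonoType` definitions or the accessors added in this PR.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real Mono definitions. */
typedef enum { DEMO_TYPE_CLASS, DEMO_TYPE_VAR } DemoTypeKind;

typedef struct DemoClass { const char *name; } DemoClass;

typedef struct DemoType {
    DemoTypeKind kind;
    /* Renamed member so that any direct access stands out in review. */
    union {
        DemoClass *klass;   /* valid only for DEMO_TYPE_CLASS */
        int param_num;      /* valid only for DEMO_TYPE_VAR */
    } data__;
} DemoType;

/* Checked accessor: aborts on a tag mismatch instead of reading the wrong member. */
static DemoClass *
demo_type_get_class (const DemoType *type)
{
    if (type->kind != DEMO_TYPE_CLASS) {
        fprintf (stderr, "demo_type_get_class: unexpected kind %d\n", (int)type->kind);
        abort ();
    }
    return type->data__.klass;
}

/* Unchecked accessor: the caller must already have tested type->kind. */
static inline DemoClass *
demo_type_get_class_unchecked (const DemoType *type)
{
    return type->data__.klass;
}

/* The switch case is the guard here, so the unchecked variant is safe. */
static const char *
demo_type_name (const DemoType *type)
{
    switch (type->kind) {
    case DEMO_TYPE_CLASS:
        return demo_type_get_class_unchecked (type)->name;
    default:
        return "<not a class>";
    }
}

/* Usage: build a class-typed value and read it back through both paths. */
static void
demo_usage (void)
{
    DemoClass klass = { "Foo" };
    DemoType type = { .kind = DEMO_TYPE_CLASS, .data__ = { .klass = &klass } };
    assert (demo_type_get_class (&type) == &klass);
    assert (demo_type_name (&type) == klass.name);
}
```

The point of the sketch is only the shape: one accessor per union member, with the tag test living inside the accessor unless the call site visibly performs it.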

As a follow-up later on in the .NET 10 cycle we can remove the type check from these accessors to make them more efficient if we need to. I don't think we'll need to, though.

Member Author

kg commented Jan 21, 2025

The tracing\eventpipe\gcdump\gcdump\gcdump.cmd test failure is a real bug in Mono.

@@ -1296,6 +1296,92 @@ m_type_is_byref (const MonoType *type)
return type->byref__;
}

MONO_NEVER_INLINE void
Member

If we are going to have this in production, then I believe we should adopt a pattern similar to the one we use for MONO_CLASS_DEF_PRIVATE: either extend that or make a similar thing for MonoType. That way we would run all these additional checks when building with ENABLE_CHECKED_BUILD_PRIVATE_TYPES (we should have CI legs doing that), but use inline wrappers with zero overhead in release builds.

Member Author

I would prefer to do actual full checks in Release, at least during the preview period, unless benchmarks show that it meaningfully regresses performance. That way we flush out any remaining issues of this type.

Member

I would still add the same support for types as we have for classes, so that a MONO_TYPE_DEF_PRIVATE under ENABLE_CHECKED_BUILD_PRIVATE_TYPES catches code that bypasses the getters on CI. As part of that we could also support both full checks and zero-overhead wrappers. We would initially opt into full checks during preview (and monitor the perf hit during that time), keep the option to move back to zero-overhead wrappers closer to release, and keep the full checks running on CI builds with ENABLE_CHECKED_BUILD_PRIVATE_TYPES.
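A minimal sketch of what such a private-definition pattern could look like, under assumptions: `DEMO_CHECKED_BUILD_PRIVATE_TYPES` and `DEMO_TYPE_DEF_PRIVATE` are hypothetical flags named after the ones discussed here, and `DemoType` is an invented stand-in rather than the real `MonoType`. In the checked configuration the struct layout is hidden from ordinary files, so code that bypasses the getter fails to compile against an incomplete type.

```c
#include <assert.h>

typedef enum { DEMO_TYPE_CLASS, DEMO_TYPE_VAR } DemoTypeKind;

/* Every translation unit sees the opaque name. */
typedef struct DemoType DemoType;

/*
 * Both flags are hypothetical. Normal builds see the layout; in the checked
 * configuration only the file that defines DEMO_TYPE_DEF_PRIVATE does.
 */
#if !defined(DEMO_CHECKED_BUILD_PRIVATE_TYPES) || defined(DEMO_TYPE_DEF_PRIVATE)
#define DEMO_TYPE_LAYOUT_VISIBLE 1
struct DemoType {
    DemoTypeKind kind;
    union { void *klass; int param_num; } data__;
};
#endif

#ifdef DEMO_CHECKED_BUILD_PRIVATE_TYPES
/* Checked build: out-of-line getter, implemented once with a full tag check
 * in the file that defines DEMO_TYPE_DEF_PRIVATE. */
void *demo_type_get_class (const DemoType *type);
#else
/* Release build: zero-overhead inline wrapper. */
static inline void *
demo_type_get_class (const DemoType *type)
{
    return type->data__.klass;
}
#endif

#ifdef DEMO_TYPE_LAYOUT_VISIBLE
/* Only code that can see the layout may build a DemoType by hand. */
static void
demo_usage (void)
{
    int payload = 0;
    DemoType type = { .kind = DEMO_TYPE_CLASS, .data__ = { .klass = &payload } };
    assert (demo_type_get_class (&type) == &payload);
}
#endif
```

Compiled with neither flag set, the block behaves like a release build. The checked configuration would additionally need the out-of-line definition in a separate file, which is omitted here.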

Member Author

kg commented Jan 22, 2025

I went through and audited all the union accesses I could find and changed the ones that were blatantly and correctly guarded to _unchecked, which should remove a lot of overhead while also giving us a way to add debug/checked guards if we want them later on.

I found some more bugs (mostly around SZARRAY handling) and added FIXME comments if I couldn't trivially fix them. I will probably try to fix all the FIXMEs I've found so far.

Member Author

kg commented Jan 24, 2025

The AOT compiler failure isn't generating usable stacks for some reason, so I'll have to try to reproduce it locally in Debug configuration and hopefully get stacks there. I'm not sure whether the AOT compiler lacking stack traces when it crashes is intentional; I'm surprised by it.

EDIT: Lesson learned: if you fix a bug in mono_allocate_stack_slots, the function mono_allocate_stack_slots2 probably has the same bug and you should fix it there too.

Member Author

kg commented Jan 25, 2025

Build linux-x64 Release AllSubsets_Mono_LLVMFULLAOT_RuntimeTests llvmfullaot and its sibling look hung, but maybe they're just normally slow, or they're very slow because of the changes in this PR? I'm not familiar with these lanes and I can't tell whether the stdout on both of them stops at 1999 by coincidence or because azdo is truncating the actual output.

@kg kg marked this pull request as ready for review January 28, 2025 02:06
Member Author

kg commented Jan 28, 2025

I took a middle ground between @lateralusX's suggestion and my personal preference, by renaming the union but not having a configuration where direct union access is allowed. Having two modes would have complicated the code a lot, so I decided not to do it yet. If we decide we need direct access for some reason in a wider set of code I can do the work to enable it, but I think it's better to enforce use of the getters and setters everywhere.

Most uses of the getters ended up being unchecked because they were obviously guarded by a switch case or an if, so I expect the performance impact to be minor in practice. If this leads to unacceptable performance regressions my plan is to just disable the checks by default in release builds as we approach the later previews.

My audit revealed a handful of places where it wasn't obvious whether the union was being used safely beforehand, but also wasn't immediately obvious that it was being used incorrectly. As a result I couldn't "fix" them easily and left them guarded, so customers may hit failures in their applications if those locations turn out to be bugs. One example is a function whose name indicates it manipulates generic parameters, but which accesses the union as if it held a generic parameter without doing a type check first.

There's one bare use of the raw union still left in the code because the union member was being passed to some sort of macro/function as an lvalue, and I couldn't come up with a straightforward fix for that.

Something I'd love to do is have even the _unchecked accessors do checks in debug builds, just in case my audits were incorrect and I applied _unchecked somewhere I shouldn't have. But I don't know if it's worth the trouble, so I didn't bother doing it. Would love to hear people's thoughts on that.

Member

@lateralusX lateralusX left a comment

I've gone through all the changes, mainly commenting on more places that could use unchecked instead of checked getters, since the type is already checked before the call to the getter.

At first, I was a little reluctant to split into checked/unchecked, because it could introduce bugs: unchecked could be used in the wrong places, or a later refactoring could invalidate the assumption that justified unchecked in a piece of code. Having one implementation with full checks enabled in debug/checked builds running on all our internal testing, and a zero-cost implementation in release, would have mitigated this issue. If we want to keep some of the checks in release builds, then we do need to split into two functions like you have in this PR, but then I believe it is crucial to make the unchecked version behave like the checked version in debug/checked builds so we can catch uses of the unchecked version under false assumptions.

src/mono/mono/component/debugger-agent.c (outdated, resolved)
src/mono/mono/component/debugger-agent.c (outdated, resolved)
src/mono/mono/component/debugger-agent.c (outdated, resolved)
src/mono/mono/component/debugger-agent.c (outdated, resolved)
src/mono/mono/component/debugger-agent.c (outdated, resolved)
src/mono/mono/metadata/sre.c (outdated, resolved)
src/mono/mono/metadata/sre.c (outdated, resolved)
src/mono/mono/metadata/sre.c (outdated, resolved)
src/mono/mono/mini/method-to-ir.c (outdated, resolved)
src/mono/mono/mini/mini-generic-sharing.c (outdated, resolved)
Member Author

kg commented Jan 31, 2025

> I've gone through all the changes, mainly commenting on more places that could use unchecked instead of checked getters, since the type is already checked before the call to the getter.
>
> At first, I was a little reluctant to split into checked/unchecked, because it could introduce bugs: unchecked could be used in the wrong places, or a later refactoring could invalidate the assumption that justified unchecked in a piece of code. Having one implementation with full checks enabled in debug/checked builds running on all our internal testing, and a zero-cost implementation in release, would have mitigated this issue. If we want to keep some of the checks in release builds, then we do need to split into two functions like you have in this PR, but then I believe it is crucial to make the unchecked version behave like the checked version in debug/checked builds so we can catch uses of the unchecked version under false assumptions.

Thanks for the detailed review!
I will make the change so that the unchecked versions are still checked in debug/checked builds.
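One way this could look, sketched with standard `assert` so that the guard in the unchecked variant disappears when `NDEBUG` is defined. The `Demo*` names are hypothetical, and the actual change may well use Mono's own assertion macros and build flags rather than `assert`/`NDEBUG`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real Mono definitions. */
typedef enum { DEMO_TYPE_CLASS, DEMO_TYPE_VAR } DemoTypeKind;

typedef struct DemoType {
    DemoTypeKind kind;
    union { void *klass; int param_num; } data__;
} DemoType;

/* Checked variant: the guard stays in every build configuration. */
static int
demo_type_get_param_num (const DemoType *type)
{
    if (type->kind != DEMO_TYPE_VAR)
        abort ();
    return type->data__.param_num;
}

/* Unchecked variant: still verified in debug builds, free when NDEBUG is set. */
static inline int
demo_type_get_param_num_unchecked (const DemoType *type)
{
    assert (type->kind == DEMO_TYPE_VAR);
    return type->data__.param_num;
}

/* A caller whose own test is what makes the unchecked variant legitimate. */
static int
demo_param_num_or (const DemoType *type, int fallback)
{
    if (type->kind == DEMO_TYPE_VAR)
        return demo_type_get_param_num_unchecked (type);
    return fallback;
}
```

If a later refactoring removes the `if` in a caller like `demo_param_num_or`, a debug build would trip the assertion instead of silently reading the wrong union member.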

I forgot to mention that I decided not to optimize a few modules with unchecked: sre, metadata, and debugger-agent. They don't seem performance-critical enough to be worth the effort/risk. But since you already audited them (sorry!) I'll apply the changes there.

Member Author

kg commented Feb 5, 2025

Looks like one of the new set of changes broke something. Maybe the bug fix in that async bit, not sure... will have to dig in and bisect.

Member Author

kg commented Feb 5, 2025

I've spent a while trying and can't reproduce any of these CI failures locally, so I'm going to have to bisect using CI...

EDIT: Looks like it was one of the last two commits of changes, thankfully.

@kg kg force-pushed the guarded-monotype branch from 171c74a to 4c6df61 Compare February 5, 2025 20:19
Member Author

kg commented Feb 5, 2025

I think the cause of the CI crashes, and the reason I couldn't reproduce them, was uninitialized stack memory. I found two places where we allocate a MonoType on the stack and don't zero the whole thing; we just fill in the fields we care about and try to use it. I accidentally removed part of that initialization in a corner case, which I think exposed a sufficiently trashed pointer to cause us to dereference random addresses elsewhere.
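A small sketch of the defensive pattern this suggests: zero the whole stack-allocated struct before filling in individual fields. `DemoType` and `demo_type_init_class` are hypothetical names for illustration only, not the real `MonoType` or any Mono helper.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the real Mono definitions. */
typedef enum { DEMO_TYPE_VAR, DEMO_TYPE_CLASS } DemoTypeKind;

typedef struct DemoType {
    DemoTypeKind kind;
    unsigned attrs;
    union { void *klass; int param_num; } data__;
} DemoType;

/* Zero every byte first, then set only the fields this caller cares about.
 * Fields that are never assigned (attrs here) end up 0 instead of whatever
 * happened to be on the stack. */
static void
demo_type_init_class (DemoType *out, void *klass)
{
    assert (out != NULL);
    memset (out, 0, sizeof (*out));
    out->kind = DEMO_TYPE_CLASS;
    out->data__.klass = klass;
}
```

With this shape, dropping one of the field assignments by accident leaves a zero rather than a stale stack value, which tends to fail deterministically instead of only on some CI machines.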

Member Author

kg commented Feb 7, 2025

I think this is ready.
llvmfullaot has one aot compile failure that I think is probably latent and nobody noticed:

2025-02-06T14:02:08.9864554Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/Regression/Regression_2/Runtime_68568.dll
2025-02-06T14:02:08.9864965Z   AOTID 62F7688C-21FA-9BEF-3EE4-F3B99EE9DDB6
2025-02-06T14:02:08.9865436Z   Can't find custom attr constructor image: /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/Regression/Regression_2/Runtime_68568.dll mtoken: 0x0a000002 due to: Method not found: void Xunit.SkipOnMonoAttribute..ctor(string)
2025-02-06T14:02:08.9866008Z   Failed to load custom attributes from method int Runtime_68568:Main () due to Method not found: void Xunit.SkipOnMonoAttribute..ctor(string)
2025-02-06T14:02:09.0067008Z   FullAOT cannot continue if there are loader errors.
2025-02-06T14:02:09.0075096Z /__w/1/s/src/mono/msbuild/aot-compile.proj(19,9): error MSB3073: The command "/__w/1/s/artifacts/tests/coreclr/linux.x64.Release/Tests/Core_Root/corerun /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/Regression/Regression_2/Runtime_68568.dll" exited with code 1.

I'm fairly certain I didn't cause it.

2 participants