Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cached interface dispatch for coreclr #111771

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

davidwrighton
Copy link
Member

@davidwrighton davidwrighton commented Jan 24, 2025

Enabling cached interface dispatch as an options for CoreCLR (should reduce memory usage/remove RWX pages, at the cost of reducing performance)

  • Current implementation is only enabled in release builds for Apple platforms with restrictions on code generation
  • On Debug/Checked builds of X64/Arm64 platforms it is possible to enable the feature by setting the DOTNET_UseCachedInterfaceDispatch environment variable to 1. (NOTE: Enabling this feature requires running on a processor which supports 128 bit compare and swap, which has implications on Linux X64 builds, and would have implications for Loongarch/RiscV if we enable the code there.)
  • The strategy is to re-use the existing VirtualCallStubManager infrastructure for all non-code-generation driven lookups, but to replace the stub generation logic with the CachedInterfaceDispatch paths from NativeAOT.
  • In addition, to support this, we need to extend the size of a Dispatch cell embedded in R2R images, so various parts of that logic are now capable of generating double pointer aligned dispatch cells when commanded. Infrastructure to set the right behavior for targetting apple platforms has not yet been implemented although the general purpose support is in place.

Known issues addressed before making a non-draft PR

  • Env flag for swapping between cached interface dispatch and VSD when both features are enabled in the code
  • Testing of normal scenarios
  • Testing of diagnostic scenarios (No regressions in VSD mode, Cached Interface Dispatch also successfully steps)
  • Consider enabling the resolve cache for cached interface dispatch scenarios (will not do until perf results show that the current scheme is slow)
  • Enable support for R2R with cached interface dispatch
  • Make the Indirection cell size 2 pointers instead of 4
  • Free dispatch cache infrastructure on collectible assembly destruction
  • Allocate cache chunks with LoaderHeap instead of malloc
  • Move PalInterlockedCompareExchange128 to the PAL or minipal
  • Actually correct construction of indirection cell for virtual dispatch on vtables

Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

…te that this requires adding the -mcx16 switch to clang, so that cmpxchg16b instruction gets generated, which is an increase in the baseline CPU required by CoreCLR on Linux, and isn't likely to be OK for shipping publicly
…veAOT cached interface dispatch implementation (as it isn't actually used)

Update IsIPinVirtualStub to check the AVLocations, not the stub entry points
@davidwrighton davidwrighton reopened this Jan 24, 2025
- Enable generating double pointer indirection cells in R2R files using
  command line switch.
- Fix VTableOffset calculation
- Add logic in ExternalMethodFixupWorker to handle the double pointer
  indirection cells.
…llocating the memory for the dispatch using the LoaderHeap

Also tweak a collectible assembly test to actually use cached interface dispatch
src/coreclr/pal/inc/pal.h Outdated Show resolved Hide resolved
@davidwrighton davidwrighton changed the title [DRAFT] Cached interface dispatch for coreclr Cached interface dispatch for coreclr Feb 3, 2025
@davidwrighton davidwrighton marked this pull request as ready for review February 3, 2025 23:46
src/coreclr/clrfeatures.cmake Outdated Show resolved Hide resolved
Comment on lines 54 to 56
else()
set(FEATURE_CORECLR_VIRTUAL_STUB_DISPATCH 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
set(FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH 1)
else()
set(FEATURE_CORECLR_VIRTUAL_STUB_DISPATCH 1)
set(FEATURE_CACHED_INTERFACE_DISPATCH 1)
else()
set(FEATURE_VIRTUAL_STUB_DISPATCH 1)

reg = REG_R11;
regMask = RBM_R11;
}
reg = REG_R11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AAGGHH.. reading this has sent me down the rabbit hole of sadness which has told me that I need to move both CoreCLR and NativeAOT to use r10, and not to r11 as that is the only way to keep CFG from exploding sadly. (Or I can keep the difference between CoreCLR and NativeAOT).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of getting these unified as you may guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather... this won't break stuff, but the codegen for CFG is required to be a bit non-ideal in native aot..OTOH, it appears with some quick testing that this doesn't actually effect the codegen in any case, since our CFG generation logic isn't what I'd call great right now, so I'd like to keep moving everything to r11. If we want to use r10 we can swap back all of the amd64 stuff at once.

DispatchToken token = VirtualCallStubManager::GetTokenFromFromOwnerAndSlot(ownerType, slot);

INTERFACE_DISPATCH_CACHED_OR_VSD(
return FALSE; // R2R interface dispatch currently only supports fixups with a single pointer, return FALSE to skip using the method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR has changes to add this support. Did you hit any roadblocks with making this actually work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not. But the support is only for fixups which are used directly for dispatch. This code path is used when gathering a fixup as a function pointer to use directly such as embedded into a delegate or something. I believe that this logic may only actually used for R2R versioning where we replace a non-virtual method with a virtual method and construct a delegate. I'm not entirely sure. At the very least I should write a clearer comment. I'll also see if I can actually trigger this scenario. It's quite possible that the existing R2R behavior with VSD is actually not compliant with our latest VM logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually completely dead code. The VIRTUAL_ENTRY_SLOT path was specific to NGEN. (So are the VIRTUAL_ENTRY_DEF/REF_TOKEN paths, but those might actually be useful for R2R as a load simplification, so I'm not touching those. I'm going to delete the logic that implements VIRTUAL_ENTRY_SLOT.

@@ -76,3 +76,16 @@ class VMToOSInterface
// true if it succeeded, false if it failed
static bool ReleaseRWMapping(void* pStart, size_t size);
};

#if defined(HOST_64BIT) && defined(FEATURE_CACHED_INTERFACE_DISPATCH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we added the InterlockedCompareExchange128 to the VMToOSInterface above and implemented it for Unix / Windows in the respective subfolders.
When I have started with this minipal, the intent was that all stuff that needs to be platform specific will go there. Once we get rid of coreclr PAL, this was meant to be the only place for platform specific code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is a case where we want to have a shared set of PAL apis between CoreCLR and NativeAOT, and NativeAOT hasn't moved towards the VmToOSInterface idea (And this change is already WAY too big to add that to it.) I'd welcome some rationalization after this is all merged.

@@ -1,9 +1,11 @@
set(GC_DIR ../../gc)
set(SHARED_RUNTIME_DIR ../../shared_runtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit - we don't have any folders that have names with underscores, it would be nice to stay consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to hear opinions on the directory structure from several people here. @AaronRobinsonMSFT might also be interested to comment on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jit and gc are shared too, but they do not have shared_ in the name. I am not a fan of the shared_ prefix.

Maybe just call it runtime?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime seems appropriate. I agree that a prefix would be nice, but I agree with @janvorli's consistency argument. @jkotas's dislike of shared_ is something I share too. I think shared is also dangerous as it breeds a dumping ground akin to utils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime it is. I'll make the changes soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be leaving the CMAKE name of the directory as SHARED_RUNTIME_DIR though, as RUNTIME_DIR is already taken for src/coreclr/nativeaot/Runtime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename RUNTIME_DIR -> NATIVEAOT_RUNTIME_DIR and SHARED_RUNTIME_DIR -> RUNTIME_DIR? There are only 25 instances of RUNTIME_DIR spread over 4 files.

Comment on lines +55 to +58
# Allow 16 byte compare-exchange (cmpxchg16b)
add_compile_options($<${FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH}:-mcx16>)
endif()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check for both UNIX/AMD64 and if FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH is set?

I'm definitely not a CMake expert, but I think this syntax means if FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH is non-zero, then apply the flag. If so, can't we remove the explicit if?

My push back here if we want to expand this, then at present we need to update two locations instead of one, where FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH is set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can combine all the options into the generator expression, but it usually becomes pretty unreadable with all the $< that get accumulated. So I usually use it just for the build configs or stuff where the expression is simple.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that. My point is about the need for the if check at all. Isn't the generator already conditioned on ${FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH} being true? If so, why do we need the check above it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's right, I've missed this point. We don't need the if around that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to set this flag on Windows or when targeting any other processor type than X64. So we still need the if. FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH is sometimes enabled for Windows and it is really intended to be used for Arm64 in most cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're saying that FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH is valid without the 16 byte compare exchange?

Copy link
Member Author

@davidwrighton davidwrighton Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm saying that the -mcx16 switch is only valid/useful for the targeting X64 on unix platforms. I can't find ANY documentation that hints that this switch is useful on architectures other than X64. On Arm64, the baseline hardware we support requires 16 byte compare and swap, and the needed instructions are unconditionally available (via libatomic, they will either be the ldaxp and stxp pair or the caspal instruction, and on Mac/iOS the compiler will always general the caspal instruction. On Winodws, the -mcx16 switch generates a compiler warning and isn't useful, as all hardware which runs supported versions of Windows has the compare exchange 16 byte instruction.

if (pVcsMgr->cache_entry_heap != NULL) heapsToEnumerate.Push(pVcsMgr->cache_entry_heap);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif
#endif // FEATURE_VIRTUAL_STUB_DISPATCH

@@ -3613,14 +3613,16 @@ ClrDataAccess::TraverseVirtCallStubHeap(CLRDATA_ADDRESS pAppDomain, VCSHeapType
break;

case CacheEntryHeap:
#ifdef FEATURE_VIRTUAL_STUB_DISPATCH
pLoaderHeap = pVcsMgr->cache_entry_heap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a small comment here on why the CacheEntryHeap is not used in the caching logic? I'm ignorant of this area and it feels odd that the VirtualCallStubManager is used, but ignored in some cases. Perhaps an assert or some other breadcrumb to indicate why this isn't appropriate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use it, and possibly should, but as I noted in the PR description, I intentionally disabled this feature since it was more work to enable, and this change is already too big for easy development. If we decide that this sort of caching is advantageous, it would be appropriate in a separate PR to enable it for the cached interface dispatch scenario.

@@ -3663,7 +3665,9 @@ static const char *LoaderAllocatorLoaderHeapNames[] =
"FixupPrecodeHeap",
"NewStubPrecodeHeap",
"IndcellHeap",
#ifdef FEATURE_VIRTUAL_STUB_DISPATCH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds to the confusion from above considering case CacheEntryHeap: remains, but we define out the string.

"CacheEntryHeap",
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif
#endif // FEATURE_VIRTUAL_STUB_DISPATCH

pLoaderHeaps[i++] = HOST_CDADDR(pVcsMgr->cache_entry_heap);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif
#endif // FEATURE_VIRTUAL_STUB_DISPATCH

}
static inline DispatchToken FromCachedInterfaceDispatchToken(UINT_PTR token)
{
return DispatchToken(token >> 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return DispatchToken(token >> 1);
_ASSERTE(IsCachedInterfaceDispatchToken(token));
return DispatchToken(token >> 1);

if (cellCacheHeader != NULL)
{
InterfaceDispatch_DiscardCacheHeader(cellCacheHeader);
pDispatchCell->m_pCache = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is odd. We have a getter, GetCache(), above, but then we unilaterally manipulate the field. We should either always access the field or do it through methods/abstractions.

#ifdef FEATURE_CACHED_INTERFACE_DISPATCH
if (VirtualCallStubManager::isCachedInterfaceDispatchStubAVLocation(f_IP))
return TRUE;
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif
#endif // FEATURE_CACHED_INTERFACE_DISPATCH

src/coreclr/vm/prestub.cpp Show resolved Hide resolved
@@ -1,9 +1,11 @@
set(GC_DIR ../../gc)
set(SHARED_RUNTIME_DIR ../../shared_runtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime seems appropriate. I agree that a prefix would be nice, but I agree with @janvorli's consistency argument. @jkotas's dislike of shared_ is something I share too. I think shared is also dangerous as it breeds a dumping ground akin to utils.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants