JIT: align Tier1 methods at 16 byte boundaries for xarch [WIP] #21518
Align Tier1, small and IBC hot methods to 16 byte boundaries for x64 and x86. Consensus from various folks I polled was that this isn't as helpful for arm architectures, so for now this is xarch only. This ensures that instruction prefetch pulls in as much code as possible. It should also improve performance stability in some benchmarks, as well as opening the door for possible loop-top alignment padding. Resolves #16873.
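For readers unfamiliar with the arithmetic, here is a minimal sketch of what aligning a method start to a 16 byte boundary involves. The helper names `AlignUp` and `PaddingFor` are hypothetical and chosen for illustration; this is not the JIT's or the runtime's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from the JIT): round `addr` up to the next
// multiple of `alignment`. `alignment` must be a nonzero power of two.
inline std::uintptr_t AlignUp(std::uintptr_t addr, std::uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: number of padding bytes that must precede a method
// that would otherwise start at `addr`, so that it starts on an
// `alignment`-byte boundary instead.
inline std::uintptr_t PaddingFor(std::uintptr_t addr, std::uintptr_t alignment)
{
    return AlignUp(addr, alignment) - addr;
}
```

With 16 byte alignment the padding per method is between 0 and 15 bytes, which is why the code size impact is likely to be most visible for small methods.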
Keeping this as [WIP] for now since I'd like to see some of the other R2R and tiering perf changes land first. Evaluating this change may be tricky, nothing depends on it, and I'll be on vacation a fair amount in the coming weeks.

I don't expect 16 byte alignment to universally improve perf, as whatever method alignment we happen to get now on a given run may work out better. But over time this alignment should be more perf stable, and it also gives the jit the opportunity to pad internally to avoid bad instruction fetch issues for hot loops, eventually ending up in an overall better place. I plan to make multiple measurements on different machines and over time to try to show that results are indeed more stable and, for benchmarks that consistently stand out as better or worse, drill in and try to understand why.

I don't think this will cause much extra code space fragmentation, though aligning small methods may have a notable impact. So another goal is to actually measure the size impact, especially on realistic apps.

Using the performance repo I have done a couple of base and diff runs of local builds on a Skylake i6700 and compared them. Data below is filtered by absolute difference of at least 1ms, as results for shorter running tests seem to swing wildly. @adamsitnik if you want to experiment with this change and see why there's such volatility in the shorter tests, I'd love to see any insights you uncover.

cc @dotnet/jit-contrib @fiigii

BASE vs DIFF 1
BASE vs DIFF 2
|
For AMD machines, the optimization manuals going back to 2014 (so family 15h and onward) indicate that they have a 32-byte aligned fetch window. This could potentially be beneficial for the Azure Lv2-Series VMs (powered by the EPYC-series processors). |
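To make the fetch window point concrete, here is a small sketch that counts how many aligned fetch windows a run of code bytes touches for a given start offset. The function name `FetchWindowsTouched` and the 20 byte code size are assumptions made for illustration; real front-end behavior depends on much more than this simple count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (illustration only): number of `window`-byte aligned
// fetch windows touched by `size` bytes of code starting at `start`.
// `size` and `window` must both be nonzero.
inline std::uint64_t FetchWindowsTouched(std::uint64_t start,
                                         std::uint64_t size,
                                         std::uint64_t window)
{
    assert(size != 0 && window != 0);
    std::uint64_t firstWindow = start / window;
    std::uint64_t lastWindow  = (start + size - 1) / window;
    return lastWindow - firstWindow + 1;
}
```

Under this model, 20 bytes of code starting at offset 0x10 is 16 byte aligned but still straddles two 32 byte windows, while the same code starting at 0x20 fits in one.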
@AndyAyersMS awesome! I am going to clone your fork and re-run the tests on my machines. @AndreyAkinshin you might also be interested in this change in case you need to update your book before publishing it ;) |
@AndyAyersMS I did a bit of VTune profiling for this change on That causes lots of caller-saving code in Disasm of GetCalculated3 from VTune
|
Ah,
coreclr/src/System.Private.CoreLib/shared/System/Numerics/Vector.cs Lines 2341 to 2348 in 12a8870
|
@fiigii, It's worth noting that, as of right now, the types in
I actually hit this the other day when doing some investigation on https://github.com/dotnet/corefx/issues/31425. The fix to get the SIMD intrinsics to start respecting the
|
@tannergooding Thanks for noting. But this problem is not related to "intrinsic", let's discuss that in dotnet/corefx#31425 😄 |
It turns out for x64 we're already doing 16 byte alignment of methods. The jit requests default alignment from the runtime, and so the alignment is ultimately governed by the definition of So impacts seen above are possibly just normal variation given 16 byte alignment. Note that there is some known perf impact on Intel processors from even stronger alignments, say 32 bytes, which may explain some of the above variability. We could of course consider asking for stronger alignments, but it may make sense to look at adjusting internal alignments first to take advantage of the 16 byte alignments we already have. So I am going to close this one out. |