-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Intrinsicify .SequenceCompareTo(byte, ...) #22127
Conversation
Test extensions dotnet/corefx#34742 |
342cbe4
to
cde8b12
Compare
Added performance numbers and rebased Different
AlmostSameButDifferent
Same
|
@tannergooding Does this look good to you? |
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx @dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx @dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx |
We should always test all the paths of hardware-specific code. For this PR, we can turn off AVX to test SSE2 path and turn off all intrinsic to test the software fallback. However, the S.N.Vector path cannot be covered on x86/x64... |
@benaadams can you run with both with cache misses and without cache misses? Also with high numbers like 32K to 1M (with a few in-between) |
Different
AlmostSameButDifferent
Same
If you can come up with a benchmark for that where killing the cache doesn't dominate. However, it should effect both similarly (with memory access latency coming to the fore over instructions speed), does it add value?* (*as its straightforward sequential access) At a guess the |
Could maybe switch to the cpu compare intrinsics above a certain length? However I don't know much about their behaviour or usage /cc @fiigii |
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test please |
Looks like infra error on
Only renamed a label 5f6dcb8 since the previous CI pass so doubt its related |
@benaadams We have a good starting place IMHO. This is testing difference at byte BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-009812
[Host] : .NET Core 3.0.0-preview-27323-8 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7301), 64bit RyuJIT
Job-IKWERK : .NET Core 3.0.0-preview-27323-8 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7301), 64bit RyuJIT
Jit=RyuJit Platform=X64 Runtime=Core
|
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test please |
Before you say anything, this is my first time with the 3.0 codegen so I am a bit sketchy in the details of how Intrisics are treated on the 3.0 JIT. So, I have a few questions regarding the assembler code I am looking at. Also I see calls everywhere, which doesnt make sense to me... could tiered JIT be playing me a hardball maybe? |
Can switch it off with the env variable:
|
Thanks I was looking at how to do that, and couldnt find the relevant bit... |
@AndyAyersMS Shall we consider https://github.com/dotnet/coreclr/issues/14474 for 3.0? The current Tier0 does not guarantee "lower compilation cost"... |
I do not think that would help. The method you want to do the AggressiveOptimization for here is SequenceCompareTo that is not marked as I think it may be a good idea to mark these low-level optimized methods like |
I guess that is related to alignment, but need more detailed data (e.g., cache-line split counter via VTune) for future analysis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@fiigii I still think doing some cheap/obvious opts in Tier0 makes sense, but am reluctant at this point to make changes in Tier0 for 3.0. I don't think we're ready to have Tier0 to diverge from minopts just yet (to keep the test matrix somewhat contained) and we don't want to risk changes in minopts... we have made what appeared to be simple/safe changes there in the past and been surprised to find we'd gotten it wrong, eg #13245 lead to #16928 and #20047. @benaadams you'd want the callers of intrinsic methods to be optimized aggressively by default. Seems cleaner to just call out the specific methods that should always be optimized. |
Added PR: #22191 |
The majority of the hardware intrinsics are force-expanded and have to be inlined anyways (as is evident by actual codgen). It looks like the actually inefficiency is from the JIT not dropping the "dead" code-paths. |
95e56b0
to
603a0f8
Compare
603a0f8
to
80205fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks for adding the clarifying comments.
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Signed-off-by: dotnet-bot <[email protected]>
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Signed-off-by: dotnet-bot <[email protected]>
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Signed-off-by: dotnet-bot <[email protected]>
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Signed-off-by: dotnet-bot <[email protected]>
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Signed-off-by: dotnet-bot <[email protected]>
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Signed-off-by: dotnet-bot <[email protected]>
@AndyAyersMS If we want to diverge Tier0 from minopts in the future, can we consider to create a new compiler for Tier0? Perhaps, a simpler (and faster) one-pass compiler (similar to Rotor's FJIT) is better for this pure JIT environment. |
* Speedup .SequenceCompareTo(byte, ...) * Rename jump location * Better annotations for clarity Commit migrated from dotnet/coreclr@665593e
Using benchmarks from https://github.com/dotnet/coreclr/issues/22078#issuecomment-455843668
Different
AlmostSameButDifferent
Same
Fixes https://github.com/dotnet/coreclr/issues/22078
/cc @CarolEidt @fiigii @tannergooding @ahsonkhan
/cc @redknightlois @ayende