Intrinsicify SpanHelpers.IndexOf{Any}(byte, ...) #22118
Conversation
Working on perf numbers
Some interesting speed bumps in there; might be able to do better.
Array length 512
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;

    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        0, 1, 2, 3, 4, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Position { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)Position;
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }
        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        // Bytes 0..255, concatenated twice -> 512 elements (Range(0, 256), so byte 255 is present).
        var e = Enumerable.Range(0, 256).Select(b => (byte)b);
        s_source = e.Concat(e).ToArray();
    }
}
Array length = Length, position = Length - 1 (the asm for lengths < 32 is identical)
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;

    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Length { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)(Length - 1);
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }
        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        s_source = Enumerable.Range(0, Length).Select(b => (byte)b).ToArray();
    }
}
Mostly seems good, other than what looks to be an infra issue (not sure if it's worth worrying about): Test Pri0 Windows_NT x64 checked Job
Force-pushed from e2177ec to 497cb8c
So it can be used by other types than byte
Same change would work for
Added
@fiigii @tannergooding Could you please take a look?
@@ -199,10 +200,22 @@ public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)
IntPtr index = (IntPtr)0; // Use IntPtr for arithmetic to avoid unnecessary 64->32->64 truncations
Do we know why this code is currently using IntPtr rather than the:
#if BIT64
using nint = System.Int64;
#else
using nint = System.Int32;
#endif
code that everywhere else seems to prefer?
This is left-over from times when this lived in CoreFX and it was not compiled bitness-specific. It would be good for readability to switch this over to nuint.
It's a bit messy to clean that up in this PR; will do a follow-up.
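For reference, the aliasing convention being referred to looks roughly like this; the method body is a hypothetical scalar sketch just to show the native-width offset arithmetic, not code from this PR:

#if BIT64
using nint = System.Int64;   // native-width signed integer on 64-bit
#else
using nint = System.Int32;   // native-width signed integer on 32-bit
#endif

using System;
using System.Runtime.CompilerServices;

internal static class IndexOfScalarSketch
{
    // Hypothetical scalar IndexOf showing the offset arithmetic; the real
    // SpanHelpers code is considerably more involved.
    public static int IndexOf(ref byte searchSpace, byte value, int length)
    {
        nint offset = 0;
        nint remaining = length;

        while (remaining > 0)
        {
            // AddByteOffset takes an IntPtr byte offset; the cast from the
            // native-width alias is free at the machine level.
            if (value == Unsafe.AddByteOffset(ref searchSpace, (IntPtr)offset))
                return (int)offset;

            offset += 1;
            remaining -= 1;
        }

        return -1;
    }
}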
nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
if (length >= Vector128<byte>.Count * 2)
{
    int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector128<byte>.Count - 1);
It would be really nice if the JIT would just optimize Unsafe.AsPointer(ref searchSpace) % Vector128<byte>.Count, or if we had a helper function for this type of code.
@tannergooding - is there an open issue for this?
Not sure, I'll take a look in a little bit (and will log one if we don't). It's probably worth noting that this optimization is only applicable for unsigned types (IIRC).
Looks like the JIT already does this optimization for unsigned types. This is possibly just something that is normally missed due to most inputs being signed by default.
Does it do it for an int that becomes a const when inlined? (As would be the case here.)
I'll test that case after lunch. It seems like it would be valid to do when the constant is known to be positive.
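For illustration, the unsigned-vs-signed difference being discussed looks like this in isolation (a standalone sketch, not code from the PR):

internal static class AlignmentSketch
{
    // For unsigned operands, x % 16 and x & 15 produce the same value, so the
    // JIT can emit a single AND instruction.
    public static uint RemainderUnsigned(uint x) => x % 16;

    // For signed operands the two differ on negative inputs (-1 % 16 == -1,
    // while -1 & 15 == 15), so the JIT must emit a longer sign-aware sequence
    // unless it can prove x is non-negative -- e.g. a positive constant after
    // inlining, which is the case raised above.
    public static int RemainderSigned(int x) => x % 16;
}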
Vector256<byte> comparison = Vector256.Create(value);
do
{
    Vector256<byte> search = Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref searchSpace, index));
Why ReadUnaligned rather than LoadVector256? Is it to avoid pinning?
No pinning here... :)
It's over raw byte data, which is generally the longest type of data, so it's probably best not to get in the way of the GC if it wants to compact the heap, rather than create a giant fixed block.
Hmmm, the initial assumption was that HWIntrinsics would likely be used with longer inputs and, in those cases, that pinning would be desirable (so alignment, cache-coherency, etc. could be preserved).
If that isn't the case, we might want to revisit whether or not we also expose Load overloads that take ref T rather than just T*.
That's probably still the case for directly using them; but everything is getting plumbed through the SpanHelpers methods, so they are generic methods for all data sizes. e.g. String, Array, Span, ReadOnlySpan and CompareInfo all use SpanHelpers now rather than their own implementations.
Fixing the data would be required to ensure that the alignment check remains valid.
The user can pin the data if they want the alignment to remain fixed (though they may not know to do so); Kestrel's arrays are already pre-pinned so a second pin here would add no advantage, native memory pinning here would add no advantage, and for lengths < 32 pinning would add no advantage; but in all those cases it would be extra unnecessary work to fix the data (GC considerations aside).
Is there a significant performance benefit to using LoadVector256 over ReadUnaligned here?
Otherwise the unalignment issue will only affect the method if the GC moves the data; and then, well, you just had a GC, so that will probably have more impact than being misaligned afterwards; and it's also likely a low-frequency event vs the number of times these methods are called.
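For reference, the two load shapes being weighed here look roughly like this (an illustrative sketch; the helper and class names are made up for the example):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class LoadStyleSketch
{
    // Pinning-free shape: read 32 bytes through a managed ref. The GC stays
    // free to move the array; the read is simply treated as potentially
    // unaligned.
    public static Vector256<byte> LoadViaRef(ref byte searchSpace, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref searchSpace, offset));

    // Pointer-based shape: requires the data to be pinned (or be native
    // memory) so the address stays valid for the duration of the call.
    public static unsafe Vector256<byte> LoadViaPointer(byte* searchSpace, int offset) =>
        Avx.LoadVector256(searchSpace + offset);
}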
What would the index argument of the API be? We do not have native int public type yet...
What would the index argument of the API be? We do not have native int public type yet...
Presumably just IntPtr, as the existing AddByteOffset method takes: public static ref T AddByteOffset<T>(ref T source, IntPtr byteOffset).
This emits an IL signature using native int and works correctly with languages (like F#) that treat IntPtr as a native-sized integer. This should also work with C#, unless they decide to create a new wrapper type (rather than use partial erasure) or unless they decide that the nint type should have a modreq/modopt.
This should also work with C#, unless they decide to create a new wrapper type
I do not think there was a decision made on this one yet.
Right, what C# does is still pending further language discussion. However, we have already exposed a number of similar APIs that take IntPtr (mostly on S.R.CS.Unsafe), and the worst-case scenario is that we may want to expose an additional convenience overload (likely just implemented in managed code) that casts from whatever C# uses to IntPtr.
It could also just be internal for the time being, to make our own code more readable.
Added local LoadVector256, LoadVector128, LoadVector and LoadUIntPtr helpers to see how it looks; makes it more readable.
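Those local helpers would presumably be thin wrappers along these lines (a sketch of the shape, assuming the ref + IntPtr-offset convention discussed above, rather than the PR's exact code):

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class SpanHelpersSketch
{
    // Wrappers so the hot loops read as "load a vector at this offset"
    // instead of spelling out the Unsafe plumbing each time.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<byte> LoadVector128(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector256<byte> LoadVector256(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector<byte> LoadVector(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static UIntPtr LoadUIntPtr(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<UIntPtr>(ref Unsafe.AddByteOffset(ref start, offset));
}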
I would second the suggestions to factor out some of the common code. Otherwise it LGTM overall.
I also second the suggestions to factor out common code, but there are some points you may need to pay attention to:
Force-pushed from 430a825 to 78e1f89
Skipped the bounds check in the software fallback (acdd8e9); I don't see any other changes to make. Any feedback?
LGTM
LGTM, just one question.
Found7:
return (int)(byte*)(index + 7);
return (int)(byte*)(offset + 7);
Can we use a variable to avoid the multiple goto targets? For example
if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + i))
{
    delta = i;
    goto Found;
}
...
Found:
return (int)(byte*)(offset + delta);
Not sure; it pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?
Currently these sections look like this:
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG14
lea r11, [r9+2]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG15
lea r11, [r9+3]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG16
lea r11, [r9+4]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG17
lea r11, [r9+5]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG18
lea r11, [r9+6]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG19
lea r11, [r9+7]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG20
pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?
Hmm, I always prefer smaller code-size over "tricky" loop-unrolling (if the loop body does not have long-latency instructions). Sometimes loop-unrolling can show slightly better perf data on microbenchmarks, but I believe the I-cache is the most precious resource for large real applications (e.g., ASP.NET servers).
Additionally, if a loop body is small enough, it gets specially optimized on Intel architectures (please see the "Loop Stream Detector" sections in the Intel optimization manual). I think most of the fast-chain loops could be really small.
For completeness, the exit points are:
G_M10207_IG11:
mov eax, -1
G_M10207_IG12:
vzeroupper
ret
G_M10207_IG13:
mov eax, r9d
jmp SHORT G_M10207_IG21
G_M10207_IG14:
lea rax, [r9+1]
jmp SHORT G_M10207_IG21
G_M10207_IG15:
lea rax, [r9+2]
jmp SHORT G_M10207_IG21
G_M10207_IG16:
lea rax, [r9+3]
jmp SHORT G_M10207_IG21
G_M10207_IG17:
lea rax, [r9+4]
jmp SHORT G_M10207_IG21
G_M10207_IG18:
lea rax, [r9+5]
jmp SHORT G_M10207_IG21
G_M10207_IG19:
lea rax, [r9+6]
jmp SHORT G_M10207_IG21
G_M10207_IG20:
lea rax, [r9+7]
G_M10207_IG21:
vzeroupper
ret
A follow-up turning them into regular for loops might be worth investigating then?
The current microbenchmark-oriented performance culture in the .NET Core repos favors bigger streamlined code because it tends to produce better results in microbenchmarks. It is common wisdom (at Microsoft at least) that smaller code runs faster in real workloads because of the factors @fiigii mentioned. The code-bloating optimizations are worth it in maybe 1% of the cases. You have seen me pushing back on the more extreme cases of code bloat. This case is on the edge. My gut feel is that this one is probably good as it is. We would need to be able to get data about performance in real workloads to tell with confidence.
A follow-up turning them into regular for loops might be worth investigating then?
Yes, that should be follow-up work, not in this PR.
Definitely worth revisiting; lots of remarks amount to "the CPU might not be happy with 8 conditional branches in quick succession", e.g.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four
branches in a 16-byte chunk.
Though not sure how you could; here the jumps are 6 bytes and the compares are 4 bytes... it would be a push to get 4 compares and 4 jumps into 16 bytes.
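To make the suggested follow-up concrete, the unrolled compare/goto tail could in principle collapse into a small loop like the sketch below (illustrative only; whether it actually wins over the unrolled chain is exactly what would need measuring):

using System;
using System.Runtime.CompilerServices;

internal static class CompactTailSketch
{
    // Compact replacement for the 8-way unrolled compare chain: one small
    // loop body, one branch target, smaller I-cache footprint.
    public static int IndexOfTail(ref byte searchSpace, byte value, IntPtr offset, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (value == Unsafe.AddByteOffset(ref searchSpace, offset + i))
                return (int)offset + i;
        }

        return -1;
    }
}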
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
coreclr-ci passed, first time I've seen that happen! 😃
* Speedup SpanHelpers.IndexOf(byte)
* 128 * 2 alignment
* Move TrailingZeroCountFallback to common SpanHelpers so it can be used by other types than byte
* Speedup SpanHelpers.IndexOfAny(byte, ...)
* Indent for support flags
* More helpers, consistency in local names/formatting, feedback
* Skip bounds check in software fallback

Signed-off-by: dotnet-bot <[email protected]>
Commit migrated from dotnet/coreclr@07d1e6b
Learnings from #22019; improvement on #21073
Applied to
MoveMask/vpmovmskb used for equality in the vectorized path has already flagged the bits that match; so if we use the specific intrinsics rather than the generic Vector, we just need to determine which bit is set, rather than do further processing to determine the element offset.
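To illustrate the core idea, a simplified sketch of the AVX2 path might look like the following (assuming Avx2.IsSupported; the class and helper names are illustrative, and the real code additionally handles alignment, the Vector128/Vector<T> fallbacks, and the unrolled scalar tail):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class MoveMaskSketch
{
    // Compare 32 bytes at a time; Avx2.MoveMask already yields one bit per
    // matching element, so the element offset is just the position of the
    // lowest set bit -- no further per-element processing is needed.
    public static int IndexOf(ref byte searchSpace, byte value, int length)
    {
        Vector256<byte> comparison = Vector256.Create(value);
        int offset = 0;

        while (length - offset >= Vector256<byte>.Count)
        {
            Vector256<byte> search = Unsafe.ReadUnaligned<Vector256<byte>>(
                ref Unsafe.AddByteOffset(ref searchSpace, (IntPtr)offset));

            int matches = Avx2.MoveMask(Avx2.CompareEqual(search, comparison));
            if (matches != 0)
                return offset + TrailingZeroCount(matches);

            offset += Vector256<byte>.Count;
        }

        // Scalar tail for the remaining < 32 bytes.
        for (; offset < length; offset++)
        {
            if (value == Unsafe.AddByteOffset(ref searchSpace, (IntPtr)offset))
                return offset;
        }

        return -1;
    }

    // Software fallback for counting trailing zero bits; a hypothetical
    // stand-in for the TrailingZeroCountFallback helper the PR moves into
    // the shared SpanHelpers.
    private static int TrailingZeroCount(int match)
    {
        uint bits = (uint)match;
        int count = 0;
        while ((bits & 1) == 0)
        {
            bits >>= 1;
            count++;
        }
        return count;
    }
}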
Performance measurements:
Array length 512: significant improvements (up to more than double) with the found item in any position. #22118 (comment)
Array lengths >= 32 with the item in the last position: significant improvements (up to more than double). #22118 (comment)
/cc @CarolEidt @fiigii @tannergooding @ahsonkhan