Intrinsicify SpanHelpers.IndexOf{Any}(byte, ...) #22118
Conversation
Working on perf numbers
Some interesting speed bumps in there; might be able to do better.
Array length 512
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;

    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        0, 1, 2, 3, 4, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Position { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)Position;
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }
        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        // Bytes 0..255, concatenated twice -> 512 elements (Range(0, 256), so byte 255 is present).
        var e = Enumerable.Range(0, 256).Select(b => (byte)b);
        s_source = e.Concat(e).ToArray();
    }
}
Array length = Length, position = Length - 1 (the asm for lengths < 32 is identical)
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;

    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Length { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)(Length - 1);
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }
        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        s_source = Enumerable.Range(0, Length).Select(b => (byte)b).ToArray();
    }
}
Mostly seems good, other than what looks to be an infra issue (not sure if it's worth worrying about): Test Pri0 Windows_NT x64 checked Job
Force-pushed from e2177ec to 497cb8c
So it can be used by other types than byte
Same change would work for
Added
@fiigii @tannergooding Could you please take a look?
@@ -199,10 +200,22 @@ public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)
IntPtr index = (IntPtr)0; // Use IntPtr for arithmetic to avoid unnecessary 64->32->64 truncations
Do we know why this code is currently using IntPtr rather than the:
#if BIT64
using nint = System.Int64;
#else
using nint = System.Int32;
#endif
code that everywhere else seems to prefer?
This is left-over from times when this lived in CoreFX and it was not compiled bitness-specific. It would be good for readability to switch this over to nuint.
It's a bit messy to clean that up in this PR; will do a follow-up.
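For reference, the aliasing convention being referred to looks roughly like this; the method body is a hypothetical scalar sketch just to show the native-width offset arithmetic, not code from this PR:

#if BIT64
using nint = System.Int64;   // native-width signed integer on 64-bit
#else
using nint = System.Int32;   // native-width signed integer on 32-bit
#endif

using System;
using System.Runtime.CompilerServices;

internal static class IndexOfScalarSketch
{
    // Hypothetical scalar IndexOf showing the offset arithmetic; the real
    // SpanHelpers code is considerably more involved.
    public static int IndexOf(ref byte searchSpace, byte value, int length)
    {
        nint offset = 0;
        nint remaining = length;

        while (remaining > 0)
        {
            // AddByteOffset takes an IntPtr byte offset; the cast from the
            // native-width alias is free at the machine level.
            if (value == Unsafe.AddByteOffset(ref searchSpace, (IntPtr)offset))
                return (int)offset;

            offset += 1;
            remaining -= 1;
        }

        return -1;
    }
}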
nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
if (length >= Vector128<byte>.Count * 2)
{
    int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector128<byte>.Count - 1);
It would be really nice if the JIT would just optimize Unsafe.AsPointer(ref searchSpace) % Vector128<byte>.Count, or if we had a helper function for this type of code.
@tannergooding - is there an open issue for this?
Not sure, I'll take a look in a little bit (and will log one if we don't). It's probably worth noting that this optimization is only applicable for unsigned types (IIRC).
Looks like the JIT already does this optimization for unsigned types. This is possibly just something that is normally missed due to most inputs being signed by default.
Does it do it for an int that becomes a const when inlined? (As would be the case here.)
I'll test that case after lunch. It seems like it would be valid to do when the constant is known to be positive.
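For illustration, the unsigned-vs-signed difference being discussed looks like this in isolation (a standalone sketch, not code from the PR):

internal static class AlignmentSketch
{
    // For unsigned operands, x % 16 and x & 15 produce the same value, so the
    // JIT can emit a single AND instruction.
    public static uint RemainderUnsigned(uint x) => x % 16;

    // For signed operands the two differ on negative inputs (-1 % 16 == -1,
    // while -1 & 15 == 15), so the JIT must emit a longer sign-aware sequence
    // unless it can prove x is non-negative -- e.g. a positive constant after
    // inlining, which is the case raised above.
    public static int RemainderSigned(int x) => x % 16;
}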
Vector256<byte> comparison = Vector256.Create(value);
do
{
    Vector256<byte> search = Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref searchSpace, index));
Why ReadUnaligned rather than LoadVector256? Is it to avoid pinning?
No pinning here... :)
It's over raw byte data, which is generally the longest type of data, so it's probably best not to get in the way of the GC if it wants to compact the heap, rather than create a giant fixed block.
Hmmm, the initial assumption was that HWIntrinsics would likely be used with longer inputs and, in those cases, that pinning would be desirable (so alignment, cache-coherency, etc. could be preserved).
If that isn't the case, we might want to revisit whether or not we also expose Load overloads that take ref T rather than just T*.
That's probably still the case for directly using them; but everything is getting plumbed through the SpanHelpers methods, so they are generic methods for all data sizes. e.g. String, Array, Span, ReadOnlySpan and CompareInfo all use SpanHelpers now rather than their own implementations.
Fixing the data would be required to ensure that the alignment check remains valid.
The user can pin the data if they want the alignment to remain fixed (though they may not know to do so); Kestrel's arrays are already pre-pinned so a second pin here would add no advantage, native memory pinning here would add no advantage, and for lengths < 32 pinning would add no advantage; but in all those cases it would be extra unnecessary work to fix the data (GC considerations aside).
Is there a significant performance benefit to using LoadVector256 over ReadUnaligned here?
Otherwise the unalignment issue will only affect the method if the GC moves the data; and then, well, you just had a GC, so that will probably have more impact than being misaligned afterwards; and it's also likely a low-frequency event vs the number of times these methods are called.
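For reference, the two load shapes being weighed here look roughly like this (an illustrative sketch; the helper and class names are made up for the example):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class LoadStyleSketch
{
    // Pinning-free shape: read 32 bytes through a managed ref. The GC stays
    // free to move the array; the read is simply treated as potentially
    // unaligned.
    public static Vector256<byte> LoadViaRef(ref byte searchSpace, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref searchSpace, offset));

    // Pointer-based shape: requires the data to be pinned (or be native
    // memory) so the address stays valid for the duration of the call.
    public static unsafe Vector256<byte> LoadViaPointer(byte* searchSpace, int offset) =>
        Avx.LoadVector256(searchSpace + offset);
}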
What would the index argument of the API be? We do not have native int public type yet...
What would the index argument of the API be? We do not have native int public type yet...
Presumably just IntPtr, as the existing AddByteOffset method takes: public static ref T AddByteOffset<T>(ref T source, IntPtr byteOffset).
This emits an IL signature using native int and works correctly with languages (like F#) that treat IntPtr as a native-sized integer. This should also work with C#, unless they decide to create a new wrapper type (rather than use partial erasure) or unless they decide that the nint type should have a modreq/modopt.
This should also work with C#, unless they decide to create a new wrapper type
I do not think there was a decision made on this one yet.
Right, what C# does is still pending further language discussion. However, we have already exposed a number of similar APIs that take IntPtr (mostly on S.R.CS.Unsafe), and the worst-case scenario is that we may want to expose an additional convenience overload (likely just implemented in managed code) that casts from whatever C# uses to IntPtr.
It could also just be internal for the time being, to make our own code more readable.
Added local LoadVector256, LoadVector128, LoadVector and LoadUIntPtr helpers to see how it looks; makes it more readable.
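Those local helpers would presumably be thin wrappers along these lines (a sketch of the shape, assuming the ref + IntPtr-offset convention discussed above, rather than the PR's exact code):

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class SpanHelpersSketch
{
    // Wrappers so the hot loops read as "load a vector at this offset"
    // instead of spelling out the Unsafe plumbing each time.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<byte> LoadVector128(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector256<byte> LoadVector256(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector<byte> LoadVector(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<Vector<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static UIntPtr LoadUIntPtr(ref byte start, IntPtr offset) =>
        Unsafe.ReadUnaligned<UIntPtr>(ref Unsafe.AddByteOffset(ref start, offset));
}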
I would second the suggestions to factor out some of the common code. Otherwise it LGTM overall.
I also second the suggestions to factor out common code, but there are some points you may need to pay attention to:
Force-pushed from 430a825 to 78e1f89
Skipped the bounds check in the software fallback (acdd8e9); I don't see any other changes to make. Any feedback?
LGTM
LGTM, just one question.
Found7:
return (int)(byte*)(index + 7);
return (int)(byte*)(offset + 7);
Can we use a variable to avoid the multiple goto targets? For example
if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + i))
{
    delta = i;
    goto Found;
}
...
Found:
return (int)(byte*)(offset + delta);
Not sure; it pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?
Currently these sections look like this:
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG14
lea r11, [r9+2]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG15
lea r11, [r9+3]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG16
lea r11, [r9+4]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG17
lea r11, [r9+5]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG18
lea r11, [r9+6]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG19
lea r11, [r9+7]
movzx r11, byte ptr [rcx+r11]
cmp r11d, eax
je G_M10207_IG20
pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?
Hmm, I always prefer smaller code-size over "tricky" loop-unrolling (if the loop body does not have long-latency instructions). Sometimes loop-unrolling can show slightly better perf data on microbenchmarks, but I believe the I-cache is the most precious resource for large real applications (e.g., ASP.NET servers).
Additionally, if a loop body is small enough, it gets specially optimized on Intel architectures (please see the "Loop Stream Detector" sections in the Intel optimization manual). I think most of the fast-chain loops could be really small.
For completeness, the exit points are:
G_M10207_IG11:
mov eax, -1
G_M10207_IG12:
vzeroupper
ret
G_M10207_IG13:
mov eax, r9d
jmp SHORT G_M10207_IG21
G_M10207_IG14:
lea rax, [r9+1]
jmp SHORT G_M10207_IG21
G_M10207_IG15:
lea rax, [r9+2]
jmp SHORT G_M10207_IG21
G_M10207_IG16:
lea rax, [r9+3]
jmp SHORT G_M10207_IG21
G_M10207_IG17:
lea rax, [r9+4]
jmp SHORT G_M10207_IG21
G_M10207_IG18:
lea rax, [r9+5]
jmp SHORT G_M10207_IG21
G_M10207_IG19:
lea rax, [r9+6]
jmp SHORT G_M10207_IG21
G_M10207_IG20:
lea rax, [r9+7]
G_M10207_IG21:
vzeroupper
ret
A follow-up turning them into regular for loops might be worth investigating then?
The current microbenchmark-oriented performance culture in the .NET Core repos favors bigger streamlined code because it tends to produce better results in microbenchmarks. It is common wisdom (at Microsoft at least) that smaller code runs faster in real workloads because of the factors @fiigii mentioned. The code-bloating optimizations are worth it in maybe 1% of the cases. You have seen me pushing back on the more extreme cases of code bloat. This case is on the edge. My gut feel is that this one is probably good as it is. We would need to be able to get data about performance in real workloads to tell with confidence.
A follow-up turning them into regular for loops might be worth investigating then?
Yes, that should be follow-up work, not in this PR.
Definitely worth revisiting; lots of remarks amount to "the CPU might not be happy with 8 conditional branches in quick succession", e.g.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four
branches in a 16-byte chunk.
Though not sure how you could; here the jumps are 6 bytes and the compares are 4 bytes... it would be a push to get 4 compares and 4 jumps into 16 bytes.
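To make the suggested follow-up concrete, the unrolled compare/goto tail could in principle collapse into a small loop like the sketch below (illustrative only; whether it actually wins over the unrolled chain is exactly what would need measuring):

using System;
using System.Runtime.CompilerServices;

internal static class CompactTailSketch
{
    // Compact replacement for the 8-way unrolled compare chain: one small
    // loop body, one branch target, smaller I-cache footprint.
    public static int IndexOfTail(ref byte searchSpace, byte value, IntPtr offset, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (value == Unsafe.AddByteOffset(ref searchSpace, offset + i))
                return (int)offset + i;
        }

        return -1;
    }
}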
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
coreclr-ci passed, first time I've seen that happen! 😃
* Speedup SpanHelpers.IndexOf(byte)
* 128 * 2 alignment
* Move TrailingZeroCountFallback to common SpanHelpers so it can be used by other types than byte
* Speedup SpanHelpers.IndexOfAny(byte, ...)
* Indent for support flags
* More helpers, consistency in local names/formatting, feedback
* Skip bounds check in software fallback

Signed-off-by: dotnet-bot <[email protected]>
Commit migrated from dotnet/coreclr@07d1e6b
Learnings from #22019; improvement on #21073
Applied to
MoveMask/vpmovmskb used for equality in the vectorized path has already flagged the bits that match; so if we use the specific intrinsics rather than the generic Vector, we just need to determine which bit is set, rather than do further processing to determine the element offset.
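To illustrate the core idea, a simplified sketch of the AVX2 path might look like the following (assuming Avx2.IsSupported; the class and helper names are illustrative, and the real code additionally handles alignment, the Vector128/Vector<T> fallbacks, and the unrolled scalar tail):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class MoveMaskSketch
{
    // Compare 32 bytes at a time; Avx2.MoveMask already yields one bit per
    // matching element, so the element offset is just the position of the
    // lowest set bit -- no further per-element processing is needed.
    public static int IndexOf(ref byte searchSpace, byte value, int length)
    {
        Vector256<byte> comparison = Vector256.Create(value);
        int offset = 0;

        while (length - offset >= Vector256<byte>.Count)
        {
            Vector256<byte> search = Unsafe.ReadUnaligned<Vector256<byte>>(
                ref Unsafe.AddByteOffset(ref searchSpace, (IntPtr)offset));

            int matches = Avx2.MoveMask(Avx2.CompareEqual(search, comparison));
            if (matches != 0)
                return offset + TrailingZeroCount(matches);

            offset += Vector256<byte>.Count;
        }

        // Scalar tail for the remaining < 32 bytes.
        for (; offset < length; offset++)
        {
            if (value == Unsafe.AddByteOffset(ref searchSpace, (IntPtr)offset))
                return offset;
        }

        return -1;
    }

    // Software fallback for counting trailing zero bits; a hypothetical
    // stand-in for the TrailingZeroCountFallback helper the PR moves into
    // the shared SpanHelpers.
    private static int TrailingZeroCount(int match)
    {
        uint bits = (uint)match;
        int count = 0;
        while ((bits & 1) == 0)
        {
            bits >>= 1;
            count++;
        }
        return count;
    }
}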
Performance measurements:
Array length 512: significant improvements (up to more than double) with the found item in any position. #22118 (comment)
Array lengths >= 32 with the item in the last position: significant improvements (up to more than double). #22118 (comment)
/cc @CarolEidt @fiigii @tannergooding @ahsonkhan