This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Intrinsicify SpanHelpers.IndexOf{Any}(byte, ...) #22118

Merged
merged 12 commits into dotnet:master from SpanHelpers.IndexOf
Jan 24, 2019

Conversation

benaadams
Member

@benaadams benaadams commented Jan 21, 2019

Learnings from #22019; improvement on #21073

Applied to

internal static partial class SpanHelpers
{
    static int IndexOf(ref byte searchSpace, byte value, int length);
    static int IndexOfAny(ref byte searchSpace, byte value0, byte value1, int length);
    static int IndexOfAny(ref byte searchSpace, byte value0, byte value1, byte value2, int length);
}

MoveMask/vpmovmskb, already used for equality in the vectorized path, has flagged the bits that match; so if we use the specific intrinsics rather than the generic Vector path, we just need to determine which bit is set, rather than do further processing to determine the element offset.

-; Total bytes of code 597, prolog size 5 for method SpanHelpers:IndexOf(byref,ubyte,int):int
+; Total bytes of code 549, prolog size 5 for method SpanHelpers:IndexOf(byref,ubyte,int):int
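As an illustration (not the exact PR code — the PR uses a software TrailingZeroCountFallback, since System.Numerics.BitOperations did not exist at the time), the vectorized match can be sketched as:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MoveMaskSketch
{
    // Finds the first byte equal to `value` in a 32-byte block, or -1.
    public static unsafe int IndexOfIn32(byte* block, byte value)
    {
        Vector256<byte> comparison = Vector256.Create(value);
        Vector256<byte> search = Avx.LoadVector256(block);

        // MoveMask (vpmovmskb) packs each lane's compare result into one bit:
        // bit i of `matches` is set iff block[i] == value.
        int matches = Avx2.MoveMask(Avx2.CompareEqual(search, comparison));
        if (matches == 0)
            return -1;

        // The element offset is simply the position of the lowest set bit;
        // no further per-element processing of the vector is needed.
        return BitOperations.TrailingZeroCount(matches);
    }
}
```

With the generic `Vector<T>` path, locating the matching element after an equality hit requires extra scalar work per chunk; `vpmovmskb` has already encoded the answer in the mask.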

Performance measurements:

Array length 512: significant improvements (up to more than double) for a found item in any position. #22118 (comment)

Array lengths >= 32 with the item in the last position: significant improvements (up to more than double). #22118 (comment)

/cc @CarolEidt @fiigii @tannergooding @ahsonkhan

@benaadams
Member Author

Working on perf numbers

@benaadams
Member Author

Some interesting speed bumps in there; might be able to do better

@benaadams
Member Author

benaadams commented Jan 21, 2019

Array length 512

 Method   | Pos |    Current |         PR | % Improvement |
----------|-----|------------|------------|---------------|
  IndexOf |   0 |   6.403 ns |   4.461 ns |        +43.5% |
  IndexOf |   1 |   6.256 ns |   4.516 ns |        +38.5% |
  IndexOf |   2 |   6.192 ns |   4.494 ns |        +37.7% |
  IndexOf |   3 |   6.425 ns |   4.460 ns |        +44.0% |
  IndexOf |   4 |   6.394 ns |   4.420 ns |        +44.6% |
  IndexOf |   7 |   6.176 ns |   4.424 ns |        +39.6% |
  IndexOf |   8 |   6.265 ns |   4.464 ns |        +40.3% |
  IndexOf |   9 |   6.202 ns |   4.564 ns |        +35.8% |
  IndexOf |  10 |   6.274 ns |   4.458 ns |        +40.7% |
  IndexOf |  11 |   6.198 ns |   4.607 ns |        +34.5% |
  IndexOf |  12 |   6.537 ns |   4.469 ns |        +46.2% |
  IndexOf |  13 |   7.124 ns |   4.421 ns |        +61.1% |
  IndexOf |  14 |   7.530 ns |   4.429 ns |        +70.0% |
  IndexOf |  15 |   7.055 ns |   4.401 ns |        +60.3% |
  IndexOf |  16 |   9.521 ns |   4.421 ns |       +115.3% |
  IndexOf |  17 |   9.510 ns |   4.863 ns |        +95.5% |
  IndexOf |  18 |   9.452 ns |   4.531 ns |       +108.6% |
  IndexOf |  19 |   9.518 ns |   4.535 ns |       +109.8% |
  IndexOf |  20 |   9.511 ns |   4.499 ns |       +111.4% |
  IndexOf |  21 |   9.501 ns |   4.322 ns |       +119.8% |
  IndexOf |  22 |   9.458 ns |   4.418 ns |       +114.0% |
  IndexOf |  23 |   9.278 ns |   4.579 ns |       +102.6% |
  IndexOf |  24 |   9.904 ns |   4.637 ns |       +113.5% |
  IndexOf |  25 |   9.697 ns |   4.667 ns |       +107.7% |
  IndexOf |  26 |   9.720 ns |   4.459 ns |       +117.9% |
  IndexOf |  27 |   9.819 ns |   4.380 ns |       +124.1% |
  IndexOf |  28 |   9.795 ns |   4.520 ns |       +116.7% |
  IndexOf |  29 |   9.584 ns |   4.628 ns |       +107.0% |
  IndexOf |  30 |   9.751 ns |   4.457 ns |       +118.7% |
  IndexOf |  31 |   9.778 ns |   4.520 ns |       +116.3% |
  IndexOf |  32 |  10.110 ns |   6.077 ns |        +66.3% |
  IndexOf |  33 |   9.933 ns |   6.129 ns |        +62.0% |
  IndexOf |  34 |   9.922 ns |   6.183 ns |        +60.4% |
  IndexOf |  35 |  10.034 ns |   6.155 ns |        +63.0% |
  IndexOf |  36 |   9.994 ns |   5.244 ns |        +90.5% |
  IndexOf |  37 |  10.059 ns |   6.211 ns |        +61.9% |
  IndexOf |  38 |  10.101 ns |   5.085 ns |        +98.6% |
  IndexOf |  39 |  10.096 ns |   6.255 ns |        +61.4% |
  IndexOf |  40 |  10.147 ns |   5.560 ns |        +82.5% |
  IndexOf |  41 |  10.449 ns |   6.150 ns |        +69.9% |
  IndexOf |  42 |  10.161 ns |   5.542 ns |        +83.3% |
  IndexOf |  43 |  10.296 ns |   5.075 ns |       +102.8% |
  IndexOf |  44 |  10.240 ns |   4.993 ns |       +105.0% |
  IndexOf |  45 |  10.385 ns |   4.956 ns |       +109.5% |
  IndexOf |  46 |  10.246 ns |   6.147 ns |        +66.6% |
  IndexOf |  47 |  10.427 ns |   4.988 ns |       +109.0% |
  IndexOf |  48 |  10.877 ns |   6.229 ns |        +74.6% |
  IndexOf |  49 |  10.688 ns |   6.090 ns |        +75.5% |
  IndexOf |  50 |  10.895 ns |   6.019 ns |        +81.0% |
  IndexOf |  51 |  10.919 ns |   6.206 ns |        +75.9% |
  IndexOf |  52 |  10.591 ns |   6.012 ns |        +76.1% |
  IndexOf |  53 |  10.772 ns |   6.232 ns |        +72.8% |
  IndexOf |  54 |  10.760 ns |   6.212 ns |        +73.2% |
  IndexOf |  55 |  10.898 ns |   5.241 ns |       +107.9% |
  IndexOf |  56 |  10.776 ns |   4.983 ns |       +116.2% |
  IndexOf |  57 |  11.472 ns |   6.150 ns |        +86.5% |
  IndexOf |  58 |  10.756 ns |   7.093 ns |        +51.6% |
  IndexOf |  59 |  10.946 ns |   4.891 ns |       +123.7% |
  IndexOf |  60 |  10.745 ns |   5.260 ns |       +104.2% |
  IndexOf |  61 |  10.889 ns |   6.158 ns |        +76.8% |
  IndexOf |  62 |  10.755 ns |   6.050 ns |        +77.7% |
  IndexOf |  63 |  10.777 ns |   4.971 ns |       +116.7% |
  IndexOf |  64 |  11.007 ns |   5.782 ns |        +90.3% |
  IndexOf |  65 |  11.187 ns |   6.599 ns |        +69.5% |
  IndexOf |  66 |  11.009 ns |   5.690 ns |        +93.4% |
  IndexOf |  67 |  11.129 ns |   5.653 ns |        +96.8% |
  IndexOf |  68 |  11.129 ns |   5.775 ns |        +92.7% |
  IndexOf |  69 |  10.839 ns |   6.633 ns |        +63.4% |
  IndexOf |  70 |  11.207 ns |   5.734 ns |        +95.4% |
  IndexOf |  71 |  11.139 ns |   5.611 ns |        +98.5% |
  IndexOf |  72 |  11.345 ns |   5.701 ns |        +99.0% |
  IndexOf |  73 |  11.559 ns |   6.728 ns |        +71.8% |
  IndexOf |  74 |  11.396 ns |   5.614 ns |       +102.9% |
  IndexOf |  75 |  11.271 ns |   6.672 ns |        +68.9% |
  IndexOf |  76 |  11.282 ns |   6.741 ns |        +67.3% |
  IndexOf |  77 |  11.889 ns |   5.737 ns |       +107.2% |
  IndexOf |  78 |  11.292 ns |   6.755 ns |        +67.1% |
  IndexOf |  79 |  11.417 ns |   6.746 ns |        +69.2% |
  IndexOf |  80 |  11.843 ns |   5.734 ns |       +106.5% |
  IndexOf |  81 |  11.679 ns |   6.817 ns |        +71.3% |
  IndexOf |  82 |  11.755 ns |   5.670 ns |       +107.3% |
  IndexOf |  83 |  11.757 ns |   5.736 ns |       +104.9% |
  IndexOf |  84 |  11.785 ns |   5.743 ns |       +105.2% |
  IndexOf |  85 |  11.625 ns |   6.828 ns |        +70.2% |
  IndexOf |  86 |  11.900 ns |   6.753 ns |        +76.2% |
  IndexOf | 126 |  10.102 ns |   6.467 ns |        +56.2% |
  IndexOf | 127 |  10.030 ns |   6.393 ns |        +56.8% |
  IndexOf | 128 |  10.144 ns |   7.130 ns |        +42.2% |
  IndexOf | 129 |  10.342 ns |   7.978 ns |        +29.6% |
  IndexOf | 130 |  10.213 ns |   8.016 ns |        +27.4% |
  IndexOf | 131 |  10.170 ns |   7.112 ns |        +42.9% |
  IndexOf | 250 |  13.517 ns |   9.686 ns |        +39.5% |
  IndexOf | 251 |  13.416 ns |   9.778 ns |        +37.2% |
  IndexOf | 252 |  13.473 ns |   7.693 ns |        +75.1% |
  IndexOf | 253 |  13.629 ns |   9.824 ns |        +38.7% |
  IndexOf | 254 |  13.355 ns |   9.702 ns |        +37.6% |
  IndexOf | 255 |  26.023 ns |  22.202 ns |        +17.2% |
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;
    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        0, 1, 2, 3, 4, 7, 
        8, 9, 10, 11, 12, 13, 14, 15, 
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Position { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)Position;
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }

        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        var e = Enumerable.Range(0, 255).Select(b => (byte)b);
        s_source = e.Concat(e).ToArray();
    }
}

@benaadams
Member Author

benaadams commented Jan 21, 2019

Array length = Length, position = length - 1 (asm for lengths < 32 is identical)

        | Length |   Current |        PR |  Change |
--------|--------|-----------|-----------|---------| 
IndexOf |     32 | 10.656 ns |  4.341 ns | +145.4% |
IndexOf |     33 | 11.519 ns |  6.885 ns |  +67.3% |
IndexOf |     34 | 11.598 ns |  7.919 ns |  +46.4% |
IndexOf |     35 | 12.137 ns |  9.070 ns |  +33.8% |
IndexOf |     36 | 11.576 ns |  6.869 ns |  +68.5% |
IndexOf |     37 | 12.408 ns |  7.759 ns |  +59.9% |
IndexOf |     38 | 12.698 ns |  8.847 ns |  +43.5% |
IndexOf |     39 | 13.556 ns |  9.545 ns |  +42.0% |
IndexOf |     40 | 12.697 ns |  7.737 ns |  +64.1% |
IndexOf |     41 | 13.432 ns |  8.837 ns |  +51.9% |
IndexOf |     42 | 13.722 ns |  9.884 ns |  +38.8% |
IndexOf |     43 | 14.293 ns | 10.869 ns |  +31.5% |
IndexOf |     44 | 13.778 ns |  8.467 ns |  +62.7% |
IndexOf |     45 | 14.098 ns |  9.696 ns |  +45.4% |
IndexOf |     46 | 14.577 ns | 10.531 ns |  +38.4% |
IndexOf |     47 | 14.968 ns | 11.591 ns |  +29.1% |
IndexOf |     48 | 14.570 ns |  9.746 ns |  +49.4% |
IndexOf |     49 | 15.488 ns | 10.501 ns |  +47.4% |
IndexOf |     50 | 15.768 ns | 11.235 ns |  +40.3% |
IndexOf |     51 | 15.911 ns | 12.203 ns |  +30.3% |
IndexOf |     52 | 15.725 ns | 11.275 ns |  +39.4% |
IndexOf |     53 | 16.409 ns | 11.763 ns |  +39.4% |
IndexOf |     54 | 16.784 ns | 12.757 ns |  +31.5% |
IndexOf |     55 | 17.012 ns | 13.623 ns |  +24.8% |
IndexOf |     56 | 16.614 ns | 11.852 ns |  +40.1% |
IndexOf |     57 | 17.657 ns | 12.829 ns |  +37.6% |
IndexOf |     58 | 17.540 ns | 12.969 ns |  +35.2% |
IndexOf |     59 | 18.289 ns | 13.645 ns |  +34.0% |
IndexOf |     60 | 17.954 ns | 13.319 ns |  +34.7% |
IndexOf |     61 | 18.590 ns | 13.571 ns |  +36.9% |
IndexOf |     62 | 18.875 ns | 14.356 ns |  +31.4% |
IndexOf |     63 | 19.249 ns | 15.241 ns |  +26.2% |
IndexOf |     64 | 13.423 ns |  5.675 ns | +136.5% |
IndexOf |     65 | 14.752 ns |  7.573 ns |  +94.7% |
IndexOf |     66 | 17.021 ns |  8.455 ns | +101.3% |
IndexOf |     67 | 17.204 ns |  9.857 ns |  +74.5% |
IndexOf |     68 | 14.591 ns |  7.044 ns | +107.1% |
IndexOf |     69 | 16.171 ns |  8.305 ns |  +94.7% |
IndexOf |     70 | 17.683 ns | 10.001 ns |  +76.8% |
IndexOf |     71 | 17.657 ns |  9.873 ns |  +78.8% |
IndexOf |     72 | 15.210 ns |  7.629 ns |  +99.3% |
IndexOf |     73 | 15.957 ns |  9.059 ns |  +76.1% |
IndexOf |     74 | 17.029 ns | 10.633 ns |  +60.1% |
IndexOf |     75 | 18.380 ns | 11.072 ns |  +66.0% |
IndexOf |     76 | 16.562 ns |  8.931 ns |  +85.4% |
IndexOf |     77 | 17.548 ns |  9.685 ns |  +81.1% |
IndexOf |     78 | 18.575 ns | 10.918 ns |  +70.1% |
IndexOf |     79 | 19.531 ns | 11.723 ns |  +66.6% |
IndexOf |     80 | 11.346 ns |  9.681 ns |  +17.1% |
IndexOf |     81 | 12.372 ns | 11.069 ns |  +11.7% |
IndexOf |     82 | 15.016 ns | 11.506 ns |  +30.5% |
IndexOf |     83 | 14.934 ns | 13.004 ns |  +14.8% |
IndexOf |     84 | 11.788 ns | 10.907 ns |   +8.0% |
IndexOf |     85 | 13.532 ns | 11.802 ns |  +14.6% |
IndexOf |     86 | 15.145 ns | 13.030 ns |  +16.2% |
IndexOf |    126 | 17.694 ns | 15.026 ns |  +17.7% |
IndexOf |    127 | 17.632 ns | 15.711 ns |  +12.2% |
IndexOf |    128 | 14.848 ns |  6.428 ns | +130.9% |
IndexOf |    129 | 16.267 ns |  8.638 ns |  +88.3% |
IndexOf |    130 | 19.239 ns | 10.284 ns |  +87.0% |
IndexOf |    131 | 19.029 ns | 10.843 ns |  +75.4% |
IndexOf |    250 | 19.764 ns | 16.818 ns |  +17.5% |
IndexOf |    251 | 21.105 ns | 16.881 ns |  +25.0% |
IndexOf |    252 | 17.472 ns | 15.944 ns |   +9.5% |
IndexOf |    253 | 19.112 ns | 16.290 ns |  +17.3% |
IndexOf |    254 | 20.948 ns | 17.231 ns |  +21.5% |
IndexOf |    255 | 21.242 ns | 17.344 ns |  +22.4% |
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    private static byte[] s_source;
    private const int Iters = 100;

    public static void Main(string[] args)
    {
        var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [Params(
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 
        64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86,
        126, 127, 128, 129, 130, 131,
        250, 251, 252, 253, 254, 255)]
    public int Length { get; set; }

    [Benchmark(OperationsPerInvoke = Iters)]
    public int IndexOf()
    {
        int total = 0;
        byte value = (byte)(Length - 1);
        ReadOnlySpan<byte> span = s_source;
        for (int i = 0; i < Iters; i++)
        {
            total += span.IndexOf(value);
        }

        return total;
    }

    [GlobalSetup]
    public void Setup()
    {
        s_source = Enumerable.Range(0, Length).Select(b => (byte)b).ToArray();
    }
}

@benaadams
Member Author

Not sure if we should be worried about coreclr-ci yet?

Mostly it seems good, other than what looks to be an infra issue:

Test Pri0 Windows_NT x64 checked Job

tools\Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(87,5): error : 
RestApiException: An Unexpected error occured when processing the request. [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Client.WorkItem.ListInternalAsync(String job, CancellationToken cancellationToken) in /_/src/Microsoft.DotNet.Helix/Client/CSharp/generated-code/WorkItem.cs:line 89 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Client.WorkItem.ListAsync(String job, CancellationToken cancellationToken) in /_/src/Microsoft.DotNet.Helix/Client/CSharp/generated-code/WorkItem.cs:line 47 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Client.HelixApi.RetryAsync[T](Func`1 function, Action`1 logRetry, Func`2 isRetryable) [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Sdk.HelixWait.WaitForHelixJobAsync(String jobName) in /_/src/Microsoft.DotNet.Helix/Sdk/HelixWait.cs:line 70 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Sdk.HelixWait.ExecuteCore() in /_/src/Microsoft.DotNet.Helix/Sdk/HelixWait.cs:line 60 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
  at Microsoft.DotNet.Helix.Sdk.HelixTask.Execute() in /_/src/Microsoft.DotNet.Helix/Sdk/HelixTask.cs:line 43 [F:\vsagent\11\s\tests\helixpublishwitharcade.proj]
[F:\vsagent\11\s\tests\helixpublishwitharcade.proj]

@benaadams benaadams force-pushed the SpanHelpers.IndexOf branch from e2177ec to 497cb8c on January 22, 2019 02:21
So it can be used by other types than byte
@benaadams
Member Author

Same change would work for .IndexOf(char), .IndexOfAny(byte,...), .IndexOfAny(char,...) and .SequenceCompareTo(...)

@benaadams
Member Author

Added .IndexOfAny(byte,...)

@benaadams benaadams changed the title Speedup SpanHelpers.IndexOf(byte) Speedup SpanHelpers.IndexOf{Any}(byte, ...) Jan 22, 2019
@jkotas
Member

jkotas commented Jan 22, 2019

@fiigii @tannergooding Could you please take a look?

@@ -199,10 +200,22 @@ public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)
IntPtr index = (IntPtr)0; // Use IntPtr for arithmetic to avoid unnecessary 64->32->64 truncations
Member

Do we know why this code is currently using IntPtr rather than the:

#if BIT64
using nint = System.Int64;
#else
using nint = System.Int32;
#endif

code that everywhere else seems to prefer?

Member

This is left-over from times when this lived in CoreFX and it was not compiled bitness-specific. It would be good for readability to switch this over to nuint.

Member Author

It's a bit messy to clean that up in this PR; will do a follow-up.

nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
if (length >= Vector128<byte>.Count * 2)
{
int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector128<byte>.Count - 1);
Member

It would be really nice if the JIT would just optimize Unsafe.AsPointer(ref searchSpace) % Vector128<byte>.Count, or if we had a helper function for this type of code.


@tannergooding - is there an open issue for this?

Member

Not sure, I'll take a look in a little bit (and will log one if we don't). It's probably worth noting that this optimization is only applicable for unsigned types (IIRC).

Member

Looks like the JIT already does this optimization for unsigned types. This is possibly just something that is normally missed due to most inputs being signed by default.

Member Author

Does it do it for an int that becomes a const when inlined (as would be the case here)?

Member

I'll test that case after lunch. It seems like it would be valid to do when the constant is known to be positive.
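For reference, the optimization under discussion is replacing a modulo by a power-of-two constant with an AND mask, which is only valid when the operand is unsigned (or known non-negative). A small illustrative demo:

```csharp
using System;

class AlignmentMaskDemo
{
    static void Main()
    {
        // For a power-of-two divisor n, unsigned x % n == x & (n - 1),
        // which is the cheap form used for the 16-byte alignment check.
        for (uint addr = 0; addr < 64; addr++)
        {
            if (addr % 16 != (addr & 15))
                throw new Exception("mismatch");
        }

        // For signed values the identity breaks: -1 % 16 is -1 in C#,
        // but -1 & 15 is 15 — which is why the JIT can only substitute
        // the mask when the operand is known to be non-negative.
        Console.WriteLine($"{-1 % 16} vs {-1 & 15}"); // -1 vs 15
    }
}
```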

Vector256<byte> comparison = Vector256.Create(value);
do
{
Vector256<byte> search = Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.AddByteOffset(ref searchSpace, index));
Member

Why ReadUnaligned rather than LoadVector256? Is it to avoid pinning?

Member Author

No pinning here... :)

It's over raw byte data, which is generally the longest type of data, so it's probably best not to get in the way of the GC if it wants to compact the heap, rather than create a giant fixed block.

Member

Hmmm, the initial assumption was that HWIntrinsics would likely be used with longer inputs and, in those cases, that pinning would be desirable (so alignment, cache-coherency, etc could be preserved).

If that isn't the case, we might want to revisit whether or not we also expose Load overloads that take ref T rather than just T*.

Member Author

@benaadams benaadams Jan 22, 2019

That's probably still the case for directly using them; but everything is getting plumbed through the SpanHelpers methods, so they are generic methods for all data sizes. e.g. String, Array, Span, ReadOnlySpan and CompareInfo all use SpanHelpers now rather than their own implementations.

Member Author

Fixing the data would be required to ensure that the alignment check remains valid.

The user can pin the data if they want the alignment to remain fixed (though they may not know to do so); Kestrel's arrays are already pre-pinned so a second pin here would add no advantage, native memory pinning here would add no advantage, and for lengths < 32 pinning would add no advantage; in all those cases it would be extra unnecessary work to fix the data (GC considerations aside).

Is there a significant performance benefit to using LoadVector256 over ReadUnaligned here?

Otherwise the misalignment issue will only affect the method if the GC moves the data; and then, well, you just had a GC, so that will probably have more impact than being misaligned afterwards; it's also likely a low-frequency event vs the number of times these methods are called.
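For context, the two load styles being compared look roughly like this (a minimal sketch; the method names are mine, not the PR's):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class LoadStyles
{
    // Pinning-free: operates on a managed ref, so the GC remains free to move
    // (and compact around) the underlying array during the scan.
    public static Vector256<byte> LoadViaRef(ref byte start, IntPtr byteOffset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(
            ref Unsafe.AddByteOffset(ref start, byteOffset));

    // Pointer-based: requires the data to be pinned (or native) so the
    // pointer stays valid across the call.
    public static unsafe Vector256<byte> LoadViaPointer(byte* start, IntPtr byteOffset) =>
        Avx.LoadVector256(start + (long)byteOffset);
}
```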

Member

What would the index argument of the API be? We do not have native int public type yet...

Member

What would the index argument of the API be? We do not have native int public type yet...

Presumably just IntPtr, as the existing AddByteOffset method takes: public static ref T AddByteOffset<T>(ref T source, IntPtr byteOffset).

This emits an IL signature using native int and works correctly with languages (like F#) that treat IntPtr as a native sized integer. This should also work with C#, unless they decide to create a new wrapper type (rather than use partial erasure) or unless they decide that the nint type should have a modreq/modopt.

Member

This should also work with C#, unless they decide to create a new wrapper type

I do not think there was a decision made on this one yet.

Member

Right, what C# does is still pending further language discussion. However, we have already exposed a number of similar APIs that take IntPtr (mostly on S.R.CS.Unsafe), and the worst case scenario is that we may want to expose an additional convenience overload (likely just implemented in managed code) that casts from whatever C# uses to IntPtr.

It could also just be internal for the time being, to make our own code more readable.

Member Author

@benaadams benaadams Jan 23, 2019

Added local LoadVector256, LoadVector128, LoadVector and LoadUIntPtr to see how it looks; makes it more readable.

@CarolEidt

I would second the suggestions to factor out some of the common code. Otherwise it LGTM overall.

@fiigii

fiigii commented Jan 22, 2019

I also second the suggestions to factor out common code, but there are some points you may need to pay attention to:

  1. It is probably okay to call into the helper functions without inlining, since the helper functions seem not to take vector parameters (which means no additional vector copies). This is good for code size (I-cache), but adds a little calling overhead. It is worthwhile to collect perf data (e.g., VTune) to verify.
  2. Please make sure the caller functions do not have vector variables living after the call-sites, to avoid caller-saving vector registers.

@benaadams benaadams force-pushed the SpanHelpers.IndexOf branch from 430a825 to 78e1f89 on January 23, 2019 02:55
@benaadams
Member Author

Skipped the bounds check in the software fallback acdd8e9; I don't see any other changes to make; any feedback?

Member

@tannergooding tannergooding left a comment

LGTM

@fiigii fiigii left a comment

LGTM, just one question.

Found7:
return (int)(byte*)(index + 7);
return (int)(byte*)(offset + 7);

Can we use a variable to avoid the multiple goto targets? For example

if (uValue == Unsafe.AddByteOffset(ref searchSpace, offset + i))
{
    delta = i;
    goto Found;
}

...

Found: 
    return (int)(byte*)(offset + delta);

Member Author

Not sure; that pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?

Currently these sections looks like this:

movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG14
lea      r11, [r9+2]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG15
lea      r11, [r9+3]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG16
lea      r11, [r9+4]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG17
lea      r11, [r9+5]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG18
lea      r11, [r9+6]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG19
lea      r11, [r9+7]
movzx    r11, byte  ptr [rcx+r11]
cmp      r11d, eax
je       G_M10207_IG20


pushes more code into the fast chain (running through the compares) to avoid more in the slow area (the targets)?

Hmm, I always prefer smaller code size over "tricky" loop unrolling (if the loop body does not have long-latency instructions). Sometimes loop unrolling can show slightly better perf data in microbenchmarks, but I believe I-cache is the most precious resource for large real applications (e.g., ASP.NET servers).

Additionally, if a loop body is small enough, it gets specially optimized on Intel architectures (see the "Loop Stream Detector" sections in the Intel optimization manual). I think most of the fast-chain loops could be really small.

Member Author

@benaadams benaadams Jan 23, 2019

For completeness the exit points are

G_M10207_IG11:
   mov      eax, -1
G_M10207_IG12:
   vzeroupper 
   ret  
G_M10207_IG13:
   mov      eax, r9d
   jmp      SHORT G_M10207_IG21
G_M10207_IG14:
   lea      rax, [r9+1]
   jmp      SHORT G_M10207_IG21
G_M10207_IG15:
   lea      rax, [r9+2]
   jmp      SHORT G_M10207_IG21
G_M10207_IG16:
   lea      rax, [r9+3]
   jmp      SHORT G_M10207_IG21
G_M10207_IG17:
   lea      rax, [r9+4]
   jmp      SHORT G_M10207_IG21
G_M10207_IG18:
   lea      rax, [r9+5]
   jmp      SHORT G_M10207_IG21
G_M10207_IG19:
   lea      rax, [r9+6]
   jmp      SHORT G_M10207_IG21
G_M10207_IG20:
   lea      rax, [r9+7]
G_M10207_IG21:
   vzeroupper 
   ret      

Member Author

@benaadams benaadams Jan 23, 2019

A follow up turning them to regular for loops might be worth investigating then?

Member

The current micro-benchmark-oriented performance culture in the .NET Core repos favors bigger streamlined code because it tends to produce better results in microbenchmarks. It is common wisdom (at Microsoft at least) that smaller code runs faster in real workloads because of the factors @fiigii mentioned. The code-bloating optimizations are worth it in maybe 1% of cases. You have seen me pushing back on the more extreme cases of code bloat. This case is on the edge; my gut feel is that this one is probably good as it is. We would need data about performance in real workloads to tell with confidence.


A follow up turning them to regular for loops might be worth investigating then?

Yes, that should be a follow-up work, not in this PR.

Member Author

Definitely worth revisiting; there are lots of remarks that amount to "the CPU might not be happy with 8 conditional branches in quick succession", e.g.

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four
branches in a 16-byte chunk.

Member Author

Though I'm not sure how you could; here the jumps are 6 bytes and the compares are 4 bytes... it would be a push to get 4 compares and 4 jumps into 16 bytes.

@benaadams
Member Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@benaadams
Member Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@benaadams
Member Author

coreclr-ci passed, first time I've seen that happen! 😃

@jkotas jkotas merged commit 07d1e6b into dotnet:master Jan 24, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Jan 24, 2019
* Speedup SpanHelpers.IndexOf(byte)

* 128 * 2 alignment

* Move TrailingZeroCountFallback to common SpanHelpers

So it can be used by other types than byte

* Speedup SpanHelpers.IndexOfAny(byte, ...)

* Indent for support flags

* More helpers, consistency in local names/formatting, feedback

* Skip bounds check in software fallback

Signed-off-by: dotnet-bot <[email protected]>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Jan 24, 2019
marek-safar pushed a commit to mono/mono that referenced this pull request Jan 24, 2019
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Jan 24, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Jan 24, 2019
@benaadams benaadams deleted the SpanHelpers.IndexOf branch January 24, 2019 20:21
jkotas pushed a commit to dotnet/corert that referenced this pull request Jan 24, 2019
@benaadams benaadams changed the title Speedup SpanHelpers.IndexOf{Any}(byte, ...) Intrinsicify SpanHelpers.IndexOf{Any}(byte, ...) Feb 3, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Commit migrated from dotnet/coreclr@07d1e6b