Reduce worst-case alg complexity of MemoryExtensions.IndexOfAny #53652
Conversation
Tagging subscribers to this area: @GrabYourPitchforks, @carlossanlop

Issue Details

Based on a comment thread at #53115 (comment). In a nutshell, given a large enough needle set, the code pattern below may have O(n^2 * l) complexity, where n is the length of the haystack and l is the number of needles.

Span<T> haystack = GetHaystack();
Span<T> needles = GetNeedles();
while (true)
{
int idxOfFirstNeedle = haystack.IndexOfAny(needles);
if (idxOfFirstNeedle >= 0)
{
DoSomethingWith(haystack.Slice(0, idxOfFirstNeedle));
haystack = haystack.Slice(idxOfFirstNeedle + 1);
}
else
{
DoSomethingWith(haystack); // remainder
break;
}
}

This PR reduces the algorithmic complexity of this slice-and-loop pattern to worst-case O(n * l), which is what most callers probably expect. The downside is a higher constant factor, which will dominate when the haystack is small or when the first needle has a much higher probability of being found than any subsequent needle.

I'm intentionally keeping this PR simple; a future PR can introduce vectorization or any other optimization technique to help get some of this perf back. The unit tests introduced as part of this PR should help catch future regressions in this area.

I don't have perf numbers right now but should be able to share them by the end of the week if there's interest in seeing them. Per investigation in the previously linked thread, I don't expect this to impact typical libraries code, as MemoryExtensions.[Last]IndexOfAny already has logic to handle specific needle counts typical of perf-critical code paths.
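For intuition, here is a minimal sketch of the shape that yields the worst-case O(n * l) bound; the name IndexOfAnySketch is hypothetical, and this is not the actual SpanHelpers.T.cs code. The outer loop walks the haystack once and tests every needle at each position, so each call costs time proportional to the distance to the first match, and the calls in the slice-and-loop pattern sum to O(n * l).

using System;

static int IndexOfAnySketch<T>(ReadOnlySpan<T> haystack, ReadOnlySpan<T> needles)
    where T : IEquatable<T>
{
    // Visit each haystack position once, checking all l needles before moving on.
    for (int i = 0; i < haystack.Length; i++)
    {
        for (int j = 0; j < needles.Length; j++)
        {
            if (haystack[i].Equals(needles[j]))
            {
                return i; // stop at the first matching position
            }
        }
    }
    return -1;
}

By contrast, an implementation that runs a full IndexOf over the remaining haystack once per needle can pay O(m * l) per call, where m is the remaining haystack length, even when the first match is near the start; summed over the slicing loop, that is how the pattern degrades to O(n^2 * l).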
Reviewed files:

src/libraries/System.Memory/tests/Span/IndexOfAny.AlgorithmicComplexity.cs
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs
Thanks for doing this.
Edit: It's because …
Benchmark results

Full benchmark app: https://github.com/GrabYourPitchforks/ConsoleApplicationBenchmark/blob/471fa6822a5c5e5fecfc14196e54dbfe82a0f20c/ConsoleAppBenchmark/IndexOfAnyRunner.cs

[Benchmark]
public int SliceInALoop()
{
    var haystack = _haystack;
    _ = haystack.Length; // allow JIT to prove not null
    ReadOnlySpan<T> haystackSpan = haystack;

    var needles = _needles;
    _ = needles.Length; // allow JIT to prove not null
    ReadOnlySpan<T> needlesSpan = needles;

    while (true)
    {
        int idx = haystackSpan.IndexOfAny(needlesSpan);
        if (idx < 0)
        {
            return haystackSpan.Length; // length of final slice
        }
        haystackSpan = haystackSpan.Slice(idx + 1);
    }
}
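For reference, a [Benchmark] method like the one above is typically driven with BenchmarkDotNet; the entry point below is a sketch, and the Program class is an assumption rather than part of the linked benchmark app.

using BenchmarkDotNet.Running;

public class Program
{
    // Discovers every benchmark class in this assembly and lets the
    // command line select which benchmarks to run.
    public static void Main(string[] args)
        => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}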
Discussion

This benchmark tests the slice-and-loop pattern from the PR description. Don't read too deeply into the benchmark showing that the new …
// and when this is called in a loop, we want the entire loop to be bounded by O(n * l)
// rather than O(n^2 * l).

if (typeof(T).IsValueType)
@GrabYourPitchforks, for chars, should we do the same thing string.IndexOfAny does, using ProbabilisticMap? Then presumably string.IndexOfAny could just always delegate to MemoryExtensions.IndexOfAny rather than special-casing specific lengths?
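For readers unfamiliar with it, the idea behind a probabilistic map is a small bitmap filter built once from the needles, so that most non-matching haystack chars are rejected with a couple of bit tests before any full needle scan. The sketch below illustrates that idea only; CharBitmapSketch is a hypothetical type, not the real ProbabilisticMap internals.

using System;

// 256-bit filter over the needles. Queries can return false positives but
// never false negatives, so a "might contain" hit still needs a real
// needles.Contains check before reporting a match.
readonly struct CharBitmapSketch
{
    private readonly uint[] _bits; // 8 * 32 = 256 bits

    public CharBitmapSketch(ReadOnlySpan<char> needles)
    {
        _bits = new uint[8];
        foreach (char c in needles)
        {
            Set((byte)c);            // low byte of every needle
            if (c > 0xFF)
            {
                Set((byte)(c >> 8)); // high byte, for non-Latin-1 needles
            }
        }
    }

    private void Set(byte b) => _bits[b >> 5] |= 1u << (b & 31);
    private bool Test(byte b) => (_bits[b >> 5] & (1u << (b & 31))) != 0;

    // True means "might be a needle"; false means "definitely not a needle".
    public bool MightContain(char c)
        => Test((byte)c) && (c <= 0xFF || Test((byte)(c >> 8)));
}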
It also occurs to me that for really large sets, we could invert the vectorization, and for each character in the haystack, vectorize the searching of the needles. Presumably that would require a relatively large set of input characters, though, e.g. at least 8.
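As a rough sketch of that inversion (illustrative only; IndexOfAnyInverted is a hypothetical helper, and char data would need to be reinterpreted as ushort, e.g. via MemoryMarshal.Cast, since Vector<char> isn't supported):

using System;
using System.Numerics;

static class InvertedSearchSketch
{
    // Vectorize over the needles instead of the haystack: broadcast each
    // haystack element into a vector and compare it against the needles a
    // vector-width at a time. Only worthwhile for fairly large needle sets.
    public static int IndexOfAnyInverted(ReadOnlySpan<ushort> haystack, ReadOnlySpan<ushort> needles)
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            var current = new Vector<ushort>(haystack[i]); // broadcast one element
            int j = 0;
            for (; j <= needles.Length - Vector<ushort>.Count; j += Vector<ushort>.Count)
            {
                if (Vector.EqualsAny(current, new Vector<ushort>(needles.Slice(j))))
                {
                    return i;
                }
            }
            for (; j < needles.Length; j++) // scalar tail for leftover needles
            {
                if (haystack[i] == needles[j])
                {
                    return i;
                }
            }
        }
        return -1;
    }
}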