Vectorize `find_end`, make sure ASan passes #5042

AlexGuteniev · 2024-10-26T04:53:42Z

Re-try of #4943

Reverts #5041 and fixes the bug that triggered ASan.

🦠 The Bug

The overall structure

We do SSE4.2 checks that search for up to 16 bytes sequence in a 16 bytes sequence. Such search can result in a match if it is a full match, or if it is a match of the beginning, and can be a match or mismatch when looking into further bytes.

We have two different code branches for needles within 16 bytes and longer needles. The bug is in the longer needles branch.

Short needle branch

The 16 or less bytes branch can have SSE4.2 match as confirmed, when the needle fits fully, or a match that needs further confirmation when beginning of the needle matched.

The search starts from the 16 bytes before the end, to have data for the first SSE4.2 instruction.

The whole haystack is split into three parts:

The last part, that is 16 bytes closest to the end. Here we can have only confirmed match, as whatever doesn't fit, doesn't have any remaining part.
The middle part 16 bytes sequences further on. Here we can have match that needs confirmation. In this case, 16 bytes are loaded from the potential match offset, and compared against the needle, using valid bytes mask. The load is safe, as we already passed the last part here, so we certainly have more than 16 bytes available.
The first part. Any remaining bytes that are less than 16 bytes. The same as middle part, except that we know that some matches are not possible. Mask them out to avoid checking again.

Long needle branch

Any match needs further confirmation.

The search starts from the byte offset from the end that can have first match. This offset is greater than 16 bytes.

The whole haystack split into two parts:

The main part. For any match starting from nonzero offset we load 16 bytes and compare with needle beginning, then memcmp the remaining. For match starting from zero offset we can go directly to memcmp, as first 16 bytes are already checked
The first part. Again, any remaining bytes, again, no difference, except the mask.

Comparing first 16 bytes before going to memcmp is an optimization to check first 16 bytes faster. We already have the needle in a vector register, and we know that there are more than 16 bytes, so we can compare them faster, and skip memcmp at all, if there's a mismatch.

Meet the bug

On the first iteration of main part, we have a possibility that a match has nonzero offset. In this case, either the 16 bytes check or memcmp would do out-of-range read.

💥 The Impact

ASan failure when building STL with ASan
Potentially unallocated memory read
False match when the characters beyond the end happen to match

💊 The Fix

Here are two ways possible:

Make the last part handled separately, like in short needles branch, with only zero offset match being valid
Start search from the earlier offset, so that any match is valid

Whereas the second approach is more appealing from the performance perspective, it is harder to reason about. One of the difficulties of the second approach is that skipping more bytes than 16 may result in offset before the beginning. In this case we have to do one matching, with beginning offset, and apply proper mask to the result of such matching.

So, this fix implements the first approach. It uses pxor/ptest for matching, as to match only first offset we don't need pcmp*str*, and can use faster instructions, and if they confirm, uses memcmp for the rest.

⏱️ The Fix Impact

About the same results in the benchmark. Within the usual variation.

…t#5041)" This reverts commit ca1553d.

StephanTLavavej · 2024-10-29T19:43:58Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-30T14:35:15Z

Thanks for figuring out how to fix this! 🛠️ 🧠 🪄

AlexGuteniev added 3 commits October 26, 2024 05:57

Revert "Revert product changes for find_end vectorization (microsof…

a0e3cfc

…t#5041)" This reverts commit ca1553d.

fix error found by ASan vector annotation

e21e55f

constistency

80e88cf

AlexGuteniev requested a review from a team as a code owner October 26, 2024 04:53

StephanTLavavej added the performance Must go faster label Oct 26, 2024

StephanTLavavej self-assigned this Oct 26, 2024

StephanTLavavej approved these changes Oct 26, 2024

View reviewed changes

StephanTLavavej removed their assignment Oct 26, 2024

StephanTLavavej mentioned this pull request Oct 26, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 29, 2024

StephanTLavavej merged commit cb1e359 into microsoft:main Oct 30, 2024
39 checks passed

AlexGuteniev deleted the find-end-safely branch October 30, 2024 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize `find_end`, make sure ASan passes #5042

Vectorize `find_end`, make sure ASan passes #5042

AlexGuteniev commented Oct 26, 2024 •

edited by StephanTLavavej

Loading

StephanTLavavej commented Oct 29, 2024

StephanTLavavej commented Oct 30, 2024

Vectorize find_end, make sure ASan passes #5042

Vectorize find_end, make sure ASan passes #5042

Conversation

AlexGuteniev commented Oct 26, 2024 • edited by StephanTLavavej Loading

🦠 The Bug

The overall structure

Short needle branch

Long needle branch

Meet the bug

💥 The Impact

💊 The Fix

⏱️ The Fix Impact

StephanTLavavej commented Oct 29, 2024

StephanTLavavej commented Oct 30, 2024

Vectorize `find_end`, make sure ASan passes #5042

Vectorize `find_end`, make sure ASan passes #5042

AlexGuteniev commented Oct 26, 2024 •

edited by StephanTLavavej

Loading