Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize find_end, make sure ASan passes #5042

Merged
merged 3 commits into from
Oct 30, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Oct 26, 2024

Re-try of #4943

Reverts #5041 and fixes the bug that triggered ASan.

🦠 The Bug

The overall structure

We do SSE4.2 checks that search for up to 16 bytes sequence in a 16 bytes sequence. Such search can result in a match if it is a full match, or if it is a match of the beginning, and can be a match or mismatch when looking into further bytes.

We have two different code branches for needles within 16 bytes and longer needles. The bug is in the longer needles branch.

Short needle branch

The 16 or less bytes branch can have SSE4.2 match as confirmed, when the needle fits fully, or a match that needs further confirmation when beginning of the needle matched.

The search starts from the 16 bytes before the end, to have data for the first SSE4.2 instruction.

The whole haystack is split into three parts:

  1. The last part, that is 16 bytes closest to the end. Here we can have only confirmed match, as whatever doesn't fit, doesn't have any remaining part.
  2. The middle part 16 bytes sequences further on. Here we can have match that needs confirmation. In this case, 16 bytes are loaded from the potential match offset, and compared against the needle, using valid bytes mask. The load is safe, as we already passed the last part here, so we certainly have more than 16 bytes available.
  3. The first part. Any remaining bytes that are less than 16 bytes. The same as middle part, except that we know that some matches are not possible. Mask them out to avoid checking again.

Long needle branch

Any match needs further confirmation.

The search starts from the byte offset from the end that can have first match. This offset is greater than 16 bytes.

The whole haystack split into two parts:

  1. The main part. For any match starting from nonzero offset we load 16 bytes and compare with needle beginning, then memcmp the remaining. For match starting from zero offset we can go directly to memcmp, as first 16 bytes are already checked
  2. The first part. Again, any remaining bytes, again, no difference, except the mask.

Comparing first 16 bytes before going to memcmp is an optimization to check first 16 bytes faster. We already have the needle in a vector register, and we know that there are more than 16 bytes, so we can compare them faster, and skip memcmp at all, if there's a mismatch.

Meet the bug

On the first iteration of main part, we have a possibility that a match has nonzero offset. In this case, either the 16 bytes check or memcmp would do out-of-range read.

💥 The Impact

  • ASan failure when building STL with ASan
  • Potentially unallocated memory read
  • False match when the characters beyond the end happen to match

💊 The Fix

Here are two ways possible:

  1. Make the last part handled separately, like in short needles branch, with only zero offset match being valid
  2. Start search from the earlier offset, so that any match is valid

Whereas the second approach is more appealing from the performance perspective, it is harder to reason about. One of the difficulties of the second approach is that skipping more bytes than 16 may result in offset before the beginning. In this case we have to do one matching, with beginning offset, and apply proper mask to the result of such matching.

So, this fix implements the first approach. It uses pxor/ptest for matching, as to match only first offset we don't need pcmp*str*, and can use faster instructions, and if they confirm, uses memcmp for the rest.

⏱️ The Fix Impact

About the same results in the benchmark. Within the usual variation.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner October 26, 2024 04:53
@StephanTLavavej StephanTLavavej added the performance Must go faster label Oct 26, 2024
@StephanTLavavej StephanTLavavej self-assigned this Oct 26, 2024
@StephanTLavavej StephanTLavavej removed their assignment Oct 26, 2024
@StephanTLavavej StephanTLavavej self-assigned this Oct 29, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit cb1e359 into microsoft:main Oct 30, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

Thanks for figuring out how to fix this! 🛠️ 🧠 🪄

@AlexGuteniev AlexGuteniev deleted the find-end-safely branch October 30, 2024 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants