Vectorize `find_end`, make sure ASan passes
#5042
Merged
Re-try of #4943
Reverts #5041 and fixes the bug that triggered ASan.
🦠 The Bug
The overall structure
Each SSE4.2 check searches for a sequence of up to 16 bytes within a 16-byte block. A check reports a match either when the whole sequence matches, or when only its beginning matches; in the latter case the further bytes may turn out to match or mismatch.
There are two separate code branches: one for needles of up to 16 bytes and one for longer needles. The bug is in the long needle branch.
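The three possible outcomes of one such check can be modeled in portable scalar code. This is only a sketch under stated assumptions: `WindowMatch`, `WindowResult` and `check_window` are hypothetical names, the needle is assumed to be 1 to 16 bytes long, and the loop stands in for what a single SSE4.2 string instruction would report; it is not the code from this PR.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical names for illustration only.
enum class WindowMatch { none, partial, full };

struct WindowResult {
    WindowMatch kind;
    std::size_t offset; // meaningful only when kind != WindowMatch::none
};

// Scalar model of one window check. Offsets are scanned from high to low,
// because find_end is interested in the last occurrence.
// Assumes 1 <= needle_len <= 16 and window_len <= 16.
inline WindowResult check_window(const unsigned char* window, std::size_t window_len,
                                 const unsigned char* needle, std::size_t needle_len) {
    for (std::size_t off = window_len; off-- > 0;) {
        const std::size_t avail = window_len - off;                      // bytes left in the window
        const std::size_t cmp   = needle_len < avail ? needle_len : avail;
        if (std::memcmp(window + off, needle, cmp) == 0) {
            // The needle fits fully: confirmed. Otherwise only its beginning
            // matched, and the caller must look at further bytes.
            return {cmp == needle_len ? WindowMatch::full : WindowMatch::partial, off};
        }
    }
    return {WindowMatch::none, 0};
}
```

A partial result is only a candidate: the caller still has to compare the bytes that lie beyond the window before accepting it.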
Short needle branch
In the branch for needles of 16 bytes or fewer, an SSE4.2 match is confirmed when the needle fits fully, and needs further confirmation when only the beginning of the needle matched.
The search starts 16 bytes before the end, so that the first SSE4.2 instruction has data to work on.
The whole haystack is split into three parts:
Long needle branch
Any match needs further confirmation.
The search starts at the last byte offset, counted from the end, at which a match can begin. This offset is greater than 16 bytes.
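The starting offset follows from a simple bound: a needle of `needle_len` bytes cannot begin later than `haystack_len - needle_len`. A minimal scalar sketch of that bound, and of a confirmation step that refuses candidates past it, might look as follows. The helper names `candidate_in_range` and `confirm_candidate` are hypothetical and do not come from the PR.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: can a needle of needle_len bytes start at pos
// without reading past the end of the haystack?
inline bool candidate_in_range(std::size_t haystack_len, std::size_t needle_len, std::size_t pos) {
    return needle_len <= haystack_len && pos <= haystack_len - needle_len;
}

// Hypothetical helper: confirm a candidate only after the range check,
// so the comparison never touches bytes beyond the haystack.
inline bool confirm_candidate(const unsigned char* haystack, std::size_t haystack_len,
                              const unsigned char* needle, std::size_t needle_len,
                              std::size_t pos) {
    if (!candidate_in_range(haystack_len, needle_len, pos)) {
        return false; // the needle would not fit here
    }
    return std::memcmp(haystack + pos, needle, needle_len) == 0;
}
```

Checking the range first is what keeps every later read inside the haystack, whichever comparison is used for the confirmation.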
The whole haystack is split into two parts:
`memcmp` the remaining. For a match starting at zero offset we can go directly to `memcmp`, as the first 16 bytes are already checked.
Comparing the first 16 bytes before going to `memcmp` is an optimization to check them faster. We already have the needle in a vector register, and we know that there are more than 16 bytes, so we can compare them faster and skip `memcmp` entirely if there's a mismatch.
Meet the bug
On the first iteration of the main part, a match may have a nonzero offset. In this case, either the 16-byte check or `memcmp` would read out of range.
💥 The Impact
💊 The Fix
There are two possible ways:
Although the second approach is more appealing from a performance perspective, it is harder to reason about. One difficulty is that skipping more than 16 bytes may result in an offset before the beginning. In that case we would have to do one matching from the beginning offset and apply a proper mask to its result.
So, this fix implements the first approach. It uses `pxor`/`ptest` for matching: to match only the first offset we don't need `pcmp*str*` and can use faster instructions. If they confirm the match, it uses `memcmp` for the rest.
⏱️ The Fix Impact
Benchmark results are about the same, within the usual variation.
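For readers unfamiliar with the `pxor`/`ptest` idiom used by the fix, the confirmation at a single offset can be sketched as below. This is an illustration under assumptions, not the PR's code: `equal_16_bytes`, `confirm_long_needle` and the `SKETCH_HAS_SSE41` macro are hypothetical names, the needle is reloaded from memory here although the real code already holds it in a vector register, and a `memcmp` fallback is used when SSE4.1 is not enabled at compile time.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__SSE4_1__) || (defined(_MSC_VER) && !defined(__clang__) && (defined(_M_X64) || defined(_M_IX86)))
#include <smmintrin.h> // SSE4.1: _mm_testz_si128 (ptest)
#define SKETCH_HAS_SSE41 1
#else
#define SKETCH_HAS_SSE41 0
#endif

// Returns true when the 16 bytes at a and b are identical.
inline bool equal_16_bytes(const void* a, const void* b) {
#if SKETCH_HAS_SSE41
    const __m128i va   = _mm_loadu_si128(static_cast<const __m128i*>(a));
    const __m128i vb   = _mm_loadu_si128(static_cast<const __m128i*>(b));
    const __m128i diff = _mm_xor_si128(va, vb); // pxor: zero wherever the bytes agree
    return _mm_testz_si128(diff, diff) != 0;    // ptest: nonzero result when diff is all zero
#else
    return std::memcmp(a, b, 16) == 0; // portable fallback
#endif
}

// Confirms a needle longer than 16 bytes at exactly one candidate position.
// Precondition (the caller's responsibility): candidate + needle_len does not
// pass the end of the haystack, and needle_len > 16.
inline bool confirm_long_needle(const unsigned char* candidate,
                                const unsigned char* needle, std::size_t needle_len) {
    if (!equal_16_bytes(candidate, needle)) {
        return false; // mismatch in the first 16 bytes: memcmp is skipped
    }
    return std::memcmp(candidate + 16, needle + 16, needle_len - 16) == 0;
}
```

Because only one offset is compared, a plain 16-byte equality test is enough, which is why the string-search instructions are not needed at this step.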