Add method to count UTF-8 chars #12
It sounds faster to use …
Why? We can use …
I expect we should be using the "Equal Ordered" mode. The challenge here is simply that this does not consume the whole 16 bytes each cycle, which means either we need to use unaligned reads or deal with characters crossing the 16-byte blocks. Having slept since the last comment, it's obvious to me that unaligned reads are the sensible bet, as Intel hardware has fast unaligned loads and it doesn't even prevent running multiple in parallel.
Nothing in performance is obvious until measured. 😄
I don't completely understand the Equal Ordered mode. OTOH the Range mode looks like a viable solution, using the range argument: …
I should clarify how the Equal Ordered mode should be used. Consider the example given: …
When we're counting chars, it's going to look more like …
and you can just count the set bits on that. Note, though, that you might have unaligned characters, e.g. something like …
which means the next search can't jump a whole vector along. It needs to include the trailing bytes in the next search. Ergo the suggestion to use unaligned reads, which are fast on Intel CPUs.
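To make the "produce a per-block mask, then count the set bits" structure concrete, here is a rough sketch using Rust's `std::arch` SSE4.2 intrinsics. It uses the Range comparison (matching continuation bytes `0x80..=0xBF`) rather than Equal Ordered, since a per-byte classification never has to worry about characters straddling a block; the function name, loop structure, and tail handling are my own assumptions, not code from this thread:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn count_utf8_sse42(haystack: &[u8]) -> usize {
    use std::arch::x86_64::*;

    // PCMPESTRM pattern for Range mode: match any byte in 0x80..=0xBF,
    // i.e. a UTF-8 continuation byte. Only the first two lanes are used.
    let range = _mm_setr_epi8(
        0x80u8 as i8, 0xBFu8 as i8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    );
    const MODE: i32 = _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_BIT_MASK;

    let mut continuation = 0usize;
    let mut i = 0usize;
    while i + 16 <= haystack.len() {
        // Unaligned 16-byte load; cheap on recent Intel hardware, and because
        // every byte is classified independently, blocks never need to overlap.
        let block = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
        // Bit j of the low 16 bits is set iff byte j of `block` is a continuation byte.
        let mask = _mm_cmpestrm::<MODE>(range, 2, block, 16);
        continuation += (_mm_extract_epi32::<0>(mask) as u32).count_ones() as usize;
        i += 16;
    }
    // Scalar tail for the last < 16 bytes.
    continuation += haystack[i..].iter().filter(|&&b| b & 0xC0 == 0x80).count();
    haystack.len() - continuation
}
```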
I implemented a very simplistic version of this with no simd acceleration: CAD97/bytecount@77c0bad

```rust
pub fn naive_count_utf8(haystack: &[u8]) -> usize {
    // UTF-8 is self-aligning: all trailing bytes are encoded as 0b10_xxx_xxx
    // So to count code points, we count non-trailing bytes 0b11_xxx_xxx or 0b0x_xxx_xxx
    haystack.iter().filter(|&&byte| (byte >> 6) != 0b10).count()
}
```

Because all trailing bytes are encoded as `0b10_xxx_xxx`, counting the non-trailing bytes counts the code points. But the interesting part is how this compares against the stdlib's solution of `.chars().count()`:

```rust
use rand::Rng; // gen_iter comes from the old (pre-0.5) rand API

unsafe fn random_str_safeish_bytes(len: usize) -> String {
    let mut bytes = rand::thread_rng().gen_iter::<u8>().take(len).collect::<Vec<_>>();
    for idx in len-4..len {
        bytes[idx] = 0;
    }
    String::from_utf8_unchecked(bytes)
}
```
```rust
macro_rules! bench {
    ($i: expr, $name_stdlib: ident, $name_naive: ident) => {
        fn $name_stdlib(b: &mut bencher::Bencher) {
            let haystack = unsafe { random_str_safeish_bytes($i) };
            b.iter(|| haystack.chars().count());
        }
        fn $name_naive(b: &mut bencher::Bencher) {
            let haystack = unsafe { random_str_safeish_bytes($i) };
            b.iter(|| naive_count_utf8(haystack.as_bytes()));
        }
    };
}
```

I set the last four random bytes to 0.
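(For reference, the macro would presumably be instantiated once per input size with the `bencher` crate's harness; the sizes and benchmark names below are my guesses, not taken from the original commit:)

```rust
bench!(100, stdlib_100, naive_100);
bench!(10_000, stdlib_10_000, naive_10_000);
bench!(1_000_000, stdlib_1_000_000, naive_1_000_000);

benchmark_group!(benches,
    stdlib_100, naive_100,
    stdlib_10_000, naive_10_000,
    stdlib_1_000_000, naive_1_000_000);
benchmark_main!(benches);
```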
If I'm reading it correctly, we get the same performance (within error margins) below somewhere around 10,000 bytes, at which point the manual approach becomes consistently marginally faster, though still within error margins. I don't know enough about SIMD to accelerate this, but I think it should be fairly easy to do. I agree with @llogiq here, that by my reading of …

EDIT: Also tried the following (with …):

```rust
pub fn naive_count_utf8_alt(haystack: &[u8]) -> usize {
    haystack.iter().fold(0usize, |n, &byte| n + ((byte >> 6) != 0b10) as usize)
}
```
@CAD97 If you're searching a … The answer here may very well be "assume valid UTF-8," and maybe that's OK! :) (This is perhaps not a comment directed toward you @CAD97, but rather toward the entire issue.)
What I did was to say that using the count function on invalid UTF-8 is safe but meaningless
I'm not intimately familiar with the lossy decoding rules, but I think this approach does not fully agree with the lossy decoding protocol. If I recall correctly, a lossy decoder should stop when encountering an unexpected byte, and replace the bytes it has eaten already with the replacement character. This means that a single trailing byte would be converted to a replacement character, whereas the naive algorithm I used would not count it. It would not be difficult to write a decoder that handled this, but it would necessarily be serial. I personally think the solution is to use an algorithm correct on UTF-8 but still safe on arbitrary bytes, such as the above …
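A small illustration of that discrepancy (my example, not from the thread), with `naive_count_utf8` repeated from earlier so it stands alone: a lone continuation byte decodes lossily to one replacement character, but contributes zero to the naive count.

```rust
fn naive_count_utf8(haystack: &[u8]) -> usize {
    haystack.iter().filter(|&&byte| (byte >> 6) != 0b10).count()
}

fn main() {
    let bytes = [0x80u8]; // a lone continuation byte: not valid UTF-8
    // Lossy decoding yields a single U+FFFD replacement character...
    assert_eq!(String::from_utf8_lossy(&bytes).chars().count(), 1);
    // ...while counting non-continuation bytes sees no chars at all.
    assert_eq!(naive_count_utf8(&bytes), 0);
}
```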
@CAD97 SGTM. :)
Oh, I just figured out the potential miscommunication between @llogiq / @Veedrac around … The Range count (count bytes …) is the one that answers this issue's question of counting the chars in a UTF-8 string. The Equal Ordered count is intended to count occurrences of a given character in a UTF-8 string. These are both useful pieces of information to be able to look for, though I think the former is much more likely to be used. Searching for occurrences of a given Unicode codepoint is fraught with its own perils, because of the codepoint/grapheme disconnect as well as other encoding gotchas. I think this issue should focus on the …
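To spell out the two different questions (my example, not from the thread): counting the chars in a string versus counting occurrences of one given char.

```rust
fn main() {
    let s = "naïve"; // 5 chars, 6 bytes ('ï' takes two bytes)

    // What this issue asks for: how many chars does the string contain?
    assert_eq!(s.chars().count(), 5);

    // The other reading: how often does a particular char occur?
    assert_eq!(s.chars().filter(|&c| c == 'a').count(), 1);
}
```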
Looking again at rust-lang/rust#37888 (I did look at it, I promise), the naive solution I used above is exactly what that PR introduced to the stdlib, so I'm not surprised that the performance was within margin of error every time. Any speedup we can offer will come from offering a version on byte slices without the UTF-8 check (see also fflorent/nom_locate#4) or from looking at more bytes at a time, à la …
Yup, I completely misunderstood the goal here. Thanks @CAD97 for the help! This should actually be really easy to retrofit onto the existing code, because of the tricks that have already been mentioned. You basically need to mask the lower bits and then return …
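A scalar sketch of that retrofit (my illustration, not the code from the actual PR): mask off everything but the top two bits, compare against the continuation-byte pattern `10`, and subtract the match count from the slice length.

```rust
/// Sketch only: a byte is a UTF-8 continuation byte iff
/// (byte & 0b1100_0000) == 0b1000_0000.
pub fn num_chars_sketch(haystack: &[u8]) -> usize {
    let continuation_bytes = haystack
        .iter()
        .filter(|&&b| b & 0b1100_0000 == 0b1000_0000)
        .count();
    haystack.len() - continuation_bytes
}
```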
Proof of concept is up at #29.
Slightly improved version in #30 – still not using pcmpestrm, so it could potentially work on AArch64, too.
Closed by #29 |
Basically we just need a bitwise AND or OR, a compare with the respective value whose highest bits are '10', and to subtract the count from the slice length.

This could also go into `.chars().count()` if it turns out not to lose too much time compared to naive on short strings.