Calculating columns is slow #4
See also llogiq/bytecount#12
A lot of analysis that I did in like 30 minutes below. For reference, my initial benchmarks:

*Initial Benchmarks (Trial 1, Trial 2, Trial 3)*

UTF-8 encoding, for reference:

- 1-byte sequence: `0xxxxxxx`
- 2-byte sequence: `110xxxxx 10xxxxxx`
- 3-byte sequence: `1110xxxx 10xxxxxx 10xxxxxx`
- 4-byte sequence: `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

Every trailing (continuation) byte starts with `0b10`, so I tried:

```rust
pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(before_self.iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1)
}
```

And got these results from the bench:

*Count leading UTF-8 bytes (Trial 1, Trial 2, Trial 3)*
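To illustrate the bit trick on its own, here is a minimal standalone sketch (the helper `count_leading_bytes` is hypothetical, not part of nom_locate):

```rust
// UTF-8 continuation bytes are exactly the bytes of the form 0b10xxxxxx,
// so counting every byte that is NOT of that form counts scalar values
// (assuming the input is valid UTF-8).
fn count_leading_bytes(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b >> 6) != 0b10).count()
}

fn main() {
    let s = "héllo"; // 'é' is two bytes (0xC3 0xA9): 6 bytes, 5 chars
    assert_eq!(count_leading_bytes(s.as_bytes()), s.chars().count());
}
```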
Note that this approach does not guarantee that the slice is valid UTF-8 like the current implementation does. If we run it through `std::str::from_utf8` first:

```rust
pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(std::str::from_utf8(before_self)?
        .as_bytes().iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1)
}
```

*Manually count leading UTF-8 bytes after UTF-8 check (Trial 1, Trial 2, Trial 3)*
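To make the behavioral difference concrete, here is a small sketch (free functions standing in for the methods above, names hypothetical) of what each variant does with malformed input:

```rust
use std::str::Utf8Error;

// Counts leading bytes directly; happily returns a number even for invalid UTF-8.
fn count_unchecked(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b >> 6) != 0b10).count() + 1
}

// Validates first; invalid UTF-8 surfaces as an error instead of a bogus count.
fn count_checked(bytes: &[u8]) -> Result<usize, Utf8Error> {
    Ok(std::str::from_utf8(bytes)?.chars().count() + 1)
}

fn main() {
    let bad: [u8; 2] = [0xE2, 0x98]; // truncated 3-byte sequence, not valid UTF-8
    assert_eq!(count_unchecked(&bad), 2); // plausible-looking but meaningless
    assert!(count_checked(&bad).is_err()); // validation catches it
}
```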
And just to show that this gets optimized to the proper iterative algorithm, here's the ASM output from the playground:

```rust
#[inline(never)]
fn count_char(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1
}
```

Release ASM for the above:

```asm
playground::count_char:
leaq str.0(%rip), %rcx
xorl %esi, %esi
leaq str.0+653(%rip), %r8
jmp .LBB0_1
.LBB0_3:
movzbl (%rcx), %edx
movzbl 1(%rcx), %esi
andb $-64, %dl
xorl %edi, %edi
cmpb $-128, %dl
setne %dil
addq %rax, %rdi
andb $-64, %sil
xorl %eax, %eax
cmpb $-128, %sil
setne %al
addq %rdi, %rax
movzbl 2(%rcx), %edx
andb $-64, %dl
xorl %esi, %esi
cmpb $-128, %dl
setne %sil
addq %rax, %rsi
addq $3, %rcx
.LBB0_1:
movzbl (%rcx), %edx
andb $-64, %dl
xorl %eax, %eax
cmpb $-128, %dl
setne %al
addq %rsi, %rax
incq %rcx
cmpq %r8, %rcx
jne .LBB0_3
incq %rax
retq
.Lfunc_end0:
```
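For reference (my reading of the ASM, not from the original thread): `andb $-64` masks with `0xC0` and `cmpb $-128` compares against `0x80`, i.e. the compiler rewrites the shift into the equivalent mask form:

```rust
// Equivalent mask formulation of `(byte >> 6) != 0b10`:
// keep the top two bits and check that they are not `10`.
fn is_leading_byte(byte: u8) -> bool {
    (byte & 0b1100_0000) != 0b1000_0000
}
```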
As you can see, the majority of the time is spent in the UTF-8 validity check itself. The question is whether we want to keep the UTF-8 check when we don't have that guarantee. It's not unsafe -- this operates at the individual byte level.
One more option I decided should be tested: use `str`'s `chars()` iterator directly:

```rust
pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(unsafe { std::str::from_utf8_unchecked(before_self) }
        .chars()
        .count() + 1)
}
```

*Benchmark trials (Trial 1, Trial 2, Trial 3)*
Comparable to manually checking for non-trailing bytes. If we were only concerned about the case where we know we have valid UTF-8, I'd recommend this method: it relies on well-formed UTF-8 for a valid answer just the same, but doesn't require this code to know anything about the details of UTF-8. But it is unsafe with malformed UTF-8.
Per another reading of rust-lang/rust#37888, I realized that the manual solution is actually the same as the improved implementation discussed there. So, stubbed with no checking:

```rust
unsafe trait UTF8Safe {}
unsafe impl<'a> UTF8Safe for &'a str {}
unsafe impl<'a> UTF8Safe for nom::CompleteStr<'a> {}
unsafe impl<T: UTF8Safe> UTF8Safe for LocatedSpan<T> {}

impl<T> LocatedSpan<T> {
    fn get_byte_column(&self) -> usize {
        self.get_columns_and_bytes_before().0
    }

    fn checked_get_char_column(&self) -> Result<usize, Utf8Error> {
        let before = self.get_columns_and_bytes_before().1;
        Ok(str::from_utf8(before)?
            .chars()
            .count() + 1)
    }

    unsafe fn unchecked_get_char_column(&self) -> usize {
        let before = self.get_columns_and_bytes_before().1;
        str::from_utf8_unchecked(before)
            .chars()
            .count() + 1
    }
}

impl<T> LocatedSpan<T>
where
    T: UTF8Safe,
{
    fn get_char_column(&self) -> usize {
        unsafe { self.unchecked_get_char_column() }
    }
}
```

If negative trait bounds were stable and working like I think they do, I'd remove … Note that while …
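For what it's worth, here is how a caller might use the API sketched above (purely hypothetical, building on the stubbed trait and methods, not an existing nom_locate API):

```rust
// Hypothetical usage of the sketch above: &str is marked UTF8Safe, so the
// safe wrapper can delegate to the unchecked character count.
fn report(span: LocatedSpan<&str>) {
    let byte_col = span.get_byte_column(); // always available, no UTF-8 assumption
    let char_col = span.get_char_column(); // requires the UTF8Safe bound
    eprintln!("error at column {} (byte offset {})", char_col, byte_col);
}
```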
bytecount has merged a count for characters: llogiq/bytecount#29. The semantics are: it counts the non-continuation bytes of a `&[u8]`, so on valid UTF-8 it matches `str::chars().count()`, and on invalid input it still returns a (meaningless but memory-safe) number rather than erroring.
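Assuming the crate were pulled in, the column computation could then look something like this (a sketch, not the actual nom_locate code; `bytes_before_on_line` is a hypothetical helper returning the bytes of the current line before the span):

```rust
use bytecount::num_chars;

// Hypothetical: compute the 1-based character column from the bytes of the
// current line that precede the span, using bytecount's optimized counter.
fn char_column(bytes_before_on_line: &[u8]) -> usize {
    // num_chars counts non-continuation bytes: the char count on valid UTF-8,
    // and a meaningless-but-safe number on invalid input.
    num_chars(bytes_before_on_line) + 1
}
```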
Note that these benchmarks don't bother measuring the fixed mispredict overhead, since it's not relevant to tuning, so the crossover point is going to be slightly later and the loss at short lengths will be larger.
Sorry for having taken so long to react on this. Thanks for your comments and investigations! I will read all of this carefully and answer you very soon. Florent
I think I get one of your arguments. Indeed, that function can assume the bytes are well-formed UTF-8 chars. Otherwise, in the worst case, it would just return an inconsistent length, as long as it doesn't crash. No need to return a `Result`. So if I understand the whole topic correctly, in both your investigation and the latest work in bytecount, two options are proposed:

1. the naive algorithm (a simple count of the non-continuation bytes), and
2. the optimized ("hyper") algorithm from bytecount.
I am torn between the use of the naive algorithm and the optimized one from bytecount. Some thoughts: …

As a JS developer, I would not neglect that last point. Maybe it's worth offering a feature to choose one solution or the other (if I read the benches here correctly, the hyper algo is much faster for counting 100,000 chars). What do you think? And @CAD97, for your own use case, what solution would you prefer? Florent
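As an illustration of the feature idea suggested here (the feature name and function are hypothetical, not the actual nom_locate implementation), an opt-in could look like this:

```rust
// Hypothetical feature-gated counting inside nom_locate: a "bytecount" cargo
// feature (name assumed) switches between the dependency-free naive count
// and bytecount's optimized one.
#[cfg(feature = "bytecount")]
fn num_chars(bytes: &[u8]) -> usize {
    bytecount::num_chars(bytes)
}

#[cfg(not(feature = "bytecount"))]
fn num_chars(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b >> 6) != 0b10).count()
}
```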
👍 You interpreted my information dump correctly 🎉

If we could safely reuse …

If you pull in …

I don't really feel comfortable recommending one way or the other. My use case probably expects short lines of a maximum of around 120 characters. It's for a small DSL, probably for my own use only, as an experiment in language design more than anything else. I would probably argue to err on the side of favoring shorter lines; human code writers usually favor shorter lines for legibility.

The other point is that we need to consider what IDEs expect. The primary use of …
@CAD97 More information about timings is available at llogiq/bytecount#29, though note that the non-SIMD variant has been improved since then.
Well, I also wonder whether it wouldn't be reasonable to let the user make the choice. If we offer both functions (calling either the naive or the hyper version of the count), … Also, if I am not wrong, bytecount is lightweight enough to consider embedding directly in nom_locate without much regret.

Building lexers/parsers with column support may justify such an optimization. @CAD97 Are you interested in that work?
Judging from the benchmark, there must be a way to count non-ASCII chars correctly and faster than that.
Also see this remark from the author of bytecount: rust-lang/rust#37888 (comment)