
Calculating columns is slow #4

Closed
fflorent opened this issue May 28, 2017 · 12 comments

From the benchmark:

running 6 tests
test bench_slice                             ... bench:          44 ns/iter (+/- 0)
test bench_slice_columns_only                ... bench:         250 ns/iter (+/- 1)
test bench_slice_columns_only_for_ascii_text ... bench:         250 ns/iter (+/- 3)
test bench_slice_from                        ... bench:          44 ns/iter (+/- 0)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 0)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 0)

There must be a way to count non-ASCII chars correctly and faster than that.

Also see this remark from the author of bytecount: rust-lang/rust#37888 (comment)
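
For reference, here is a sketch of the kind of naive counting we are talking about (a sketch only, not necessarily the exact current code):

fn naive_utf8_column(before_self: &[u8]) -> Result<usize, std::str::Utf8Error> {
    // Validate the whole slice, then walk it again to count code points.
    Ok(std::str::from_utf8(before_self)?.chars().count() + 1)
}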

fflorent (Owner) commented Jun 3, 2017

See also llogiq/bytecount#12

CAD97 (Contributor) commented Dec 19, 2017

Below is a fair amount of analysis I did in about 30 minutes. For reference, my initial benchmarks:

Initial Benchmarks

Trial 1

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 40)
test bench_slice_columns_only                ... bench:         482 ns/iter (+/- 307)
test bench_slice_columns_only_for_ascii_text ... bench:         105 ns/iter (+/- 10)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 51)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 2)

Trial 2

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 23)
test bench_slice_columns_only                ... bench:         482 ns/iter (+/- 216)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 11)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 84)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 0)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 1)

Trial 3

running 6 tests
test bench_slice                             ... bench:          92 ns/iter (+/- 41)
test bench_slice_columns_only                ... bench:         480 ns/iter (+/- 51)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 94)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 66)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 4)

UTF-8 encoding, for reference:

Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4
1               | 7                   | U+0000           | U+007F          | 0xxxxxxx |          |          |
2               | 11                  | U+0080           | U+07FF          | 110xxxxx | 10xxxxxx |          |
3               | 16                  | U+0800           | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
4               | 21                  | U+10000          | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx

Every trailing byte has the form 10xxxxxx. So I set get_column_utf8 to:

pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    // Count every byte that is not a continuation byte (10xxxxxx).
    Ok(before_self.iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1)
}

And got these results from the bench:

Count leading UTF-8 bytes

Trial 1

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 14)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 82)
test bench_slice_columns_only_for_ascii_text ... bench:         101 ns/iter (+/- 9)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 47)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 5)

Trial 2

running 6 tests
test bench_slice                             ... bench:          90 ns/iter (+/- 17)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 4)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 97)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 3)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 0)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 1)

Trial 3

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 2)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 57)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 32)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 74)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 4)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 3)

Note that, unlike the current implementation, this approach does not guarantee that the slice is valid UTF-8. If we run it through from_utf8 again, we see that most of the overhead is the initial check for well-formedness:

pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(std::str::from_utf8(before_self)?
        .as_bytes().iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1)
}
Manually count leading UTF-8 bytes after UTF-8 check

Trial 1

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 43)
test bench_slice_columns_only                ... bench:         481 ns/iter (+/- 47)
test bench_slice_columns_only_for_ascii_text ... bench:          99 ns/iter (+/- 87)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 51)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 2)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 5)

Trial 2

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 18)
test bench_slice_columns_only                ... bench:         481 ns/iter (+/- 295)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 93)
test bench_slice_from                        ... bench:          90 ns/iter (+/- 80)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 0)

Trial 3

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 83)
test bench_slice_columns_only                ... bench:         480 ns/iter (+/- 138)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 94)
test bench_slice_from                        ... bench:          90 ns/iter (+/- 45)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 3)

And just to show that this gets optimized to the proper iterative algorithm, here's the ASM output from the playground:

#[inline(never)]
fn count_char(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1
}
Release ASM for above
playground::count_char:
	leaq	str.0(%rip), %rcx
	xorl	%esi, %esi
	leaq	str.0+653(%rip), %r8
	jmp	.LBB0_1
.LBB0_3:
	movzbl	(%rcx), %edx
	movzbl	1(%rcx), %esi
	andb	$-64, %dl
	xorl	%edi, %edi
	cmpb	$-128, %dl
	setne	%dil
	addq	%rax, %rdi
	andb	$-64, %sil
	xorl	%eax, %eax
	cmpb	$-128, %sil
	setne	%al
	addq	%rdi, %rax
	movzbl	2(%rcx), %edx
	andb	$-64, %dl
	xorl	%esi, %esi
	cmpb	$-128, %dl
	setne	%sil
	addq	%rax, %rsi
	addq	$3, %rcx
.LBB0_1:
	movzbl	(%rcx), %edx
	andb	$-64, %dl
	xorl	%eax, %eax
	cmpb	$-128, %dl
	setne	%al
	addq	%rsi, %rax
	incq	%rcx
	cmpq	%r8, %rcx
	jne	.LBB0_3
	incq	%rax
	retq
.Lfunc_end0:

CAD97 (Contributor) commented Dec 19, 2017

As you can see, the majority of the time spent in get_column_utf8 is actually spent checking that we have a valid UTF-8 slice. In the case where we already know this (the LocatedSpan is backed by a &str), we should definitely elide the UTF-8 check, as that makes computing the character position nearly the same cost as the byte position.

The question is whether we want to keep the UTF-8 check when we don't have that guarantee. It's not unsafe -- this operates at the individual byte level.
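
For illustration, the leading-byte count runs fine on a deliberately malformed slice; it just returns a number that means little (a small standalone sketch):

fn count_leading_bytes(bytes: &[u8]) -> usize {
    // Count every byte that is not a continuation byte (10xxxxxx), plus one.
    bytes.iter().filter(|&&byte| (byte >> 6) != 0b10).count() + 1
}

fn main() {
    // 0xFF can never appear in UTF-8 and 0x80 is a stray continuation byte,
    // yet nothing panics: 0xFF and 0x41 are counted, giving 2 + 1 = 3.
    assert_eq!(count_leading_bytes(&[0xFF, 0x80, 0x41]), 3);
}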

CAD97 (Contributor) commented Dec 20, 2017

One more option I decided should be tested: use the Chars iterator to count, rather than doing the leading-byte check manually, but via the unsafe from_utf8_unchecked. This option is unsafe for malformed UTF-8 and is for comparison purposes only.

pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(unsafe {std::str::from_utf8_unchecked(before_self)}
        .chars()
        .count() + 1)
}
Benchmark trials

Trial 1

running 6 tests
test bench_slice                             ... bench:          95 ns/iter (+/- 45)
test bench_slice_columns_only                ... bench:         114 ns/iter (+/- 98)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 62)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 80)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 3)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 2)

Trial 2

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 22)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 54)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 36)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 5)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 0)

Trial 3

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 49)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 95)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 98)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 56)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 3)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 1)

Comparable to manually counting non-trailing bytes. If we only cared about the case where we know we have valid UTF-8, I'd recommend this method: it relies just the same on well-formed UTF-8 for a valid answer, but doesn't require this code to know anything about the details of UTF-8. However, it is unsafe with malformed UTF-8.

CAD97 (Contributor) commented Dec 20, 2017

On another reading of rust-lang/rust#37888, I realized that the manual solution is actually the same as the improved .chars().count(). It might be worth figuring out a way to provide a -> Result<usize, Utf8Error> version for unknown utf8-ness, an unsafe -> usize version for unknown utf8-ness, and a safe -> usize version for known utf8-safe slices that uses str::from_utf8_unchecked.

So, stubbed out with no checking:

unsafe trait UTF8Safe {}
unsafe impl UTF8Safe for &str {}
unsafe impl UTF8Safe for nom::CompleteStr {}
unsafe impl<T: UTF8Safe> UTF8Safe for LocatedSpan<T> {}

impl<T> LocatedSpan<T> {
    fn get_byte_column(&self) -> usize {
        self.get_columns_and_bytes_before().0
    }

    fn checked_get_char_column(&self) -> Result<usize, Utf8Error> {
        let before = self.get_columns_and_bytes_before().1;
        Ok(str::from_utf8(before)?
            .chars()
            .count() + 1)
    }

    unsafe fn unchecked_get_char_column(&self) -> usize {
        let before = self.get_columns_and_bytes_before().1;
        str::from_utf8_unchecked(before)
            .chars()
            .count() + 1
    }
}

impl<T> LocatedSpan<T> where T: UTF8Safe {
    fn get_char_column(&self) -> usize {
        unsafe { self.unchecked_get_char_column() }
    }
}

If negative trait bounds were stable and worked the way I think they do, I'd remove LocatedSpan::checked_get_char_column and implement get_char_column for T: !UTF8Safe.

Note that while the current implementation of str::from_utf8_unchecked(bytes).chars().count() happens to be safe for arbitrary bytes, it is not guaranteed to remain so.

CAD97 (Contributor) commented Dec 24, 2017

bytecount has merged a count for characters: llogiq/bytecount#29

Semantics are:

  • Takes a byte slice
  • If the slice is correct UTF-8
    • Returns the number of Unicode code points in the string
  • Else
    • Returns an arbitrary number less than or equal to the length of the slice

we win against naive count from ~80 chars upwards.

I'll make a new release soon.
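
A quick illustration of those semantics (assuming the upcoming release exposes num_chars as in the linked PR):

extern crate bytecount;

fn main() {
    // Valid UTF-8: "naïve" is 6 bytes but 5 code points.
    assert_eq!(bytecount::num_chars("naïve".as_bytes()), 5);

    // Malformed input (truncated 3-byte sequence): no panic, just some
    // number that is at most the length of the slice.
    let broken = [0xE3, 0x83, 0x61];
    assert!(bytecount::num_chars(&broken) <= broken.len());
}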

Veedrac commented Dec 24, 2017

we win against naive count from ~80 chars upwards.

Note that these benchmarks don't bother measuring the fixed mispredict overhead, since it's not relevant to tuning, so the crossover point is going to be slightly later and the loss at short lengths will be larger.

fflorent (Owner) commented Jan 2, 2018

Sorry for having taken so long to react to this. Thanks for your comments and investigations! I will read all of this carefully and answer you very soon.

Florent

fflorent (Owner) commented Jan 2, 2018

I think I get one of your arguments. Indeed, that function can assume the bytes are well-formed UTF-8. Otherwise, in the worst case, it would just return an inconsistent length, as long as it doesn't crash. No need to return a Result<usize, _> then.

So, if I understand the whole topic correctly, between your investigation and the latest work in bytecount, two options are proposed:

  • a naive algorithm that does everything correctly and is fast enough up to ~80 chars (or a bit beyond, according to Veedrac)
  • an optimized algorithm beyond that

I am torn between the naive algorithm and the optimized one from bytecount. Some thoughts:

  • the pros of the naive algorithm:
    • for regular source code, lines exceeding 200 characters tend to be rare
    • this is a function called by the user, and therefore only occasionally, so it may not drastically degrade performance
  • the pros of the optimized (hyper) algorithm:
    • obfuscated / minified source code, as in JavaScript, exists. In production it rarely contains newlines (except for comments), and the column count can easily exceed 100,000.

As a JS developer, I would not neglect that last point. Maybe it is worth offering a feature to choose one solution or the other (if I read the benches correctly, the hyper algorithm is much faster for counting 100,000 chars).
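
A rough sketch of what offering both could look like (method names and the exact bound on T are placeholders only):

impl<T: nom::AsBytes> LocatedSpan<T> {
    /// Naive counting; fine for typical short source lines.
    pub fn naive_get_utf8_column(&self) -> usize {
        bytecount::naive_num_chars(self.get_columns_and_bytes_before().1) + 1
    }

    /// bytecount's "hyper" counting; wins on very long lines (minified JS, etc.).
    pub fn get_utf8_column(&self) -> usize {
        bytecount::num_chars(self.get_columns_and_bytes_before().1) + 1
    }
}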

What do you think? And @CAD97, for your own use case, which solution would you prefer?

Florent

CAD97 (Contributor) commented Jan 6, 2018

I think I get one of your arguments. Indeed, that function can assume the bytes are well-formed UTF-8. Otherwise, in the worst case, it would just return an inconsistent length, as long as it doesn't crash. No need to return a Result<usize, _> then.

👍 You interpreted my information dump correctly 🎉

If we could safely reuse std::Chars::count on arbitrary bytes, I'd recommend that, as it's the simplest option. However, it is not guaranteed to be safe. Since you need an algorithm that works on arbitrary bytes, your options are effectively 1) reimplement bytecount::naive_num_chars and the ASCII handling that comes with it, or 2) use bytecount.

If you pull in bytecount, I think you'd want to at least offer the hyper algorithm. But also keep in mind that the huge speedups come from SIMD, and you'll need to pass on the two SIMD cargo features to your consumers as well. @Veedrac could say better than I can what the performance deltas look like without any SIMD features enabled.
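
Roughly, the passthrough in nom_locate's Cargo.toml would look like this (feature names from memory; double-check them against the bytecount release you depend on, and the version number is only illustrative):

[dependencies]
bytecount = "0.3"

[features]
simd-accel = ["bytecount/simd-accel"]
avx-accel = ["bytecount/avx-accel"]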

I don't really feel comfortable recommending one way or the other. My use case probably expects short lines of at most around 120 characters. It's a small DSL, probably for my own use only, as an experiment in language design more than anything else. I would probably argue to err on the side of favoring shorter lines; human code writers usually favor shorter lines for legibility.


The other point is that we need to consider what IDEs expect. The primary use of line:column information is to be able to navigate to the problem. TL;DR: code point index is (probably) the correct behavior here. For my wonderful sample size of one, here is what IntelliJ IDEA does:

Sample text:
メカジキ
aääAÄÄ

UTF-8 bytes:
E383A1 E382AB E382B8 E382AD 0A
61 C3A4 61 CC88 41 C384 41 CC88 0A

Code points:
[U+30E1 Katakana Letter Me] [U+30AB Katakana Letter Ka] [U+30B8 Katakana Letter Zi] [U+30AD Katakana Letter Ki] [U+000A New Line]
[U+0061 Latin Small Letter A] [U+00E4 Latin Small Letter A with Diaeresis] [U+0061 Latin Small Letter A] [U+0308 Combining Diaeresis] [U+0041 Latin Capital Letter A] [U+00C4 Latin Capital Letter A with Diaeresis] [U+0041 Latin Capital Letter A] [U+0308 Combining Diaeresis] [U+000A New Line]

Columns IDEA reports (caret positions):
[1:1]メ[1:2]カ[1:3]ジ[1:4]キ[1:5]
[2:1]a[2:2]ä[2:3]ä[2:5]A[2:6]Ä[2:7]Ä[2:9]
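
A quick check of the second line above, written with explicit escapes so the combining marks are unambiguous:

fn main() {
    // a, ä (precomposed), a + U+0308, A, Ä (precomposed), A + U+0308
    let line = "a\u{E4}a\u{308}A\u{C4}A\u{308}";
    assert_eq!(line.len(), 12);           // bytes: 61 C3A4 61 CC88 41 C384 41 CC88
    assert_eq!(line.chars().count(), 8);  // code points; IDEA's end-of-line caret is [2:9]
}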

Veedrac commented Jan 7, 2018

@CAD97 bytecount::num_chars is, very roughly speaking, 10x the speed of naive_num_chars over long slices even without explicit SIMD. Not wanting a dependency on SIMD does not necessarily mean num_chars is unsuited.

More information about timings is available at llogiq/bytecount#29, though note that the non-SIMD variant has been improved since then.

fflorent (Owner) commented Jan 7, 2018

Well, I also wonder whether it wouldn't be reasonable to let the user make the choice. If we offer both functions (calling either the naive or the hyper version of num_chars), they could use one or the other depending on the input (long lines expected or not), and we could tell them in the documentation which one they may prefer.

Also, if I am not wrong, bytecount is lightweight enough that we could consider depending on it directly in nom_locate without much regret.

But also keep in mind that the huge speedups come from SIMD, and you'll need to pass on the two SIMD cargo features to your consumers as well

Building lexers/parsers with column support may justify such an optimization.

@CAD97 Are you interested in that work?
