
Calculating columns is slow #4

Closed
fflorent opened this issue May 28, 2017 · 12 comments

From the benchmark:

running 6 tests
test bench_slice                             ... bench:          44 ns/iter (+/- 0)
test bench_slice_columns_only                ... bench:         250 ns/iter (+/- 1)
test bench_slice_columns_only_for_ascii_text ... bench:         250 ns/iter (+/- 3)
test bench_slice_from                        ... bench:          44 ns/iter (+/- 0)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 0)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 0)

There must be a way to count non-ASCII chars correctly and faster than that.

Also see this remark from the author of bytecount: rust-lang/rust#37888 (comment)
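
For reference, here is a sketch of the kind of naive counting we are talking about (a sketch only, not necessarily the exact current code):

fn naive_utf8_column(before_self: &[u8]) -> Result<usize, std::str::Utf8Error> {
    // Validate the whole slice, then walk it again to count code points.
    Ok(std::str::from_utf8(before_self)?.chars().count() + 1)
}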

fflorent (Owner) commented Jun 3, 2017

See also llogiq/bytecount#12

CAD97 (Contributor) commented Dec 19, 2017

Below is a fair amount of analysis I did in about 30 minutes. For reference, my initial benchmarks:

Initial Benchmarks

Trial 1

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 40)
test bench_slice_columns_only                ... bench:         482 ns/iter (+/- 307)
test bench_slice_columns_only_for_ascii_text ... bench:         105 ns/iter (+/- 10)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 51)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 2)

Trial 2

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 23)
test bench_slice_columns_only                ... bench:         482 ns/iter (+/- 216)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 11)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 84)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 0)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 1)

Trial 3

running 6 tests
test bench_slice                             ... bench:          92 ns/iter (+/- 41)
test bench_slice_columns_only                ... bench:         480 ns/iter (+/- 51)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 94)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 66)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 4)

UTF-8 encoding, for reference:

Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4
1               | 7                   | U+0000           | U+007F          | 0xxxxxxx |          |          |
2               | 11                  | U+0080           | U+07FF          | 110xxxxx | 10xxxxxx |          |
3               | 16                  | U+0800           | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
4               | 21                  | U+10000          | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx

Every trailing byte has the form 10xxxxxx. So I set get_column_utf8 to:

pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    // Count every byte that is not a continuation byte (10xxxxxx).
    Ok(before_self.iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1)
}

And got these results from the bench:

Count leading UTF-8 bytes

Trial 1

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 14)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 82)
test bench_slice_columns_only_for_ascii_text ... bench:         101 ns/iter (+/- 9)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 47)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 5)

Trial 2

running 6 tests
test bench_slice                             ... bench:          90 ns/iter (+/- 17)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 4)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 97)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 3)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 0)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 1)

Trial 3

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 2)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 57)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 32)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 74)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 4)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 3)

Note that, unlike the current implementation, this approach does not guarantee that the slice is valid UTF-8. If we run it through from_utf8 again, we see that most of the overhead is the initial check for well-formedness:

pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(std::str::from_utf8(before_self)?
        .as_bytes().iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1)
}
Manually count leading UTF-8 bytes after UTF-8 check

Trial 1

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 43)
test bench_slice_columns_only                ... bench:         481 ns/iter (+/- 47)
test bench_slice_columns_only_for_ascii_text ... bench:          99 ns/iter (+/- 87)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 51)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 2)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 5)

Trial 2

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 18)
test bench_slice_columns_only                ... bench:         481 ns/iter (+/- 295)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 93)
test bench_slice_from                        ... bench:          90 ns/iter (+/- 80)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 0)

Trial 3

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 83)
test bench_slice_columns_only                ... bench:         480 ns/iter (+/- 138)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 94)
test bench_slice_from                        ... bench:          90 ns/iter (+/- 45)
test bench_slice_full                        ... bench:           4 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           5 ns/iter (+/- 3)

And just to show that this gets optimized to the proper iterative algorithm, here's the ASM output from the playground:

#[inline(never)]
fn count_char(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .filter(|&&byte| (byte >> 6) != 0b10)
        .count() + 1
}
Release ASM for above
playground::count_char:
	leaq	str.0(%rip), %rcx
	xorl	%esi, %esi
	leaq	str.0+653(%rip), %r8
	jmp	.LBB0_1
.LBB0_3:
	movzbl	(%rcx), %edx
	movzbl	1(%rcx), %esi
	andb	$-64, %dl
	xorl	%edi, %edi
	cmpb	$-128, %dl
	setne	%dil
	addq	%rax, %rdi
	andb	$-64, %sil
	xorl	%eax, %eax
	cmpb	$-128, %sil
	setne	%al
	addq	%rdi, %rax
	movzbl	2(%rcx), %edx
	andb	$-64, %dl
	xorl	%esi, %esi
	cmpb	$-128, %dl
	setne	%sil
	addq	%rax, %rsi
	addq	$3, %rcx
.LBB0_1:
	movzbl	(%rcx), %edx
	andb	$-64, %dl
	xorl	%eax, %eax
	cmpb	$-128, %dl
	setne	%al
	addq	%rsi, %rax
	incq	%rcx
	cmpq	%r8, %rcx
	jne	.LBB0_3
	incq	%rax
	retq
.Lfunc_end0:

CAD97 (Contributor) commented Dec 19, 2017

As you can see, the majority of the time spent in get_column_utf8 is actually spent checking that we have a valid UTF-8 slice. In the case where we already know this (the LocatedSpan is backed by a &str), we should definitely elide the UTF-8 check, as that makes computing the character position nearly the same cost as the byte position.

The question is whether we want to keep the UTF-8 check when we don't have that guarantee. It's not unsafe -- this operates at the individual byte level.
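
For illustration, the leading-byte count runs fine on a deliberately malformed slice; it just returns a number that means little (a small standalone sketch):

fn count_leading_bytes(bytes: &[u8]) -> usize {
    // Count every byte that is not a continuation byte (10xxxxxx), plus one.
    bytes.iter().filter(|&&byte| (byte >> 6) != 0b10).count() + 1
}

fn main() {
    // 0xFF can never appear in UTF-8 and 0x80 is a stray continuation byte,
    // yet nothing panics: 0xFF and 0x41 are counted, giving 2 + 1 = 3.
    assert_eq!(count_leading_bytes(&[0xFF, 0x80, 0x41]), 3);
}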

CAD97 (Contributor) commented Dec 20, 2017

One more option I decided should be tested: use the Chars iterator to count, rather than doing the leading-byte check manually, but via the unsafe from_utf8_unchecked. This option is unsafe for malformed UTF-8 and is for comparison purposes only.

pub fn get_column_utf8(&self) -> Result<usize, Utf8Error> {
    let before_self = self.get_columns_and_bytes_before().1;
    Ok(unsafe {std::str::from_utf8_unchecked(before_self)}
        .chars()
        .count() + 1)
}
Benchmark trials

Trial 1

running 6 tests
test bench_slice                             ... bench:          95 ns/iter (+/- 45)
test bench_slice_columns_only                ... bench:         114 ns/iter (+/- 98)
test bench_slice_columns_only_for_ascii_text ... bench:          98 ns/iter (+/- 62)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 80)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 3)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 2)

Trial 2

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 22)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 54)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 36)
test bench_slice_from                        ... bench:          88 ns/iter (+/- 5)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 1)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 0)

Trial 3

running 6 tests
test bench_slice                             ... bench:          89 ns/iter (+/- 49)
test bench_slice_columns_only                ... bench:          97 ns/iter (+/- 95)
test bench_slice_columns_only_for_ascii_text ... bench:          97 ns/iter (+/- 98)
test bench_slice_from                        ... bench:          89 ns/iter (+/- 56)
test bench_slice_full                        ... bench:           3 ns/iter (+/- 3)
test bench_slice_to                          ... bench:           6 ns/iter (+/- 1)

Comparable to manually counting non-trailing bytes. If we only cared about the case where we know we have valid UTF-8, I'd recommend this method: it relies just the same on well-formed UTF-8 for a valid answer, but doesn't require this code to know anything about the details of UTF-8. However, it is unsafe with malformed UTF-8.

CAD97 (Contributor) commented Dec 20, 2017

On another reading of rust-lang/rust#37888, I realized that the manual solution is actually the same as the improved .chars().count(). It might be worth figuring out a way to provide a -> Result<usize, Utf8Error> version for unknown utf8-ness, an unsafe -> usize version for unknown utf8-ness, and a safe -> usize version for known utf8-safe slices that uses str::from_utf8_unchecked.

So, stubbed out with no checking:

unsafe trait UTF8Safe {}
unsafe impl UTF8Safe for &str {}
unsafe impl UTF8Safe for nom::CompleteStr {}
unsafe impl<T: UTF8Safe> UTF8Safe for LocatedSpan<T> {}

impl<T> LocatedSpan<T> {
    fn get_byte_column(&self) -> usize {
        self.get_columns_and_bytes_before().0
    }

    fn checked_get_char_column(&self) -> Result<usize, Utf8Error> {
        let before = self.get_columns_and_bytes_before().1;
        Ok(str::from_utf8(before)?
            .chars()
            .count() + 1)
    }

    unsafe fn unchecked_get_char_column(&self) -> usize {
        let before = self.get_columns_and_bytes_before().1;
        str::from_utf8_unchecked(before)
            .chars()
            .count() + 1
    }
}

impl<T> LocatedSpan<T> where T: UTF8Safe {
    fn get_char_column(&self) -> usize {
        unsafe { self.unchecked_get_char_column() }
    }
}

If negative trait bounds were stable and worked the way I think they do, I'd remove LocatedSpan::checked_get_char_column and implement get_char_column for T: !UTF8Safe.

Note that while the current implementation of str::from_utf8_unchecked(bytes).chars().count() happens to be safe for arbitrary bytes, it is not guaranteed to remain so.

CAD97 (Contributor) commented Dec 24, 2017

bytecount has merged a count for characters: llogiq/bytecount#29

Semantics are:

  • Takes a byte slice
  • If the slice is correct UTF-8
    • Returns the number of Unicode code points in the string
  • Else
    • Returns an arbitrary number less than or equal to the length of the slice

we win against naive count from ~80 chars upwards.

I'll make a new release soon.
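
A quick illustration of those semantics (assuming the upcoming release exposes num_chars as in the linked PR):

extern crate bytecount;

fn main() {
    // Valid UTF-8: "naïve" is 6 bytes but 5 code points.
    assert_eq!(bytecount::num_chars("naïve".as_bytes()), 5);

    // Malformed input (truncated 3-byte sequence): no panic, just some
    // number that is at most the length of the slice.
    let broken = [0xE3, 0x83, 0x61];
    assert!(bytecount::num_chars(&broken) <= broken.len());
}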

Veedrac commented Dec 24, 2017

we win against naive count from ~80 chars upwards.

Note that these benchmarks don't bother measuring the fixed mispredict overhead, since it's not relevant to tuning, so the crossover point is going to be slightly later and the loss at short lengths will be larger.

fflorent (Owner) commented Jan 2, 2018

Sorry for having taken so long to react to this. Thanks for your comments and investigations! I will read all of this carefully and answer you very soon.

Florent

fflorent (Owner) commented Jan 2, 2018

I think I get one of your arguments. Indeed, that function can assume the bytes are well-formed UTF-8. Otherwise, in the worst case, it would just return an inconsistent length, as long as it doesn't crash. No need to return a Result<usize, _> then.

So, if I understand the whole topic correctly, between your investigation and the latest work in bytecount, two options are proposed:

  • a naive algorithm that does everything correctly and is fast enough up to ~80 chars (or a bit beyond, according to Veedrac)
  • an optimized algorithm beyond that

I am torn between the naive algorithm and the optimized one from bytecount. Some thoughts:

  • the pros of the naive algorithm:
    • for regular source code, lines exceeding 200 characters tend to be rare
    • this is a function called by the user, and therefore only occasionally, so it may not drastically degrade performance
  • the pros of the optimized (hyper) algorithm:
    • obfuscated / minified source code, as in JavaScript, exists. In production it rarely contains newlines (except for comments), and the column count can easily exceed 100,000.

As a JS developer, I would not neglect that last point. Maybe it is worth offering a feature to choose one solution or the other (if I read the benches correctly, the hyper algorithm is much faster for counting 100,000 chars).
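
A rough sketch of what offering both could look like (method names and the exact bound on T are placeholders only):

impl<T: nom::AsBytes> LocatedSpan<T> {
    /// Naive counting; fine for typical short source lines.
    pub fn naive_get_utf8_column(&self) -> usize {
        bytecount::naive_num_chars(self.get_columns_and_bytes_before().1) + 1
    }

    /// bytecount's "hyper" counting; wins on very long lines (minified JS, etc.).
    pub fn get_utf8_column(&self) -> usize {
        bytecount::num_chars(self.get_columns_and_bytes_before().1) + 1
    }
}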

What do you think? And @CAD97, for your own use case, which solution would you prefer?

Florent

CAD97 (Contributor) commented Jan 6, 2018

I think I get one of your arguments. Indeed, that function can assume the bytes are well-formed UTF-8. Otherwise, in the worst case, it would just return an inconsistent length, as long as it doesn't crash. No need to return a Result<usize, _> then.

👍 You interpreted my information dump correctly 🎉

If we could safely reuse std::Chars::count on arbitrary bytes, I'd recommend that, as it's the simplest option. However, it is not guaranteed to be safe. Since you need an algorithm that works on arbitrary bytes, your options are effectively 1) reimplement bytecount::naive_num_chars and the ASCII handling that comes with it, or 2) use bytecount.

If you pull in bytecount, I think you'd want to at least offer the hyper algorithm. But also keep in mind that the huge speedups come from SIMD, and you'll need to pass on the two SIMD cargo features to your consumers as well. @Veedrac could say better than I can what the performance deltas look like without any SIMD features enabled.
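
Roughly, the passthrough in nom_locate's Cargo.toml would look like this (feature names from memory; double-check them against the bytecount release you depend on, and the version number is only illustrative):

[dependencies]
bytecount = "0.3"

[features]
simd-accel = ["bytecount/simd-accel"]
avx-accel = ["bytecount/avx-accel"]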

I don't really feel comfortable recommending one way or the other. My use case probably expects short lines of at most around 120 characters. It's a small DSL, probably for my own use only, as an experiment in language design more than anything else. I would probably argue to err on the side of favoring shorter lines; human code writers usually favor shorter lines for legibility.


The other point is that we need to consider what IDEs expect. The primary use of line:column information is to be able to navigate to the problem. TL;DR: code point index is (probably) the correct behavior here. For my wonderful sample size of one, here is what IntelliJ IDEA does:

Sample text:
メカジキ
aääAÄÄ

UTF-8 bytes:
E383A1 E382AB E382B8 E382AD 0A
61 C3A4 61 CC88 41 C384 41 CC88 0A

Code points:
[U+30E1 Katakana Letter Me] [U+30AB Katakana Letter Ka] [U+30B8 Katakana Letter Zi] [U+30AD Katakana Letter Ki] [U+000A New Line]
[U+0061 Latin Small Letter A] [U+00E4 Latin Small Letter A with Diaeresis] [U+0061 Latin Small Letter A] [U+0308 Combining Diaeresis] [U+0041 Latin Capital Letter A] [U+00C4 Latin Capital Letter A with Diaeresis] [U+0041 Latin Capital Letter A] [U+0308 Combining Diaeresis] [U+000A New Line]

Columns IDEA reports (caret positions):
[1:1]メ[1:2]カ[1:3]ジ[1:4]キ[1:5]
[2:1]a[2:2]ä[2:3]ä[2:5]A[2:6]Ä[2:7]Ä[2:9]
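
A quick check of the second line above, written with explicit escapes so the combining marks are unambiguous:

fn main() {
    // a, ä (precomposed), a + U+0308, A, Ä (precomposed), A + U+0308
    let line = "a\u{E4}a\u{308}A\u{C4}A\u{308}";
    assert_eq!(line.len(), 12);           // bytes: 61 C3A4 61 CC88 41 C384 41 CC88
    assert_eq!(line.chars().count(), 8);  // code points; IDEA's end-of-line caret is [2:9]
}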

Veedrac commented Jan 7, 2018

@CAD97 bytecount::num_chars is, very roughly speaking, 10x the speed of naive_num_chars over long slices even without explicit SIMD. Not wanting a dependency on SIMD does not necessarily mean num_chars is unsuited.

More information about timings is available at llogiq/bytecount#29, though note that the non-SIMD variant has been improved since then.

fflorent (Owner) commented Jan 7, 2018

Well, I also wonder whether it wouldn't be reasonable to let the user make the choice. If we offer both functions (calling either the naive or the hyper version of num_chars), they could use one or the other depending on the input (long lines expected or not), and we could tell them in the documentation which one they may prefer.

Also, if I am not wrong, bytecount is lightweight enough that we could consider depending on it directly in nom_locate without much regret.

But also keep in mind that the huge speedups come from SIMD, and you'll need to pass on the two SIMD cargo features to your consumers as well

Building lexers/parsers with column support may justify such an optimization.

@CAD97 Are you interested in that work?
