-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
trigram: support multi-byte string trigrams; perf improvements #93757
Conversation
63dffa7
to
f123cc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some comments and suggestions from me.
For a more fair comparison, it'd probably be good to add the benchmark in the first commit, keep the unicode fix in the second and run the benchmark to see the impact of the fix, and then the third commit would show improvements relative to the unicode fix. This way we'd also be able to see the impact on the benchmark of the whole PR. (Not that I'm advocating for not supporting multi-byte characters.)
Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)
-- commits
line 4 at r1:
nit: should this be a bug fix
category? Also, maybe s/multi-byte/unicode
is more clear?
pkg/util/trigram/trigram.go
line 22 at r1 (raw file):
// MakeTrigrams returns the downcased, sorted and de-duplicated trigrams for an // input string. Non-alphanumeric characters are treated as word boundaries.
nit: the second sentence here probably needs an update.
pkg/util/trigram/trigram.go
line 37 at r3 (raw file):
start := -1 oneByteCharsOnly := true // This loop would be more ergonomic with strings.FieldsFunc, but doing so
nit: consider adding one more sentence before the current one to explain the high level goal of the loop. It'd probably be good to mention start
and end
values in that sentence.
pkg/util/trigram/trigram.go
line 52 at r3 (raw file):
start = end } if r >= utf8.RuneSelf {
nit: it might be faster to replace the conditional with a boolean operation, i.e. something like oneByteCharsOnly = oneBytesCharsOnly && r < utf8.RuneSelf
.
pkg/util/trigram/trigram.go
line 92 at r3 (raw file):
func generateTrigrams(appendTo []string, word string, pad bool, onlyOneByteChars bool) []string { if pad { var sb strings.Builder
Should we reuse the builder across generateTrigrams
calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated benchmark to show the end to end benchmark change, before and after the mb change. I probably should have made 2 PRs here rather than combine them as the perf improvements were not related to worries about making things slower with multibyte, it's just that it's slow in general (see linked issue filed by @chrisseto).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)
pkg/util/trigram/trigram.go
line 92 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we reuse the builder across
generateTrigrams
calls?
Reusing a builder doesn't do anything unless you are unsafely sharing the backing string memory - it's just a wrapper around a byte buffer.
f123cc0
to
7a27bb3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)
pkg/util/trigram/trigram.go
line 92 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Reusing a builder doesn't do anything unless you are unsafely sharing the backing string memory - it's just a wrapper around a byte buffer.
Oh, I was thinking about bytes.Buffer
when I left this comment, nvm.
pkg/util/trigram/trigram_test.go
line 104 at r5 (raw file):
} }) fmt.Println(sink)
nit: this seems like a leftover.
Release note: None
Release note (bug fix): previously, trigrams ignored unicode (multi-byte) characters from input strings. This is now corrected.
``` name old time/op new time/op delta Similarity-10 1.05µs ± 0% 0.77µs ± 0% -26.81% (p=0.000 n=7+10) name old alloc/op new alloc/op delta Similarity-10 824B ± 0% 376B ± 0% -54.37% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Similarity-10 13.0 ± 0% 7.0 ± 0% -46.15% (p=0.000 n=10+10) ``` Release note (sql change): improve the performance of trigram operations
7a27bb3
to
e4abe68
Compare
bors r+ |
Build failed: |
Flake (#94197) bors r+ |
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this!
Reviewed 4 of 4 files at r7, 4 of 4 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale)
Fixes #93744
Related to #93830
Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.