Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trigram: support multi-byte string trigrams; perf improvements #93757

Merged
merged 3 commits into from
Dec 23, 2022

Conversation

jordanlewis
Copy link
Member

@jordanlewis jordanlewis commented Dec 15, 2022

Fixes #93744
Related to #93830

  • Add multi-byte character support
  • Improve performance
name           old time/op    new time/op    delta
Similarity-32    1.72µs ± 0%    0.60µs ± 3%  -64.98%  (p=0.000 n=9+10)

name           old alloc/op   new alloc/op   delta
Similarity-32    1.32kB ± 0%    0.37kB ± 0%  -72.10%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-32      15.0 ± 0%       6.0 ± 0%  -60.00%  (p=0.000 n=10+10)

Release note (sql change): previously, trigrams ignored multi-byte characters from input strings. This is now corrected.

@jordanlewis jordanlewis requested a review from mgartner December 15, 2022 22:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jordanlewis jordanlewis force-pushed the mb-trgrm branch 3 times, most recently from 63dffa7 to f123cc0 Compare December 20, 2022 03:17
@jordanlewis jordanlewis changed the title trigram: support multi-byte string trigrams trigram: support multi-byte string trigrams; perf improvements Dec 20, 2022
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some comments and suggestions from me.

For a more fair comparison, it'd probably be good to add the benchmark in the first commit, keep the unicode fix in the second and run the benchmark to see the impact of the fix, and then the third commit would show improvements relative to the unicode fix. This way we'd also be able to see the impact on the benchmark of the whole PR. (Not that I'm advocating for not supporting multi-byte characters.)

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)


-- commits line 4 at r1:
nit: should this be a bug fix category? Also, maybe s/multi-byte/unicode is more clear?


pkg/util/trigram/trigram.go line 22 at r1 (raw file):

// MakeTrigrams returns the downcased, sorted and de-duplicated trigrams for an
// input string. Non-alphanumeric characters are treated as word boundaries.

nit: the second sentence here probably needs an update.


pkg/util/trigram/trigram.go line 37 at r3 (raw file):

	start := -1
	oneByteCharsOnly := true
	// This loop would be more ergonomic with strings.FieldsFunc, but doing so

nit: consider adding one more sentence before the current one to explain the high level goal of the loop. It'd probably be good to mention start and end values in that sentence.


pkg/util/trigram/trigram.go line 52 at r3 (raw file):

				start = end
			}
			if r >= utf8.RuneSelf {

nit: it might be faster to replace the conditional with a boolean operation, i.e. something like oneByteCharsOnly = oneBytesCharsOnly && r < utf8.RuneSelf.


pkg/util/trigram/trigram.go line 92 at r3 (raw file):

func generateTrigrams(appendTo []string, word string, pad bool, onlyOneByteChars bool) []string {
	if pad {
		var sb strings.Builder

Should we reuse the builder across generateTrigrams calls?

Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated benchmark to show the end to end benchmark change, before and after the mb change. I probably should have made 2 PRs here rather than combine them as the perf improvements were not related to worries about making things slower with multibyte, it's just that it's slow in general (see linked issue filed by @chrisseto).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)


pkg/util/trigram/trigram.go line 92 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Should we reuse the builder across generateTrigrams calls?

Reusing a builder doesn't do anything unless you are unsafely sharing the backing string memory - it's just a wrapper around a byte buffer.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @mgartner)


pkg/util/trigram/trigram.go line 92 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Reusing a builder doesn't do anything unless you are unsafely sharing the backing string memory - it's just a wrapper around a byte buffer.

Oh, I was thinking about bytes.Buffer when I left this comment, nvm.


pkg/util/trigram/trigram_test.go line 104 at r5 (raw file):

			}
		})
		fmt.Println(sink)

nit: this seems like a leftover.

Release note (bug fix): previously, trigrams ignored unicode
(multi-byte) characters from input strings. This is now corrected.
```
name           old time/op    new time/op    delta
Similarity-10    1.05µs ± 0%    0.77µs ± 0%  -26.81%  (p=0.000 n=7+10)

name           old alloc/op   new alloc/op   delta
Similarity-10      824B ± 0%      376B ± 0%  -54.37%  (p=0.000 n=10+10)

name           old allocs/op  new allocs/op  delta
Similarity-10      13.0 ± 0%       7.0 ± 0%  -46.15%  (p=0.000 n=10+10)
```

Release note (sql change): improve the performance of trigram operations
@jordanlewis
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 23, 2022

Build failed:

@jordanlewis
Copy link
Member Author

Flake (#94197)

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 23, 2022

Build succeeded:

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

Reviewed 4 of 4 files at r7, 4 of 4 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: trigrams don't properly include multi-byte characters
4 participants