Improve BoundedBreakIteratorScanner fragmentation algorithm (#73785) #74898

Merged
merged 1 commit into from
Jul 5, 2021

Conversation

jimczi
Contributor

@jimczi jimczi commented Jul 5, 2021

Backport of #73785

…73785)

The current approach begins by finding the nearest preceding and following boundaries, then greedily expands the following boundary for as long as the problem restriction (the fragment size) is respected. This is fine asymptotically, but BreakIterator, which is used to find each boundary, is sometimes expensive.

The new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing the start boundary and the offset, how many characters are left before the resulting length reaches the fragment size). If this scan returns the start boundary, it is impossible to satisfy the problem restriction, and we take the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], we start from targetEndOffset + 1).
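
For illustration, the new approach described above can be sketched roughly as follows. This is a minimal standalone sketch and not the code from the commit: the class `FragmentBoundsSketch`, the method `fragmentBounds`, and the use of a sentence-level `java.text.BreakIterator` are assumptions made for the example. Only `targetEndOffset` and the general idea come from the description.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class FragmentBoundsSketch {

    // Returns {start, end} of the fragment containing offset, trying to keep
    // end - start <= maxLen with as few BreakIterator calls as possible.
    static int[] fragmentBounds(String text, int offset, int maxLen) {
        // Assumption: sentence boundaries; any BreakIterator would do.
        BreakIterator breaks = BreakIterator.getSentenceInstance(Locale.ROOT);
        breaks.setText(text);

        // Start boundary: last boundary before offset, or 0 if there is none.
        int start = Math.max(breaks.preceding(offset), 0);

        // Last position the fragment may reach without exceeding maxLen.
        long targetEndOffset = (long) start + maxLen;

        int end;
        if (targetEndOffset + 1 > text.length()) {
            // The rest of the text fits entirely within maxLen.
            end = text.length();
        } else {
            // One backward scan: last boundary at or before targetEndOffset.
            end = breaks.preceding((int) targetEndOffset + 1);
        }

        if (end == start) {
            // No boundary within maxLen, so the restriction cannot be met.
            // [offset, targetEndOffset] was already scanned, so search after it.
            end = breaks.following((int) targetEndOffset);
        }
        return new int[] {start, end};
    }
}
```

Under these assumptions a call performs at most three boundary lookups regardless of how many boundaries fall inside the fragment, whereas the greedy expansion needs one lookup per boundary it steps over.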
@jimczi jimczi added the backport label Jul 5, 2021
@jimczi jimczi merged commit 9efc37e into elastic:7.x Jul 5, 2021
@jimczi jimczi deleted the pr/73785_backport branch July 5, 2021 10:57
ywelsch added a commit that referenced this pull request Jul 13, 2021
@ywelsch ywelsch removed the v7.15.0 label Jul 13, 2021
4 participants