Improve BoundedBreakIteratorScanner fragmentation algorithm (#73785) #74898

jimczi · 2021-07-05T09:09:58Z

Backport of #73785

…73785) The current approach begins by finding the nearest preceding and following boundaries, then greedily expands the following boundary for as long as the result still respects the problem restriction. This is fine asymptotically, but BreakIterator, which is used to find each boundary, is sometimes expensive. The new approach maximizes the after boundary directly by scanning for the last boundary preceding the position at which the restriction would be violated (i.e. knowing the start boundary and the offset, how many characters are left before the resulting length reaches the fragment size). If this scan finds the start boundary, it is impossible to satisfy the restriction, and we take the first boundary following the offset instead (or better, since we have already scanned [offset, targetEndOffset], we start from targetEndOffset + 1).
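The scan described above can be sketched with the JDK's `java.text.BreakIterator`. This is a minimal illustration under stated assumptions, not the code from this PR: the class `FragmentBoundsSketch`, the method `fragmentBounds`, and the `maxLen` parameter are hypothetical names, a sentence iterator is assumed, and `targetEndOffset` is assumed to mean "start boundary plus fragment size, but never before the offset".

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; names and the exact targetEndOffset formula are assumptions.
class FragmentBoundsSketch {

    /**
     * Returns {start, end} for a fragment containing offset.
     * Assumes 0 <= offset < text.length() and maxLen > 0.
     */
    static int[] fragmentBounds(String text, int offset, int maxLen) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ROOT);
        boundaries.setText(text);
        int textEnd = text.length();

        // Nearest boundary at or before offset (preceding() is strict, hence + 1).
        int start = boundaries.preceding(offset + 1);

        // Last position the fragment may reach without exceeding maxLen,
        // clamped so that it never falls before offset.
        long targetEndOffset = Math.max((long) offset, (long) start + maxLen);

        int end;
        if (targetEndOffset >= textEnd) {
            // The rest of the text fits inside the fragment.
            end = textEnd;
        } else {
            // Last boundary at or before targetEndOffset: one scan instead of
            // repeatedly expanding the end boundary.
            end = boundaries.preceding((int) targetEndOffset + 1);
            if (end == start) {
                // No boundary in (start, targetEndOffset]: the size limit cannot
                // be met, so take the first boundary after the scanned range.
                end = boundaries.following((int) targetEndOffset);
            }
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "One. Two. Three.";
        int[] bounds = fragmentBounds(text, 6, 6);
        System.out.println(text.substring(bounds[0], bounds[1]));
    }
}
```

In this sketch each call makes at most three `BreakIterator` lookups, regardless of how many boundaries lie between the offset and the size limit, which is the saving the description is after when a single boundary lookup is expensive.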


jimczi added the backport label Jul 5, 2021

elasticsearchmachine added the v7.15.0 label Jul 5, 2021

jimczi merged commit 9efc37e into elastic:7.x Jul 5, 2021

jimczi deleted the pr/73785_backport branch July 5, 2021 10:57

ywelsch added a commit that referenced this pull request Jul 13, 2021

Revert "Improve BoundedBreakIteratorScanner fragmentation algorithm (#…

e97c4ce

…73785) (#74898)" This reverts commit 9efc37e.

ywelsch removed the v7.15.0 label Jul 13, 2021
