Fix handling of empty matches in iterators in UTF mode #36
The `find_iter` and `captures_iter` functions iterate over the distinct, non-overlapping matches within the subject string.

Normally, this means the iterator will start searching at the end position of the previous match. However, if the previous match was zero-width, we want to advance by an additional character so we don't return the same match again.
In byte mode, this just means incrementing the position by 1. In "UTF" mode, however, we need to advance to the next UTF-8 character boundary.
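As a rough illustration of the advancing rule described above, here is a minimal sketch. The helper name `advance_past_empty_match` and its signature are hypothetical and are not taken from the actual patch; it only assumes the subject is available as a byte slice and that, in UTF mode, the subject is valid UTF-8.

```rust
/// Hypothetical helper (not the code from this PR): given the position of a
/// zero-width match, return the position at which the next search should start.
///
/// In byte mode this is simply `pos + 1`. In UTF mode we additionally skip any
/// UTF-8 continuation bytes (bit pattern 10xxxxxx) so that the result lands on
/// a character boundary.
fn advance_past_empty_match(subject: &[u8], pos: usize, utf: bool) -> usize {
    let mut next = pos + 1;
    if utf {
        // Continuation bytes have their top two bits set to `10`.
        while next < subject.len() && (subject[next] & 0xC0) == 0x80 {
            next += 1;
        }
    }
    next
}
```

The result can exceed `subject.len()` when the empty match was at the very end of the subject, so a caller would be expected to treat that as the end of iteration.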
Note that to determine whether the regex is in UTF mode, we can't simply check the `utf` flag of its `config`, because UTF mode can also be turned on by including `(*UTF)` in the pattern itself. So we need to invoke `pcre2_pattern_info_8` to check which mode is actually used.

(I also considered that, instead of adding a `utf` field to `Regex`, the `Regex::build` function could modify the saved `config`. It's possible that might work but it seems fragile.)
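To make the UTF-mode check above more concrete, here is a small sketch of the final step. It assumes the compiled pattern's effective options were already read with `pcre2_pattern_info_8` using `PCRE2_INFO_ALLOPTIONS`, which reports the compile options as modified by top-level settings such as `(*UTF)`. The function name `is_utf_mode` is hypothetical, and the constant is defined locally to mirror `PCRE2_UTF` from `pcre2.h`; the actual patch may structure this differently.

```rust
/// Local copy of the `PCRE2_UTF` option bit from `pcre2.h`, defined here only
/// to keep the sketch self-contained.
const PCRE2_UTF: u32 = 0x0008_0000;

/// Hypothetical helper: `all_options` is assumed to be the value obtained
/// from `pcre2_pattern_info_8` with `PCRE2_INFO_ALLOPTIONS`.
fn is_utf_mode(all_options: u32) -> bool {
    all_options & PCRE2_UTF != 0
}
```

Checking the effective options rather than the builder's `utf` flag is what lets a pattern beginning with `(*UTF)` be detected even when the flag was never set.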