Fix handling of empty matches in iterators in UTF mode #36

Open
wants to merge 2 commits into
base: master

@bemoody bemoody commented Nov 8, 2023

The find_iter and captures_iter functions iterate over the distinct, non-overlapping matches within the subject string.

Normally, this means the iterator will start searching at the end position of the previous match. However, if the previous match was zero-width, we want to advance by an additional character so we don't return the same match again.

In byte mode, this just means incrementing the position by 1. In "UTF" mode, however, we need to advance to the next UTF-8 character boundary.
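As a rough illustration of the advance step (not the code from this PR), the sketch below uses two hypothetical helpers, `advance_after_empty` and `empty_match_positions`. The second one stands in for a regex that matches the empty string at every valid start position, so the example does not depend on PCRE2 itself. It assumes the subject is valid UTF-8 when `utf` is true.

```rust
/// Hypothetical helper: returns the position at which to resume searching
/// after an empty match at `pos`.
fn advance_after_empty(subject: &[u8], pos: usize, utf: bool) -> usize {
    // Byte mode (or already at the end): step forward by exactly one byte.
    if !utf || pos >= subject.len() {
        return pos + 1;
    }
    // UTF mode: skip the lead byte, then any continuation bytes (0b10xxxxxx),
    // so that the result lands on the next character boundary.
    let mut next = pos + 1;
    while next < subject.len() && (subject[next] & 0xC0) == 0x80 {
        next += 1;
    }
    next
}

/// Hypothetical stand-in for an iterator over a pattern that matches the
/// empty string everywhere: collects the start offset of each empty match.
fn empty_match_positions(subject: &[u8], utf: bool) -> Vec<usize> {
    let mut positions = Vec::new();
    let mut pos = 0;
    while pos <= subject.len() {
        positions.push(pos);
        pos = advance_after_empty(subject, pos, utf);
    }
    positions
}
```

For the subject `"aé!"` (four bytes, because `é` takes two), byte mode visits every offset from 0 to 4, while UTF mode skips offset 2, which falls in the middle of `é`.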

Note that to determine whether the regex is in UTF mode, we can't simply check the utf flag of its config, because UTF mode can also be turned on by including (*UTF) in the pattern itself. So we need to invoke pcre2_pattern_info_8 to check which mode is actually used.
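The check itself comes down to testing one bit in the options word that PCRE2 reports for the compiled pattern. The sketch below shows only that bit test. `uses_utf_mode` is a hypothetical name, and the `all_options` argument is assumed to be the value already obtained from `pcre2_pattern_info_8` with `PCRE2_INFO_ALLOPTIONS`. The FFI call is left out so the example stays self-contained.

```rust
/// Value of the PCRE2_UTF option bit as defined in pcre2.h.
const PCRE2_UTF: u32 = 0x0008_0000;

/// Hypothetical helper: `all_options` is assumed to be the options word
/// reported for the compiled pattern via PCRE2_INFO_ALLOPTIONS, which
/// reflects both the compile-time flags and a leading (*UTF) in the pattern.
fn uses_utf_mode(all_options: u32) -> bool {
    (all_options & PCRE2_UTF) != 0
}
```

Computing this once when the regex is built and storing the result avoids repeating the query on every iterator step.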

(I also considered that, instead of adding a utf field to Regex, the Regex::build function could modify the saved config. It's possible that might work but it seems fragile.)

Benjamin Moody added 2 commits November 7, 2023 22:48
Following an empty match, the iterator (Matches or CaptureMatches)
advances the last_end position so as not to return the same match
twice.

However, if the regex uses UTF mode, the position passed to
find_at_with_match_data or captures_read_at is required to be a UTF-8
character boundary, so last_end must be advanced by a whole UTF-8
character, not just one byte.

Determining whether or not the regex is using UTF mode requires
PCRE2_INFO_ALLOPTIONS (checking the config is not enough).

These tests will fail if the iterator does not correctly advance the
end position (by either a byte or a whole character, as appropriate)
following an empty match.