Fix handling of empty matches in iterators in UTF mode #36

bemoody · 2023-11-08T03:55:58Z

The find_iter and captures_iter functions iterate over the distinct, non-overlapping matches within the subject string.

Normally, this means the iterator will start searching at the end position of the previous match. However, if the previous match was zero-width, we want to advance by an additional character so we don't return the same match again.

In byte mode, this just means incrementing the position by 1. In "UTF" mode, however, we need to advance to the next UTF-8 character boundary.
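As a rough illustration of the two advancing rules, here is a minimal sketch using only the standard library. The function name `next_start_after_empty` and its signature are made up for this example and are not part of the crate. It assumes that in UTF mode the subject is valid UTF-8 and that `last_end` already sits on a character boundary.

```rust
/// Hypothetical helper (not the crate's API): return the offset at which
/// the next search should start after an empty match ending at `last_end`.
///
/// A result greater than `subject.len()` means there is nothing left to
/// search and the iterator should stop.
fn next_start_after_empty(subject: &[u8], last_end: usize, utf: bool) -> usize {
    // Byte mode: stepping one byte forward is always enough.
    let mut next = last_end + 1;
    if utf {
        // UTF mode: also skip any continuation bytes (0b10xxxxxx) so that
        // `next` lands on the start of the following character.
        while next < subject.len() && (subject[next] & 0xC0) == 0x80 {
            next += 1;
        }
    }
    next
}
```

For the bytes of "aé€b" (7 bytes), an empty match at offset 1 resumes at offset 3 in UTF mode, but at offset 2 in byte mode, which is in the middle of "é".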

Note that to determine whether the regex is in UTF mode, we can't simply check the utf flag of its config, because UTF mode can also be turned on by including (*UTF) in the pattern itself. So we need to invoke pcre2_pattern_info_8 to check which mode is actually used.
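Once the effective options word has been read from the compiled pattern (via pcre2_pattern_info_8 with PCRE2_INFO_ALLOPTIONS), the check itself is a single bit test. The sketch below leaves out the FFI call and only shows that test. The helper name `uses_utf_mode` is hypothetical, and the constant is defined locally to mirror the value of PCRE2_UTF in pcre2.h rather than imported from any binding crate.

```rust
/// Local copy of the PCRE2_UTF option bit as defined in pcre2.h.
/// Real code would take this from the FFI bindings instead.
const PCRE2_UTF: u32 = 0x0008_0000;

/// Hypothetical helper: `all_options` is the value reported for
/// PCRE2_INFO_ALLOPTIONS, which includes options set inside the pattern
/// itself, such as a leading (*UTF).
fn uses_utf_mode(all_options: u32) -> bool {
    (all_options & PCRE2_UTF) != 0
}
```

Because the options word reflects what the compiled pattern actually uses, this also covers the case where the config's utf flag is off but the pattern begins with (*UTF).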

(I also considered that, instead of adding a utf field to Regex, the Regex::build function could modify the saved config. It's possible that might work but it seems fragile.)

Following an empty match, the iterator (Matches or CaptureMatches) advances the last_end position so as not to return the same match twice. However, if the regex uses UTF mode, the position passed to find_at_with_match_data or captures_read_at is required to be a UTF-8 character boundary, so last_end must be advanced by a whole UTF-8 character, not just one byte. Determining whether the regex is using UTF mode requires querying PCRE2_INFO_ALLOPTIONS; checking the config is not enough.


Benjamin Moody added 2 commits November 7, 2023 22:48

tests: check iterating over empty matches in UTF-8 text

fb07086

These tests will fail if the iterator does not correctly advance the end position (by either a byte or a whole character, as appropriate) following an empty match.
