Apply Regex starting loop optimization to non-atomic loops as well #35936

Merged 2 commits on May 9, 2020
@@ -310,18 +310,18 @@ internal RegexNode FinalOptimize()
// to implementations that don't support backtracking.
EliminateEndingBacktracking(rootNode.Child(0), DefaultMaxRecursionDepth);

- // Optimization: unnecessary re-processing of atomic starting groups.
- // If an expression is guaranteed to begin with a single-character infinite atomic group that isn't part of an alternation (in which case it
+ // Optimization: unnecessary re-processing of starting loops.
+ // If an expression is guaranteed to begin with a single-character unbounded loop that isn't part of an alternation (in which case it
// wouldn't be guaranteed to be at the beginning) or a capture (in which case a back reference could be influenced by its length), then we
// can update the tree with a temporary node to indicate that the implementation should use that node's ending position in the input text
// as the next starting position at which to start the next match. This avoids redoing matches we've already performed, e.g. matching
// "\w+@dot.net" against "is this a valid address@dot.net", the \w+ will initially match the "is" and then will fail to match the "@".
- // Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". We limit ourselves to
- // one/set atomic loops with a min iteration count of 1 so that we know we'll get something in exchange for the extra overhead of storing
- // the updated position. For functional correctness we can only consider infinite atomic loops, as to be able to start at the end of the
- // loop we need the loop to have consumed all possible matches; otherwise, you could end up with a pattern like "a{1,3}b" matching
- // against "aaaabc", which should match, but if we pre-emptively stop consuming after the first three a's and re-start from that position,
- // we'll end up failing the match even though it should have succeeded.
+ // Rather than bumping the scan loop by 1 and trying again to match at the "s", we can instead start at the " ". For functional correctness
+ // we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
+ // otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
+ // after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
+ // apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
+ // of a single starting position.
{
RegexNode node = rootNode.Child(0); // skip implicit root capture node
while (true)
@@ -333,9 +333,12 @@ internal RegexNode FinalOptimize()
node = node.Child(0);
continue;

- case Oneloopatomic when node.M > 0 && node.N == int.MaxValue:
- case Notoneloopatomic when node.M > 0 && node.N == int.MaxValue:
- case Setloopatomic when node.M > 0 && node.N == int.MaxValue:
+ case Oneloop when node.N == int.MaxValue:
+ case Oneloopatomic when node.N == int.MaxValue:
+ case Notoneloop when node.N == int.MaxValue:
+ case Notoneloopatomic when node.N == int.MaxValue:
+ case Setloop when node.N == int.MaxValue:
+ case Setloopatomic when node.N == int.MaxValue:
RegexNode? parent = node.Next;
if (parent != null && parent.Type == Concatenate)
{
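
The reasoning in the comment above can be illustrated with a toy scanner. This is a minimal sketch, not the RegexInterpreter or RegexCompiler code: `StartingLoopSketch`, `Find`, and the `skipToLoopEnd` flag are hypothetical names, and the pattern shape is restricted to a character-class loop with at least one iteration followed by a literal. The sample input is chosen for illustration only. The sketch shows why backtracking inside one starting position does not invalidate the skip: every position the loop could give back to has already been tried before the scan moves on.

```csharp
using System;

static class StartingLoopSketch
{
    // Hypothetical toy matcher for patterns shaped like <class>+<literal>, e.g. \w+@
    // Returns the match index (or -1) and the number of starting positions attempted.
    public static (int Index, int Attempts) Find(
        string input, Func<char, bool> inClass, string literal, bool skipToLoopEnd)
    {
        int attempts = 0;
        int start = 0;
        while (start < input.Length)
        {
            attempts++;

            // Greedily consume the leading unbounded loop.
            int end = start;
            while (end < input.Length && inClass(input[end])) end++;

            // Backtrack within this single starting position: give characters back
            // one at a time, always keeping at least one loop iteration.
            for (int pos = end; pos > start; pos--)
            {
                if (pos + literal.Length <= input.Length &&
                    string.CompareOrdinal(input, pos, literal, 0, literal.Length) == 0)
                {
                    return (start, attempts);
                }
            }

            // No match from 'start'. Any start inside (start, end) would make the loop
            // stop at the same 'end' and retry a subset of the positions tried above,
            // so it is safe to resume at 'end' instead of 'start + 1'.
            start = (skipToLoopEnd && end > start) ? end : start + 1;
        }
        return (-1, attempts);
    }

    static void Main()
    {
        Func<char, bool> word = c => char.IsLetterOrDigit(c) || c == '_';
        string input = "is this a valid address@dot.net"; // assumed sample input

        var plain = Find(input, word, "@", skipToLoopEnd: false);
        var skipping = Find(input, word, "@", skipToLoopEnd: true);

        Console.WriteLine($"bump by 1:   index {plain.Index}, attempts {plain.Attempts}");
        Console.WriteLine($"skip to end: index {skipping.Index}, attempts {skipping.Attempts}");
    }
}
```

Both modes find the same match; the skipping mode simply attempts fewer starting positions. A bounded loop such as `a{1,3}` would break the argument in the closing comment, because a later start could reach characters the earlier attempt never consumed.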
@@ -124,6 +124,8 @@ public static IEnumerable<object[]> Match_Basic_TestData()
yield return new object[] { @"\w+(?<!a)", "aa", RegexOptions.None, 0, 2, false, string.Empty };
yield return new object[] { @"(?>\w+)(?<!a)", "a", RegexOptions.None, 0, 1, false, string.Empty };
yield return new object[] { @"(?>\w+)(?<!a)", "aa", RegexOptions.None, 0, 2, false, string.Empty };
+ yield return new object[] { @".+a", "baa", RegexOptions.None, 0, 3, true, "baa" };
+ yield return new object[] { @"[ab]+a", "cacbaac", RegexOptions.None, 0, 7, true, "baa" };

// Using beginning/end of string chars \A, \Z: Actual - "\\Aaaa\\w+zzz\\Z"
yield return new object[] { @"\Aaaa\w+zzz\Z", "aaaasdfajsdlfjzzz", RegexOptions.IgnoreCase, 0, 17, true, "aaaasdfajsdlfjzzz" };
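
The two added rows cover non-atomic starting loops that only match after giving characters back. The following is a small standalone check of the same patterns through the public `Regex.Match` API, plus the `a{1,3}b` case from the source comment. `StartingLoopCases` and `Describe` are hypothetical helper names for this sketch, and it does not use the test project's own harness.

```csharp
using System;
using System.Text.RegularExpressions;

static class StartingLoopCases
{
    // Formats a match as "index:value", or "no match" on failure.
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? $"{m.Index}:{m.Value}" : "no match";
    }

    static void Main()
    {
        // .+ first consumes "baa", then backtracks one character so the final 'a' can match.
        Console.WriteLine(Describe(@".+a", "baa"));        // 0:baa

        // The attempt starting at the first 'a' fails; the match comes from the later "baa" run.
        Console.WriteLine(Describe(@"[ab]+a", "cacbaac")); // 3:baa

        // Bounded loop: the match must start at the second 'a', inside the run the
        // first attempt consumed, which is why bounded loops are excluded.
        Console.WriteLine(Describe(@"a{1,3}b", "aaaabc")); // 1:aaab
    }
}
```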