Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RegexDiff X64] [stephentoub] Factor positive lookaheads better into find optimizations #980

Open
MihuBot opened this issue Feb 3, 2025 · 1 comment

Comments

@MihuBot
Copy link
Owner

MihuBot commented Feb 3, 2025

Job completed in 16 minutes 5 seconds (remote runner delay: 1 minute 38 seconds).

Using arguments: regexdiff -NoPRLink

112 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25 ..." (1964 uses)
[GeneratedRegex("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$|^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])\\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])$", RegexOptions.IgnoreCase)]
  /// <code>RegexOptions.IgnoreCase</code><br/>
  /// Explanation:<br/>
  /// <code>
+   /// ○ Match if at the beginning of the string.<br/>
  /// ○ Match with 2 alternative expressions, atomically.<br/>
  ///     ○ Match a sequence of expressions.<br/>
-   ///         ○ Match if at the beginning of the string.<br/>
  ///         ○ Loop exactly 3 times.<br/>
  ///             ○ 1st capture group.<br/>
  ///                 ○ 2nd capture group.<br/>
  ///                     ○ Match a character in the set [0-5].<br/>
  ///         ○ Match if at the end of the string or if before an ending newline.<br/>
  ///     ○ Match a sequence of expressions.<br/>
-   ///         ○ Match if at the beginning of the string.<br/>
  ///         ○ Loop greedily any number of times.<br/>
  ///             ○ 4th capture group.<br/>
  ///                 ○ 5th capture group.<br/>
                  int startingStackpos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
+                   // Match if at the beginning of the string.
+                   if (pos != 0)
+                   {
+                       UncaptureUntil(0);
+                       return false; // The input didn't match.
+                   }
+                   
                  // Atomic group.
                  {
                      int atomic_stackpos = stackpos;
                          
                          // Branch 0
                          {
-                               // Match if at the beginning of the string.
-                               if (pos != 0)
-                               {
-                                   goto AlternationBranch;
-                               }
-                               
                              // Loop exactly 3 times.
                              //{
                                  startingStackpos = stackpos;
                          
                          // Branch 1
                          {
-                               // Match if at the beginning of the string.
-                               if (pos != 0)
-                               {
-                                   UncaptureUntil(0);
-                                   return false; // The input didn't match.
-                               }
-                               
                              // Loop greedily any number of times.
                              //{
                                  loop_iteration1 = 0;
"^_(?=\\S)([\\s\\S]*?\\S)_(?!_)|^\\*(?=\\S)([ ..." (823 uses)
[GeneratedRegex("^_(?=\\S)([\\s\\S]*?\\S)_(?!_)|^\\*(?=\\S)([\\s\\S]*?\\S)\\*(?!\\*)")]
  /// <code>^_(?=\\S)([\\s\\S]*?\\S)_(?!_)|^\\*(?=\\S)([\\s\\S]*?\\S)\\*(?!\\*)</code><br/>
  /// Explanation:<br/>
  /// <code>
+   /// ○ Match if at the beginning of the string.<br/>
  /// ○ Match with 2 alternative expressions, atomically.<br/>
  ///     ○ Match a sequence of expressions.<br/>
-   ///         ○ Match if at the beginning of the string.<br/>
-   ///         ○ Match '_'.<br/>
+   ///         ○ Match an empty string.<br/>
  ///         ○ Zero-width positive lookahead.<br/>
  ///             ○ Match any character other than a whitespace character.<br/>
  ///         ○ 1st capture group.<br/>
  ///         ○ Zero-width negative lookahead.<br/>
  ///             ○ Match '_'.<br/>
  ///     ○ Match a sequence of expressions.<br/>
-   ///         ○ Match if at the beginning of the string.<br/>
-   ///         ○ Match '*'.<br/>
+   ///         ○ Match an empty string.<br/>
  ///         ○ Zero-width positive lookahead.<br/>
  ///             ○ Match any character other than a whitespace character.<br/>
  ///         ○ 2nd capture group.<br/>
                  int stackpos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
+                   // Match if at the beginning of the string.
+                   if (pos != 0)
+                   {
+                       UncaptureUntil(0);
+                       return false; // The input didn't match.
+                   }
+                   
                  // Atomic group.
                  {
                      int atomic_stackpos = stackpos;
                      
                      // Match with 2 alternative expressions, atomically.
                      //{
-                           int alternation_starting_pos = pos;
-                           int alternation_starting_capturepos = base.Crawlpos();
-                           
-                           // Branch 0
+                           if (slice.IsEmpty)
                          {
-                               // Match if at the beginning of the string.
-                               if (pos != 0)
-                               {
-                                   goto AlternationBranch;
-                               }
-                               
-                               // Match '_'.
-                               if (slice.IsEmpty || slice[0] != '_')
-                               {
-                                   goto AlternationBranch;
-                               }
-                               
-                               // Zero-width positive lookahead.
-                               {
-                                   int positivelookahead_starting_pos = pos;
-                                   
-                                   if (Utilities.s_hasTimeout)
-                                   {
-                                       base.CheckTimeout();
-                                   }
-                                   
-                                   // Match any character other than a whitespace character.
-                                   if ((uint)slice.Length < 2 || char.IsWhiteSpace(slice[1]))
-                                   {
-                                       goto AlternationBranch;
-                                   }
-                                   
-                                   pos = positivelookahead_starting_pos;
-                                   slice = inputSpan.Slice(pos);
-                               }
-                               
-                               // 1st capture group.
-                               //{
-                                   pos++;
-                                   slice = inputSpan.Slice(pos);
-                                   capture_starting_pos = pos;
-                                   
-                                   // Match any character lazily any number of times.
-                                   //{
-                                       lazyloop_pos = pos;
-                                       goto LazyLoopEnd;
-                                       
-                                       LazyLoopBacktrack:
-                                       UncaptureUntil(lazyloop_capturepos);
-                                       if (Utilities.s_hasTimeout)
-                                       {
-                                           base.CheckTimeout();
-                                       }
-                                       
-                                       pos = lazyloop_pos;
-                                       slice = inputSpan.Slice(pos);
-                                       if (slice.IsEmpty || false)
-                                       {
-                                           goto AlternationBranch;
-                                       }
-                                       pos++;
-                                       slice = inputSpan.Slice(pos);
-                                       lazyloop_pos = pos;
-                                       
-                                       LazyLoopEnd:
-                                       lazyloop_capturepos = base.Crawlpos();
-                                   //}
-                                   
-                                   // Match any character other than a whitespace character.
-                                   if (slice.IsEmpty || char.IsWhiteSpace(slice[0]))
-                                   {
-                                       goto LazyLoopBacktrack;
-                                   }
-                                   
-                                   pos++;
-                                   slice = inputSpan.Slice(pos);
-                                   base.Capture(1, capture_starting_pos, pos);
-                                   
-                                   goto CaptureSkipBacktrack;
-                                   
-                                   CaptureBacktrack:
-                                   goto LazyLoopBacktrack;
-                                   
-                                   CaptureSkipBacktrack:;
-                               //}
-                               
-                               // Match '_'.
-                               if (slice.IsEmpty || slice[0] != '_')
-                               {
-                                   goto CaptureBacktrack;
-                               }
-                               
-                               // Zero-width negative lookahead.
-                               {
-                                   int negativelookahead__starting_pos = pos;
-                                   if (Utilities.s_hasTimeout)
-                                   {
-                                       base.CheckTimeout();
-                                   }
-                                   
-                                   // Match '_'.
-                                   if ((uint)slice.Length < 2 || slice[1] != '_')
-                                   {
-                                       goto NegativeLookaroundMatch;
-                                   }
-                                   
-                                   goto CaptureBacktrack;
-                                   
-                                   NegativeLookaroundMatch:
-                                   pos = negativelookahead__starting_pos;
-                                   slice = inputSpan.Slice(pos);
-                               }
-                               
-                               pos++;
-                               slice = inputSpan.Slice(pos);
-                               goto AlternationMatch;
-                               
-                               AlternationBranch:
-                               pos = alternation_starting_pos;
-                               slice = inputSpan.Slice(pos);
-                               UncaptureUntil(alternation_starting_capturepos);
+                               UncaptureUntil(0);
+                               return false; // The input didn't match.
                          }
                          
-                           // Branch 1
+                           switch (slice[0])
                          {
-                               // Match if at the beginning of the string.
-                               if (pos != 0)
-                               {
-                                   UncaptureUntil(0);
-                                   return false; // The input didn't match.
-                               }
-                               
-                               // Match '*'.
-                               if (slice.IsEmpty || slice[0] != '*')
-                               {
-                                   UncaptureUntil(0);
-                                   return false; // The input didn't match.
-                               }
-                               
-                               // Zero-width positive lookahead.
-                               {
-                                   int positivelookahead_starting_pos1 = pos;
+                               case '_':
                                  
-                                   if (Utilities.s_hasTimeout)
+                                   // Zero-width positive lookahead.
                                  {
-                                       base.CheckTimeout();
-                                   }
-                                   
-                                   // Match any character other than a whitespace character.
-                                   if ((uint)slice.Length < 2 || char.IsWhiteSpace(slice[1]))
-                                   {
-                                       UncaptureUntil(0);
-                                       return false; // The input didn't match.
-                                   }
-                                   
-                                   pos = positivelookahead_starting_pos1;
-                                   slice = inputSpan.Slice(pos);
-                               }
-                               
-                               // 2nd capture group.
-                               //{
-                                   pos++;
-                                   slice = inputSpan.Slice(pos);
-                                   capture_starting_pos1 = pos;
-                                   
-                                   // Match any character lazily any number of times.
-                                   //{
-                                       lazyloop_pos1 = pos;
-                                       goto LazyLoopEnd1;
+                                       int positivelookahead_starting_pos = pos;
                                      
-                                       LazyLoopBacktrack1:
-                                       UncaptureUntil(lazyloop_capturepos1);
                                      if (Utilities.s_hasTimeout)
                                      {
                                          base.CheckTimeout();
                                      }
                                      
-                                       pos = lazyloop_pos1;
-                                       slice = inputSpan.Slice(pos);
-                                       if (slice.IsEmpty || false)
+                                       // Match any character other than a whitespace character.
+                                       if ((uint)slice.Length < 2 || char.IsWhiteSpace(slice[1]))
                                      {
                                          UncaptureUntil(0);
                                          return false; // The input didn't match.
                                      }
+                                       
+                                       pos = positivelookahead_starting_pos;
+                                       slice = inputSpan.Slice(pos);
+                                   }
+                                   
+                                   // 1st capture group.
+                                   //{
                                      pos++;
                                      slice = inputSpan.Slice(pos);
-                                       lazyloop_pos1 = pos;
+                                       capture_starting_pos = pos;
                                      
-                                       LazyLoopEnd1:
-                                       lazyloop_capturepos1 = base.Crawlpos();
+                                       // Match any character lazily any number of times.
+                                       //{
+                                           lazyloop_pos = pos;
+                                           goto LazyLoopEnd;
+                                           
+                                           LazyLoopBacktrack:
+                                           UncaptureUntil(lazyloop_capturepos);
+                                           if (Utilities.s_hasTimeout)
+                                           {
+                                               base.CheckTimeout();
+                                           }
+                                           
+                                           pos = lazyloop_pos;
+                                           slice = inputSpan.Slice(pos);
+                                           if (slice.IsEmpty || false)
+                                           {
+                                               UncaptureUntil(0);
+                                               return false; // The input didn't match.
+                                           }
+                                           pos++;
+                                           slice = inputSpan.Slice(pos);
+                                           lazyloop_pos = pos;
+                                           
+                                           LazyLoopEnd:
+                                           lazyloop_capturepos = base.Crawlpos();
+                                       //}
+                                       
+                                       // Match any character other than a whitespace character.
+                                       if (slice.IsEmpty || char.IsWhiteSpace(slice[0]))
+                                       {
+                                           goto LazyLoopBacktrack;
+                                       }
+                                       
+                                       pos++;
+                                       slice = inputSpan.Slice(pos);
+                                       base.Capture(1, capture_starting_pos, pos);
+                                       
+                                       goto CaptureSkipBacktrack;
+                                       
+                                       CaptureBacktrack:
+                                       goto LazyLoopBacktrack;
+                                       
+                                       CaptureSkipBacktrack:;
                                  //}
                                  
-                                   // Match any character other than a whitespace character.
-                                   if (slice.IsEmpty || char.IsWhiteSpace(slice[0]))
+                                   // Match '_'.
+                                   if (slice.IsEmpty || slice[0] != '_')
                                  {
-                                       goto LazyLoopBacktrack1;
+                                       goto CaptureBacktrack;
+                                   }
+                                   
+                                   // Zero-width negative lookahead.
+                                   {
+                                       int negativelookahead__starting_pos = pos;
+                                       if (Utilities.s_hasTimeout)
+                                       {
+                                           base.CheckTimeout();
+                                       }
+                                       
+                                       // Match '_'.
+                                       if ((uint)slice.Length < 2 || slice[1] != '_')
+                                       {
+                                           goto NegativeLookaroundMatch;
+                                       }
+                                       
+                                       goto CaptureBacktrack;
+                                       
+                                       NegativeLookaroundMatch:
+                                       pos = negativelookahead__starting_pos;
+                                       slice = inputSpan.Slice(pos);
                                  }
                                  
                                  pos++;
                                  slice = inputSpan.Slice(pos);
-                                   base.Capture(2, capture_starting_pos1, pos);
+                                   break;
                                  
-                                   goto CaptureSkipBacktrack1;
+                               case '*':
                                  
-                                   CaptureBacktrack1:
-                                   goto LazyLoopBacktrack1;
-                                   
-                                   CaptureSkipBacktrack1:;
-                               //}
-                               
-                               // Match '*'.
-                               if (slice.IsEmpty || slice[0] != '*')
-                               {
-                                   goto CaptureBacktrack1;
-                               }
-                               
-                               // Zero-width negative lookahead.
-                               {
-                                   int negativelookahead__starting_pos1 = pos;
-                                   if (Utilities.s_hasTimeout)
+                                   // Zero-width positive lookahead.
                                  {
-                                       base.CheckTimeout();
+                                       int positivelookahead_starting_pos1 = pos;
+                                       
+                                       if (Utilities.s_hasTimeout)
+                                       {
+                                           base.CheckTimeout();
+                                       }
+                                       
+                                       // Match any character other than a whitespace character.
+                                       if ((uint)slice.Length < 2 || char.IsWhiteSpace(slice[1]))
+                                       {
+                                           UncaptureUntil(0);
+                                           return false; // The input didn't match.
+                                       }
+                                       
+                                       pos = positivelookahead_starting_pos1;
+                                       slice = inputSpan.Slice(pos);
                                  }
                                  
+                                   // 2nd capture group.
+                                   //{
+                                       pos++;
+                                       slice = inputSpan.Slice(pos);
+                                       capture_starting_pos1 = pos;
+                                       
+                                       // Match any character lazily any number of times.
+                                       //{
+                                           lazyloop_pos1 = pos;
+                                           goto LazyLoopEnd1;
+                                           
+                                           LazyLoopBacktrack1:
+                                           UncaptureUntil(lazyloop_capturepos1);
+                                           if (Utilities.s_hasTimeout)
+                                           {
+                                               base.CheckTimeout();
+                                           }
+                                           
+                                           pos = lazyloop_pos1;
+                                           slice = inputSpan.Slice(pos);
+                                           if (slice.IsEmpty || false)
+                                           {
+                                               UncaptureUntil(0);
+                                               return false; // The input didn't match.
+                                           }
+                                           pos++;
+                                           slice = inputSpan.Slice(pos);
+                                           lazyloop_pos1 = pos;
+                                           
+                                           LazyLoopEnd1:
+                                           lazyloop_capturepos1 = base.Crawlpos();
+                                       //}
+                                       
+                                       // Match any character other than a whitespace character.
+                                       if (slice.IsEmpty || char.IsWhiteSpace(slice[0]))
+                                       {
+                                           goto LazyLoopBacktrack1;
+                                       }
+                                       
+                                       pos++;
+                                       slice = inputSpan.Slice(pos);
+                                       base.Capture(2, capture_starting_pos1, pos);
+                                       
+                                       goto CaptureSkipBacktrack1;
+                                       
+                                       CaptureBacktrack1:
+                                       goto LazyLoopBacktrack1;
+                                       
+                                       CaptureSkipBacktrack1:;
+                                   //}
+                                   
                                  // Match '*'.
-                                   if ((uint)slice.Length < 2 || slice[1] != '*')
+                                   if (slice.IsEmpty || slice[0] != '*')
                                  {
-                                       goto NegativeLookaroundMatch1;
+                                       goto CaptureBacktrack1;
                                  }
                                  
-                                   goto CaptureBacktrack1;
+                                   // Zero-width negative lookahead.
+                                   {
+                                       int negativelookahead__starting_pos1 = pos;
+                                       if (Utilities.s_hasTimeout)
+                                       {
+                                           base.CheckTimeout();
+                                       }
+                                       
+                                       // Match '*'.
+                                       if ((uint)slice.Length < 2 || slice[1] != '*')
+                                       {
+                                           goto NegativeLookaroundMatch1;
+                                       }
+                                       
+                                       goto CaptureBacktrack1;
+                                       
+                                       NegativeLookaroundMatch1:
+                                       pos = negativelookahead__starting_pos1;
+                                       slice = inputSpan.Slice(pos);
+                                   }
                                  
-                                   NegativeLookaroundMatch1:
-                                   pos = negativelookahead__starting_pos1;
+                                   pos++;
                                  slice = inputSpan.Slice(pos);
-                               }
-                               
-                               pos++;
-                               slice = inputSpan.Slice(pos);
+                                   break;
+                                   
+                               default:
+                                   UncaptureUntil(0);
+                                   return false; // The input didn't match.
                          }
-                           
-                           AlternationMatch:;
                      //}
                      
                      stackpos = atomic_stackpos;

For more diff examples, see https://gist.github.com/MihuBot/6ea86eec172802ac7975b1b69c126a04

Total bytes of base: 53924088
Total bytes of diff: 53916304
Total bytes of delta: -7784 (-0.01 % of base)
Total relative delta: 15.46
    diff is an improvement.
    relative diff is a regression.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
const string JsonPath = "RegexResults-980.json";
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/EorKPNpA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

Artifacts:

@MihuBot
Copy link
Owner Author

MihuBot commented Feb 3, 2025

@MihaZupan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant