Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Factor positive lookaheads better into find optimizations #112107

Merged
merged 2 commits into from
Feb 5, 2025

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Feb 3, 2025

A positive lookahead at the start of a pattern can be used for determining find optimizations even when the non-zero-width portions of the pattern aren't. This helps particularly in cases where the positive lookahead contains an anchor or a literal.

Also extends the existing alternation reduction optimization to factor out anchors that begin every branch of an alternation.

As an example of how this applies, this is one of the expressions in our list:

(?=^\bset\b.*;\s*\bselect\b|^\bselect\b).*?\s+from\s+(?:[\s\(\[`\"]*([^,;\[\s\]\(\)`\"\.]*)[\s\)\]`\"]*)(?:\.[\s\(\[`\"]*([^,;\[\s\]\(\)`\"\.]*)[\s\)\]`\"]*)*

Note that the pattern begins with a positive lookahead and that what comes after it isn't particularly searchable. This leads to the following generated TryFindNextPossibleStartingPosition routine:

                private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                {
                    int pos = base.runtextpos;
                    
                    // Any possible match is at least 6 characters.
                    if (pos <= inputSpan.Length - 6)
                    {
                        return true;
                    }
                    
                    // No match found.
                    base.runtextpos = inputSpan.Length;
                    return false;
                }

which effectively always just returns true (as long as the input is long enough), which means we end up trying to match at most positions.

Now with the change, we get this:

                private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                {
                    int pos = base.runtextpos;
                    
                    // Any possible match is at least 6 characters.
                    if (pos <= inputSpan.Length - 6)
                    {
                        // The pattern leads with a beginning (\A) anchor.
                        if (pos == 0)
                        {
                            return true;
                        }
                    }
                    
                    // No match found.
                    base.runtextpos = inputSpan.Length;
                    return false;
                }

Note the new pos == 0 check. That happens because we will now factor in the ^ from the positive lookahead's alternation.

A positive lookahead at the start of a pattern can be used for determining find optimizations even when the non-zero-width portions of the pattern aren't. This helps particularly in cases where the positive lookahead contains an anchor or a literal.

Also extends the existing alternation reduction optimization to factor out anchors that begin every branch of an alternation.
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@stephentoub stephentoub requested a review from Copilot February 3, 2025 20:42

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Tip: If you use Visual Studio Code, you can request a review from Copilot before you push from the "Source Control" tab. Learn more

@stephentoub stephentoub merged commit 2faef6d into dotnet:main Feb 5, 2025
85 checks passed
@stephentoub stephentoub deleted the updatelookaroundopt branch February 5, 2025 22:18
grendello added a commit to grendello/runtime that referenced this pull request Feb 6, 2025
* main: (23 commits)
  add important remarks to NrbfDecoder (dotnet#111286)
  docs: fix spelling grammar and missing words in clr-code-guide.md (dotnet#112222)
  Consider type declaration order in MethodImpls (dotnet#111998)
  Add a feature flag to not use GVM in Linq Select (dotnet#109978)
  [cDAC] Implement ISOSDacInterface::GetMethodDescPtrFromIp (dotnet#110755)
  Restructure JSImport/JSExport generators to share more code and utilize more Microsoft.Interop.SourceGeneration shared code (dotnet#107769)
  Add more detailed explanations to control-flow RegexOpcode values (dotnet#112170)
  Add repo-specific condition to labeling workflows (dotnet#112169)
  Fix bad assembly when a nested exported type is marked via link.xml (dotnet#107945)
  Make `CalculateAssemblyAction` virtual. (dotnet#112154)
  JIT: Enable reusing profile-aware DFS trees between phases (dotnet#112198)
  Add support for LDAPTLS_CACERTDIR \ TrustedCertificateDirectory (dotnet#111877)
  JIT: Support custom `ClassLayout` instances with GC pointers in them (dotnet#112064)
  Factor positive lookaheads better into find optimizations (dotnet#112107)
  Add ImmutableCollectionsMarshal.AsMemory (dotnet#112177)
  [mono] ILStrip write directly to the output filestream (dotnet#112142)
  Allow the NativeAOT runtime pack to be specified as the ILC runtime package (dotnet#111876)
  JIT: some reworking for conditional escape analysis (dotnet#112194)
  Replace HELPER_METHOD_FRAME with DynamicHelperFrame in patchpoints (dotnet#112025)
  [Android] Decouple runtime initialization and entry point execution for Android sample (dotnet#111742)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants