Optimize AppendTokens() by adding Tokenizer::prefixUntil() #2003

eduard-bagdasaryan · 2025-02-25T11:50:09Z

This change removes expensive CharacterSet negation and copying on every
AppendTokens() call.

Also simplified complex "empty haystack" and "empty needle" conditions
after naming buf_.substr() return value. Those conditions are mutually
exclusive (in npos cases) but earlier code did not relay that fact well.

No functionality changes expected outside of level-8 debugging messages.
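
To make the cost argument concrete, here is a minimal sketch, not Squid code. `MiniCharSet` is a hypothetical stand-in for CharacterSet, and the free functions `prefix()` and `prefixUntil()` only imitate the idea behind the Tokenizer methods; their return conventions are assumptions. The point is that the `complement()` route builds and copies a whole new set per call, while a delimiter-based search reuses the caller's set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for CharacterSet: one flag per octet value.
struct MiniCharSet {
    std::array<bool, 256> member{};

    explicit MiniCharSet(std::string_view chars) {
        for (const unsigned char c : chars)
            member[c] = true;
    }

    bool has(const char c) const { return member[static_cast<unsigned char>(c)]; }

    // Builds and returns a whole new set: 256 negations plus a copy.
    MiniCharSet complement() const {
        MiniCharSet result("");
        for (std::size_t i = 0; i < member.size(); ++i)
            result.member[i] = !member[i];
        return result;
    }
};

// Longest leading run of characters that ARE in tokenChars.
std::string_view prefix(std::string_view input, const MiniCharSet &tokenChars) {
    std::size_t len = 0;
    while (len < input.size() && tokenChars.has(input[len]))
        ++len;
    return input.substr(0, len);
}

// Longest leading run of characters that are NOT in delimiters.
// No complement set is needed; the caller's set is used as is.
std::string_view prefixUntil(std::string_view input, const MiniCharSet &delimiters) {
    std::size_t len = 0;
    while (len < input.size() && !delimiters.has(input[len]))
        ++len;
    return input.substr(0, len);
}
```

In this sketch, `prefix(input, delimiters.complement())` and `prefixUntil(input, delimiters)` extract the same characters, but only the former pays for set negation and copying on every call.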

The conditions are mutually exclusive, but that fact was not clear in the official code (because that code lacked `str`). Also adjusted "no prefix" debugs() wording to clarify that use case description and to improve symmetry with "empty haystack" use case. This change also avoids "insufficient input" phrasing that should be reserved for methods throwing InsufficientInput.

eduard-bagdasaryan · 2025-02-25T11:51:16Z

This PR implements a PR1896 suggestion.

rousskov

Thank you.

rousskov · 2025-02-25T14:12:45Z

src/parser/Tokenizer.cc

@@ -104,7 +111,7 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChar

    SBuf result;

-    if (!prefix(result, tokenChars, limit))
+    if (!prefix_(result, limit, findFirstNotOf, tokenChars))


I have tried to replace the new { findFirstOf, findFirstNotOf } enum with SBuf method pointers, but the result was no more readable and more error-prone than the current PR code because a prefix_() caller could supply a pointer to a method that compiles fine but is not compatible with prefix_() logic. The current/proposed "findFirstNotOf, tokenChars" and "findFirstOf, delimiters" calls are readable and safer.

Moving findFirstOf() and findFirstNotOf() calls outside prefix_() does not work well either, for several reasons.

This comment does not request any changes.
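
The trade-off described above can be sketched as follows. This is an illustration under stated assumptions, not the PR code: `MiniBuf`, `findLastOf()`, `SearchAlgorithm`, `prefixLenByPointer()`, and `prefixLenByEnum()` are hypothetical names, and only `findFirstOf`, `findFirstNotOf`, and `SearchMethod` echo identifiers mentioned in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for SBuf with three same-signature search methods.
class MiniBuf {
public:
    static constexpr std::size_t npos = std::string_view::npos;

    explicit MiniBuf(std::string_view s): data_(s) {}

    std::size_t findFirstOf(std::string_view set) const { return data_.find_first_of(set); }
    std::size_t findFirstNotOf(std::string_view set) const { return data_.find_first_not_of(set); }
    std::size_t findLastOf(std::string_view set) const { return data_.find_last_of(set); }
    std::size_t length() const { return data_.size(); }

private:
    std::string_view data_;
};

// Rejected approach: any method with this signature is accepted,
// including ones that make no sense for prefix extraction.
using SearchMethod = std::size_t (MiniBuf::*)(std::string_view) const;

std::size_t prefixLenByPointer(const MiniBuf &buf, SearchMethod method, std::string_view chars) {
    const auto pos = (buf.*method)(chars);
    return pos == MiniBuf::npos ? buf.length() : pos;
}

// Approach kept: a closed set of supported searches.
enum class SearchAlgorithm { findFirstOf, findFirstNotOf };

std::size_t prefixLenByEnum(const MiniBuf &buf, SearchAlgorithm algorithm, std::string_view chars) {
    const auto pos = (algorithm == SearchAlgorithm::findFirstOf) ?
        buf.findFirstOf(chars) : buf.findFirstNotOf(chars);
    return pos == MiniBuf::npos ? buf.length() : pos;
}
```

With the pointer variant, `prefixLenByPointer(buf, &MiniBuf::findLastOf, chars)` compiles without complaint even though its result is not a prefix length in any useful sense. The enum variant cannot express that mistake.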

FYI, That trouble is a "red flag" for this alteration being a bad design change.

IMHO, the use-case I assume you are trying to achieve would be better reached by fixing the problems we have with the token() method design that are currently forcing some weird uses of prefix().

rousskov · 2025-02-25T14:26:05Z

src/parser/Tokenizer.cc

        return false;
    }
-    if (prefixLen == SBuf::npos && (atEnd() || limit == 0)) {
-        debugs(24, 8, "no char in set " << tokenChars.name << " while looking for prefix");
+    if (prefixLen == SBuf::npos && !limitedBuf.length()) {


We can drop npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons:

1. Zero limitedBuf.length() condition is probably rare. Delaying its computation (as done in the official and PR code) can be viewed as an optimization (attempt).

2. Computing `prefixLen == SBuf::npos` is necessary and probably faster than doing limitedBuf.length() check. Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt).

3. Keeping the two SBuf::npos conditions together clarifies this code. Separating them (and removing one npos check) may complicate correct code interpretation.

4. Keeping this condition in its current place results in fewer out-of-scope changes.

This comment does not request any changes.
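
The ordering of checks discussed above can be sketched like this. It is a rough reconstruction from the diff fragment, not the actual prefix_() body: `tokenPrefix()` is a hypothetical free function, `std::string_view` stands in for SBuf, and returning `std::nullopt` stands in for `return false`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the longest prefix of buf (capped at limit characters) that
// consists of tokenChars only, or nullopt when there is no such prefix.
std::optional<std::string_view>
tokenPrefix(std::string_view buf, std::size_t limit, std::string_view tokenChars) {
    constexpr auto npos = std::string_view::npos;

    // Naming the limited view once keeps the later conditions simple.
    const auto limitedBuf = buf.substr(0, limit);
    auto prefixLen = limitedBuf.find_first_not_of(tokenChars);

    if (prefixLen == 0)
        return std::nullopt; // no prefix: the first character is not a token character

    // First npos case: nothing to search (empty input or zero limit).
    if (prefixLen == npos && limitedBuf.empty())
        return std::nullopt;

    // Second npos case: every available character belongs to the prefix.
    if (prefixLen == npos)
        prefixLen = limitedBuf.size();

    return limitedBuf.substr(0, prefixLen);
}
```

The emptiness check is evaluated only after the npos comparison succeeds, which mirrors the "delay the rare condition" argument in reason 1.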

yadij

Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

yadij · 2025-02-26T07:51:17Z

src/parser/Tokenizer.cc

        return false;
    }
-    if (prefixLen == SBuf::npos && (atEnd() || limit == 0)) {
-        debugs(24, 8, "no char in set " << tokenChars.name << " while looking for prefix");
+    if (prefixLen == SBuf::npos && !limitedBuf.length()) {


> We can drop npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons:
>
> 1. Zero limitedBuf.length() condition is probably rare. Delaying its computation (as done in the official and PR code) can be viewed as an optimization (attempt).

That can be measured easily enough. No need to guess.
If you do not want to perfect it now while touching this code, then I suggest adding a TODO about doing that measurement and optimizing. This is a high-use piece of code.

> 2. Computing `prefixLen == SBuf::npos` is necessary and probably faster than doing limitedBuf.length() check. Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt).

Are you sure about that? IMO they are "probably" the same cost, since both are an integer lookup and a comparison against a constant.

rousskov · 2025-02-26T15:00:44Z

src/parser/Tokenizer.cc

        return false;
    }
-    if (prefixLen == SBuf::npos && (atEnd() || limit == 0)) {
-        debugs(24, 8, "no char in set " << tokenChars.name << " while looking for prefix");
+    if (prefixLen == SBuf::npos && !limitedBuf.length()) {


> > We can drop npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons:
> >
> > 1. Zero limitedBuf.length() condition is probably rare. Delaying its computation (as done in the official and PR code) can be viewed as an optimization (attempt).
>
> That can be measured easily enough. No need to guess. If you do not want to perfect it now while touching this code, then I suggest adding a TODO about doing that measurement and optimizing. This is a high-use piece of code.

I believe that correctly measuring the effect of those additional optimizations is difficult, but I added the requested TODO in the hope of merging this PR (commit 8d848cd).

> > 2. Computing `prefixLen == SBuf::npos` is necessary and probably faster than doing limitedBuf.length() check. Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt).
>
> Are you sure about that? IMO they are "probably" the same cost, since both are an integer lookup and a comparison against a constant.

I am pretty sure about the correctness of my "necessary and probably faster" statement. The alleged speed difference here is based on the suspicion that, in some cases, prefixLen will probably be stored in a CPU register while computing limitedBuf.length() may require a memory lookup.

rousskov · 2025-02-26T15:21:21Z

src/parser/Tokenizer.cc

@@ -104,7 +111,7 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChar

    SBuf result;

-    if (!prefix(result, tokenChars, limit))
+    if (!prefix_(result, limit, findFirstNotOf, tokenChars))


> FYI, That trouble is a "red flag" for this alteration being a bad design change.

I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes. Moreover, I do not think it is accurate to describe proposed code changes as "design change". The overall Tokenizer design remains the same. We are just adding another useful method.

> Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

- const auto tokenCharacters = delimiters.complement("non-delimiters");

Existing AppendTokens() always performs expensive CharacterSet negation and copying. It is definitely not "fine"! I have now added an explicit statement to the PR description to detail the problem solved by this PR.

> IMHO, the use-case I assume you are trying to achieve would be better reached by fixing the problems we have with the token() method design that are currently forcing some weird uses of prefix().

This PR avoids expensive CharacterSet negation and copying on every AppendTokens() call. This optimization was planned earlier and does not preclude any Tokenizer::token() method redesign. If you would like to propose Tokenizer::token() design improvements, please do so, but please do not block this PR even if you think those future improvements are going to make this optimization unnecessary.

eduard-bagdasaryan and others added 10 commits February 20, 2025 01:43

Optimize AppendTokens() by adding Tokenizer::prefixUntil()

8020835

fixup: Reduced prefix() naming ambiguity by renaming protected method

775a49e

WIP: Avoid adding enum

177db9b

XXX: This code compiles, but I am concerned that callers may specify a wrong/third SBuf method (with the same profile as SearchMethod).

Do not use SBuf method pointers, addressing previous commit concern

8986638

Keep the new prefix_() parameter order as a lot more readable.

fixup: More consistent CharacterSet param naming

a2e873d

... across Tokenizer methods (at least).

fixup: More consistent debugs() output formatting for Tokenizer

d20a96f

... and slightly fewer official code changes.

fixup: Use a more descriptive name for the local variable

fcd34e2

fixup: Polished new method description

4063879

fixup: Separate documented method to improve code readability

f290fa9

rousskov mentioned this pull request Feb 25, 2025

Bug 5417: An empty annotation value does not match #1896

Closed

rousskov previously approved these changes Feb 25, 2025

rousskov added the M-cleared-for-merge (https://github.com/measurement-factory/anubis#pull-request-labels) and S-could-use-an-approval (An approval may speed this PR merger but is not required) labels Feb 25, 2025

yadij requested changes Feb 26, 2025

yadij added the S-waiting-for-author (author action is expected and usually required) label and removed the M-cleared-for-merge (https://github.com/measurement-factory/anubis#pull-request-labels) and S-could-use-an-approval (An approval may speed this PR merger but is not required) labels Feb 26, 2025

fixup: Added an optimization TODO

8d848cd

Context: squid-cache#2003 (review)

rousskov dismissed their stale review via 8d848cd February 26, 2025 15:06

rousskov approved these changes Feb 26, 2025

rousskov requested a review from yadij February 26, 2025 15:33

rousskov added the S-waiting-for-reviewer (ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box) label and removed the S-waiting-for-author (author action is expected and usually required) label Feb 26, 2025
