Optimize AppendTokens() by adding Tokenizer::prefixUntil() #2003

Open · wants to merge 11 commits into master
3 changes: 1 addition & 2 deletions src/Notes.cc
@@ -339,10 +339,9 @@ static void
AppendTokens(NotePairs::Entries &entries, const SBuf &key, const SBuf &val, const CharacterSet &delimiters)
{
    Parser::Tokenizer tok(val);
-    const auto tokenCharacters = delimiters.complement("non-delimiters");
    do {
        SBuf token;
-        (void)tok.prefix(token, tokenCharacters);
+        (void)tok.prefixUntil(token, delimiters);
        entries.push_back(new NotePairs::Entry(key, token)); // token may be empty
    } while (tok.skipOne(delimiters));
}
28 changes: 21 additions & 7 deletions src/parser/Tokenizer.cc
@@ -76,18 +76,20 @@ Parser::Tokenizer::token(SBuf &returnedToken, const CharacterSet &delimiters)
}

bool
-Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
+Parser::Tokenizer::prefix_(SBuf &returnedToken, const SBuf::size_type limit, const SearchAlgorithm searchAlgorithm, const CharacterSet &chars)
{
-    SBuf::size_type prefixLen = buf_.substr(0,limit).findFirstNotOf(tokenChars);
+    const auto limitedBuf = buf_.substr(0, limit);
+    auto prefixLen = (searchAlgorithm == findFirstOf) ? limitedBuf.findFirstOf(chars) : limitedBuf.findFirstNotOf(chars);
    if (prefixLen == 0) {
-        debugs(24, 8, "no prefix for set " << tokenChars.name);
+        debugs(24, 8, "empty needle with set " << chars.name);
        return false;
    }
-    if (prefixLen == SBuf::npos && (atEnd() || limit == 0)) {
-        debugs(24, 8, "no char in set " << tokenChars.name << " while looking for prefix");
+    if (prefixLen == SBuf::npos && !limitedBuf.length()) {
Contributor

We can drop npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons:

  1. Zero limitedBuf.length() condition is probably rare. Delaying its computation (as done in the official and PR code) can be viewed as an optimization (attempt).
  2. Computing prefixLen == SBuf::npos is necessary and probably faster than doing limitedBuf.length() check. Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt).
  3. Keeping the two SBuf::npos conditions together clarifies this code. Separating them (and removing one npos check) may complicate correct code interpretation.
  4. Keeping this condition in its current place results in fewer out-of-scope changes.

This comment does not request any changes.

Contributor

We can drop npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons:

1. Zero limitedBuf.length() condition is probably rare. Delaying its computation (as done in the official and PR code) can be viewed as an optimization (attempt).

That can be measured easily enough. No need to guess.
If you do not want to perfect it now while touching this code, then I suggest adding a TODO about doing that measurement and optimizing. This is a high-use piece of code.

2. Computing `prefixLen == SBuf::npos` is necessary and probably faster than doing limitedBuf.length() check.  Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt).

Are you sure about that? IMO they are "probably" the same cost, since both are an integer lookup and a compare against a constant.

Contributor

We can drop npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons:

  1. Zero limitedBuf.length() condition is probably rare. Delaying its computation (as done in the official and PR code) can be viewed as an optimization (attempt).

That can be measured easily enough. No need to guess. If you do not want to perfect it now while touching this code, then I suggest adding a TODO about doing that measurement and optimizing. This is a high-use piece of code.

I believe that correctly measuring the effect of those additional optimizations is difficult, but I added the requested TODO in the hope of merging this PR (commit 8d848cd).

  2. Computing prefixLen == SBuf::npos is necessary and probably faster than doing limitedBuf.length() check. Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt).

Are you sure about that? IMO they are "probably" the same cost, since both are an integer lookup and a compare against a constant.

I am pretty sure about the correctness of my "necessary and probably faster" statement. The alleged speed difference here is based on the suspicion that, in some cases, prefixLen will probably be stored in a CPU register while computing limitedBuf.length() may require a memory lookup.

+        // TODO: Evaluate whether checking limitedBuf.length() before computing prefixLen is an optimization.
+        debugs(24, 8, "empty haystack with limit " << limit);
        return false;
    }
-    if (prefixLen == SBuf::npos && limit > 0) {
+    if (prefixLen == SBuf::npos) {
        debugs(24, 8, "whole haystack matched");
        prefixLen = limit;
    }
@@ -96,6 +98,12 @@ Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
return true;
}

+bool
+Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
+{
+    return prefix_(returnedToken, limit, findFirstNotOf, tokenChars);
+}
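The shared prefix_() control flow can be mirrored in a standalone sketch. The following uses plain std::string with assumed semantics, not Squid's actual SBuf/Tokenizer code: search a limited haystack, fail when the would-be token is empty or the haystack is empty, and treat "no match position found" as "the whole limited haystack is the token".

```cpp
#include <string>

// Standalone sketch of the prefix_() control flow (assumed semantics, not
// Squid code). With matchUntil set, it mirrors prefixUntil() and stops at the
// first delimiter; otherwise it mirrors prefix() and stops at the first
// character outside the permitted set.
static bool
prefixSketch(std::string &buf, std::string &token, const std::string &chars,
             const bool matchUntil, const std::string::size_type limit = std::string::npos)
{
    const std::string limitedBuf = buf.substr(0, limit); // substr() clamps long limits
    auto prefixLen = matchUntil ? limitedBuf.find_first_of(chars)
                                : limitedBuf.find_first_not_of(chars);
    if (prefixLen == 0)
        return false; // the would-be token is empty
    if (prefixLen == std::string::npos && limitedBuf.empty())
        return false; // empty haystack
    if (prefixLen == std::string::npos)
        prefixLen = limitedBuf.size(); // the whole (limited) haystack matched
    token = buf.substr(0, prefixLen);
    buf.erase(0, prefixLen); // consume the extracted prefix
    return true;
}
```

The whole-haystack case is why the header documentation warns that the prefix "may continue when/if more input data becomes available later": nothing in the searched window bounded the token.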

SBuf
Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChars, const SBuf::size_type limit)
{
@@ -104,7 +112,7 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChars, const SBuf::size_type limit)

SBuf result;

-    if (!prefix(result, tokenChars, limit))
+    if (!prefix_(result, limit, findFirstNotOf, tokenChars))
Contributor

I have tried to replace the new { findFirstOf, findFirstNotOf } enum with SBuf method pointers, but the result was no more readable and more error-prone than the current PR code, because a prefix_() caller could supply a pointer to a method that compiles fine but is not compatible with prefix_() logic. The current/proposed "findFirstNotOf, tokenChars" and "findFirstOf, delimiters" calls are readable and safer.

Moving findFirstOf() and findFirstNotOf() calls outside prefix_() does not work well either, for several reasons.

This comment does not request any changes.
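The trade-off described in this comment can be sketched in isolation. The names below are hypothetical and use plain std::string rather than Squid's SBuf: the enum restricts callers to the two searches the shared helper supports, whereas a member-function pointer would accept any member with a matching signature, including ones incompatible with the helper's logic.

```cpp
#include <string>

// Hypothetical standalone sketch (not Squid code) of the enum-based dispatch
// described above. Only the two enumerated searches are expressible.
enum SearchAlgorithm { findFirstOf, findFirstNotOf };

static std::string::size_type
scanPrefix(const std::string &haystack, const std::string &chars, const SearchAlgorithm algo)
{
    // dispatch on the enum, mirroring the ternary inside prefix_()
    return (algo == findFirstOf) ? haystack.find_first_of(chars)
                                 : haystack.find_first_not_of(chars);
}
```

A method-pointer parameter such as `size_type (SBuf::*)(const CharacterSet &, size_type) const` would also match unrelated SBuf searches, which is the error-proneness the comment refers to.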

Contributor

FYI, that trouble is a "red flag" for this alteration being a bad design change.

IMHO, the use-case I assume you are trying to achieve would be better reached by fixing the problems we have with the token() method design that are currently forcing some weird uses of prefix().

Contributor

FYI, that trouble is a "red flag" for this alteration being a bad design change.

I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes. Moreover, I do not think it is accurate to describe proposed code changes as "design change". The overall Tokenizer design remains the same. We are just adding another useful method.

Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

-    const auto tokenCharacters = delimiters.complement("non-delimiters");

Existing AppendTokens() always performs expensive CharacterSet negation and copying. It is definitely not "fine"! I have now added an explicit statement to the PR description to detail the problem solved by this PR.

IMHO, the use-case I assume you are trying to achieve would be better reached by fixing the problems we have with the token() method design that are currently forcing some weird uses of prefix().

This PR avoids expensive CharacterSet negation and copying on every AppendTokens() call. This optimization was planned earlier and does not preclude any Tokenizer::token() method redesign. If you would like to propose Tokenizer::token() design improvements, please do so, but please do not block this PR even if you think those future improvements are going to make this optimization unnecessary.
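The negation cost this reply refers to can be illustrated with a hypothetical sketch. Squid's CharacterSet is not reproduced here; the bitset below merely assumes a 256-octet membership table, which is what `delimiters.complement("non-delimiters")` must materialize and copy on every AppendTokens() call, while searching for the delimiters directly needs no new set.

```cpp
#include <bitset>
#include <string>

// Hypothetical sketch (not Squid's CharacterSet) of what a per-call complement
// costs: build and return a fresh 256-bit membership table every time.
static std::bitset<256>
complementOf(const std::string &delimiters)
{
    std::bitset<256> members;
    members.set(); // start with all 256 octets as members
    for (const unsigned char c : delimiters)
        members.reset(c); // drop each delimiter from the set
    return members; // a fresh table, rebuilt on every call
}
```

Hoisting this work out of the loop, or avoiding it entirely by searching for the delimiters themselves (as prefixUntil() does), removes the per-call construction and copy.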

throw TexcHere(ToSBuf("cannot parse ", description));

if (atEnd())
@@ -113,6 +121,12 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChars, const SBuf::size_type limit)
return result;
}

+bool
+Parser::Tokenizer::prefixUntil(SBuf &returnedToken, const CharacterSet &delimiters, SBuf::size_type limit)
+{
+    return prefix_(returnedToken, limit, findFirstOf, delimiters);
+}

bool
Parser::Tokenizer::suffix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
{
17 changes: 17 additions & 0 deletions src/parser/Tokenizer.h
@@ -70,6 +70,15 @@ class Tokenizer
*/
bool prefix(SBuf &returnedToken, const CharacterSet &tokenChars, SBuf::size_type limit = SBuf::npos);

+    /// Extracts all sequential non-delimiter characters up to an optional
+    /// length limit. Any subsequent characters are left intact. If no delimiter
+    /// characters were found, and the length limit has not been reached, then
+    /// the prefix may continue when/if more input data becomes available later!
+    ///
+    /// \retval true if one or more non-delimiter characters were found
+    /// \param returnedToken is used to store the non-delimiter characters found
+    bool prefixUntil(SBuf &returnedToken, const CharacterSet &delimiters, SBuf::size_type limit = SBuf::npos);

/** Extracts all sequential permitted characters up to an optional length limit.
* Operates on the trailing end of the buffer.
*
Expand Down Expand Up @@ -164,6 +173,14 @@ class Tokenizer
int64_t udec64(const char *description, SBuf::size_type limit = SBuf::npos);

protected:
+    /// SBuf searches supported by prefix_()
+    using SearchAlgorithm = enum { findFirstOf, findFirstNotOf };
+
+    /// Code shared by prefix() and prefixUntil() methods.
+    /// \param searchAlgorithm specifies how to scan buf_ prefix using the given CharacterSet
+    /// \param chars searchAlgorithm parameter -- permitted token or delimiter characters
+    bool prefix_(SBuf &returnedToken, SBuf::size_type limit, SearchAlgorithm searchAlgorithm, const CharacterSet &chars);

SBuf consume(const SBuf::size_type n);
SBuf::size_type success(const SBuf::size_type n);
SBuf consumeTrailing(const SBuf::size_type n);