Optimize AppendTokens() by adding Tokenizer::prefixUntil() #2003
base: master

Changes from all commits: 8020835, 775a49e, c82b576, 177db9b, 8986638, a2e873d, d20a96f, fcd34e2, 4063879, f290fa9, 8d848cd
@@ -76,18 +76,20 @@ Parser::Tokenizer::token(SBuf &returnedToken, const CharacterSet &delimiters)
 }
 
 bool
-Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
+Parser::Tokenizer::prefix_(SBuf &returnedToken, const SBuf::size_type limit, const SearchAlgorithm searchAlgorithm, const CharacterSet &chars)
 {
-    SBuf::size_type prefixLen = buf_.substr(0,limit).findFirstNotOf(tokenChars);
+    const auto limitedBuf = buf_.substr(0, limit);
+    auto prefixLen = (searchAlgorithm == findFirstOf) ? limitedBuf.findFirstOf(chars) : limitedBuf.findFirstNotOf(chars);
     if (prefixLen == 0) {
-        debugs(24, 8, "no prefix for set " << tokenChars.name);
+        debugs(24, 8, "empty needle with set " << chars.name);
         return false;
     }
-    if (prefixLen == SBuf::npos && (atEnd() || limit == 0)) {
-        debugs(24, 8, "no char in set " << tokenChars.name << " while looking for prefix");
+    if (prefixLen == SBuf::npos && !limitedBuf.length()) {
+        // TODO: Evaluate whether checking limitedBuf.length() before computing prefixLen is an optimization.
+        debugs(24, 8, "empty haystack with limit " << limit);
         return false;
     }
-    if (prefixLen == SBuf::npos && limit > 0) {
+    if (prefixLen == SBuf::npos) {
         debugs(24, 8, "whole haystack matched");
         prefixLen = limit;
     }

@@ -96,6 +98,12 @@ Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, c
     return true;
 }
 
+bool
+Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
+{
+    return prefix_(returnedToken, limit, findFirstNotOf, tokenChars);
+}
+
 SBuf
 Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChars, const SBuf::size_type limit)
 {

@@ -104,7 +112,7 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChar
 
     SBuf result;
 
-    if (!prefix(result, tokenChars, limit))
+    if (!prefix_(result, limit, findFirstNotOf, tokenChars))
Review comment: I have tried to replace the new … Moving the findFirstOf() and findFirstNotOf() calls outside prefix_() does not work well either, for several reasons. This comment does not request any changes.

Reviewer reply: FYI, that trouble is a "red flag" for this alteration being a bad design change. IMHO, the use case I assume you are trying to achieve would be better reached by fixing the problems we have with the …

Author reply: I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes. Moreover, I do not think it is accurate to describe the proposed code changes as a "design change". The overall Tokenizer design remains the same. We are just adding another useful method.

> - const auto tokenCharacters = delimiters.complement("non-delimiters");

The existing AppendTokens() always performs expensive CharacterSet negation and copying. It is definitely not "fine"! I have now added an explicit statement to the PR description to detail the problem solved by this PR.

This PR avoids expensive CharacterSet negation and copying on every AppendTokens() call. This optimization was planned earlier and does not preclude any Tokenizer::token() method redesign. If you would like to propose Tokenizer::token() design improvements, please do so, but please do not block this PR even if you think those future improvements will make this optimization unnecessary.
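The cost being debated can be illustrated with a minimal, self-contained sketch. This is not Squid code: `CharSet`, `tokenViaComplement()`, and `tokenViaPrefixUntil()` are hypothetical stand-ins (using `std::string` and a 256-entry table in place of SBuf and CharacterSet) that model why negating a character set on every call is more work than searching for the delimiters directly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for a CharacterSet: a 256-entry membership table.
struct CharSet {
    std::array<bool, 256> member{};

    // Complementing builds and fills a whole new table -- the per-call
    // cost that searching for delimiters directly avoids.
    CharSet complement() const {
        CharSet c;
        for (int i = 0; i < 256; ++i)
            c.member[i] = !member[i];
        return c;
    }

    bool contains(char ch) const {
        return member[static_cast<unsigned char>(ch)];
    }
};

// Old approach: negate the delimiters, then take the prefix of non-delimiters.
std::string tokenViaComplement(const std::string &buf, const CharSet &delims) {
    const CharSet nonDelims = delims.complement(); // rebuilt on every call
    std::size_t n = 0;
    while (n < buf.size() && nonDelims.contains(buf[n]))
        ++n;
    return buf.substr(0, n);
}

// New approach: search for the first delimiter directly; no negation needed.
std::string tokenViaPrefixUntil(const std::string &buf, const CharSet &delims) {
    std::size_t n = 0;
    while (n < buf.size() && !delims.contains(buf[n]))
        ++n;
    return buf.substr(0, n);
}
```

Both functions return the same token; only the second skips the table rebuild, which is the point of adding prefixUntil().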
        throw TexcHere(ToSBuf("cannot parse ", description));
 
     if (atEnd())

@@ -113,6 +121,12 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChar
     return result;
 }
 
+bool
+Parser::Tokenizer::prefixUntil(SBuf &returnedToken, const CharacterSet &delimiters, SBuf::size_type limit)
+{
+    return prefix_(returnedToken, limit, findFirstOf, delimiters);
+}
+
 bool
 Parser::Tokenizer::suffix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
 {
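Stripped of diff noise, the control flow of the unified helper can be modeled in a few lines. The sketch below is an assumption-laden model, not Squid code: it uses `std::string::find_first_of`/`find_first_not_of` in place of SBuf's methods, and `prefixModel()`, `prefixOf()`, and `prefixUntil()` mirror the patch's prefix_(), prefix(), and prefixUntil() only in shape.

```cpp
#include <cassert>
#include <string>

enum SearchAlgorithm { findFirstOf, findFirstNotOf };

// Models prefix_(): extract the longest prefix of `buf` (capped at `limit`)
// according to the chosen search algorithm.
bool prefixModel(std::string &token, const std::string &buf,
                 std::string::size_type limit, SearchAlgorithm algo,
                 const std::string &chars)
{
    const std::string limitedBuf = buf.substr(0, limit); // substr clamps limit
    auto prefixLen = (algo == findFirstOf)
        ? limitedBuf.find_first_of(chars)
        : limitedBuf.find_first_not_of(chars);
    if (prefixLen == 0)
        return false; // the very first character already ends the prefix
    if (prefixLen == std::string::npos && limitedBuf.empty())
        return false; // empty haystack
    if (prefixLen == std::string::npos)
        prefixLen = limitedBuf.size(); // whole (limited) haystack matched
    token = limitedBuf.substr(0, prefixLen);
    return true;
}

// Models prefix(): longest run of token characters.
bool prefixOf(std::string &t, const std::string &buf, const std::string &tokenChars) {
    return prefixModel(t, buf, buf.size(), findFirstNotOf, tokenChars);
}

// Models prefixUntil(): everything up to the first delimiter.
bool prefixUntil(std::string &t, const std::string &buf, const std::string &delims) {
    return prefixModel(t, buf, buf.size(), findFirstOf, delims);
}
```

The two public entry points differ only in the search algorithm they pass down, which is exactly the factoring the patch introduces.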
Review comment: We could drop the npos check and even move this if statement higher/earlier, before the findFirstOf() call. I believe its current location/shape is slightly better for several reasons: checking prefixLen == SBuf::npos is necessary and probably faster than the limitedBuf.length() check. Placing this npos check first (as done in the official and PR code) can be viewed as an optimization (attempt). This comment does not request any changes.
Reviewer reply: That can be measured easily enough; no need to guess. If you do not want to perfect it now while touching this code, then I suggest adding a TODO about doing that measurement and optimizing. This is a high-use piece of code.

Are you sure about that? IMO they are "probably" the same cost, since both are an integer lookup and a compare against a constant.
Author reply: I believe that correctly measuring the effect of those additional optimizations is difficult, but I added the requested TODO in hope of merging this PR (commit 8d848cd).

I am pretty sure about the correctness of my "necessary and probably faster" statement. The alleged speed difference here is based on the suspicion that, in some cases, prefixLen will probably be stored in a CPU register while computing limitedBuf.length() may require a memory lookup.
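One part of this debate is easy to settle without benchmarks: the two orderings are behaviorally equivalent, because on an empty haystack the find call itself returns npos immediately. The sketch below is a hypothetical model (plain `std::string`, invented names `checkAfterSearch`/`checkBeforeSearch`, a simplified prefixUntil-style predicate) demonstrating only that equivalence; which order is faster remains the open measurement question.

```cpp
#include <cassert>
#include <string>

// Ordering A (as in the patch): search first, then diagnose npos.
// Returns whether a non-empty prefix (possibly the whole buffer) exists.
bool checkAfterSearch(const std::string &haystack, const std::string &needles) {
    const auto pos = haystack.find_first_of(needles);
    if (pos == std::string::npos && haystack.empty())
        return false; // empty haystack: nothing to extract
    return pos != 0;  // pos==0 means the first char is already a delimiter
}

// Ordering B (the suggested alternative): test emptiness before searching.
bool checkBeforeSearch(const std::string &haystack, const std::string &needles) {
    if (haystack.empty())
        return false;
    const auto pos = haystack.find_first_of(needles);
    return pos != 0;
}
```

Both functions agree on every input; only their instruction ordering differs, which is why the question reduces to measurement rather than correctness.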