Cache the normalized unicode-value on the Glyph-instance #15657
Conversation
Currently, during text-extraction, we're normalizing and (when necessary) reversing the unicode-values over and over. This seems a little unnecessary, since the result won't change, hence this patch moves that into the `Glyph`-instance and makes it *lazily* initialized.

Taking the `tracemonkey.pdf` document as an example: when extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which means a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is, the more beneficial this should be.

*Please note:* The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.
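To illustrate the idea, here is a minimal sketch of lazily caching a normalized value on a `Glyph` instance. The property name `normalizedUnicode` and the use of `String.prototype.normalize` as a stand-in for the real normalization/RTL-reversal logic are assumptions for this example, not necessarily what the actual patch does:

```js
// Hypothetical sketch, not the exact pdf.js implementation.
class Glyph {
  constructor(unicode) {
    this.unicode = unicode;
  }

  get normalizedUnicode() {
    // Placeholder for the real normalization (and, when necessary,
    // RTL-reversal) logic used during text-extraction.
    const normalized = this.unicode.normalize("NFKC");
    // Shadow the prototype getter with a plain data property on the
    // instance, so every later access is a simple property lookup
    // (the "cache hit" case).
    Object.defineProperty(this, "normalizedUnicode", {
      value: normalized,
      enumerable: true,
    });
    return normalized;
  }
}

// Usage: the first access computes and caches, later accesses reuse the value.
const glyph = new Glyph("ﬁ");
console.log(glyph.normalizedUnicode); // computed once
console.log(glyph.normalizedUnicode); // served from the cached property
```

Since each unique `Glyph` is typically reused many times while extracting a document's text, the normalization cost is paid at most once per distinct glyph.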
/botio test

From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/d16ecec7c3e81f1/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/7a58297ddac95ba/output.txt

From: Bot.io (Linux m4)
Failed. Full output at http://54.241.84.105:8877/d16ecec7c3e81f1/output.txt
Total script time: 25.53 mins
Image differences available at: http://54.241.84.105:8877/d16ecec7c3e81f1/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Full output at http://54.193.163.58:8877/7a58297ddac95ba/output.txt
Total script time: 31.08 mins
Image differences available at: http://54.193.163.58:8877/7a58297ddac95ba/reftest-analyzer.html#web=eq.log
I noticed this opportunity to cache a while ago, but I forgot to do it.
Thank you.
Sorry, I was playing with branch protection rules. I enabled the merge option again.