Cache the normalized unicode-value on the Glyph-instance #15657

Merged


Snuffleupagus
Collaborator

Currently, during text-extraction, we're normalizing and (when necessary) reversing the unicode-values every time, even for glyphs that have been seen before. This seems a little unnecessary, since the result won't change; hence this patch moves that into the `Glyph`-instance and makes it *lazily* initialized.

Taking the `tracemonkey.pdf` document as an example: When extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which means a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is, the more beneficial this should be.

*Please note:* The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.
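
For readers unfamiliar with the pattern, the lazy caching described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the `Glyph` class below is a simplified stand-in, the `isRTL` flag, the `_normalizedUnicode` field and the `computeCount` counter are hypothetical names, and `String.prototype.normalize("NFKC")` is assumed as a placeholder for the project's own normalization.

```javascript
// Simplified stand-in for a glyph class; NOT the actual pdf.js implementation.
class Glyph {
  constructor(unicode, isRTL = false) {
    this.unicode = unicode;
    this.isRTL = isRTL; // hypothetical flag: reverse the result when set
    this._normalizedUnicode = undefined; // filled in on first access only
  }

  get normalizedUnicode() {
    if (this._normalizedUnicode === undefined) {
      Glyph.computeCount++; // demo-only counter for the slow path
      // Assumption: NFKC normalization stands in for the real mapping.
      let value = this.unicode.normalize("NFKC");
      if (this.isRTL) {
        value = Array.from(value).reverse().join("");
      }
      this._normalizedUnicode = value;
    }
    return this._normalizedUnicode;
  }
}
Glyph.computeCount = 0;

// The same instance is looked up many times, but normalized only once.
const ligature = new Glyph("\uFB01"); // the "fi" ligature
for (let i = 0; i < 1000; i++) {
  ligature.normalizedUnicode;
}

// Hit-rate arithmetic for the figures quoted above:
// 69236 lookups, of which 595 (one per unique instance) miss the cache.
const hitRate = 1 - 595 / 69236;
console.log(Glyph.computeCount, (hitRate * 100).toFixed(1));
```

Since the cached value lives on the instance itself, no separate lookup table is needed, and instances whose text is never extracted never pay for the normalization.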

@Snuffleupagus
Collaborator Author

/botio test

@pdfjsbot

pdfjsbot commented Nov 4, 2022

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d16ecec7c3e81f1/output.txt

@pdfjsbot

pdfjsbot commented Nov 4, 2022

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/7a58297ddac95ba/output.txt

@pdfjsbot

pdfjsbot commented Nov 4, 2022

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/d16ecec7c3e81f1/output.txt

Total script time: 25.53 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 6
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/d16ecec7c3e81f1/reftest-analyzer.html#web=eq.log

@pdfjsbot

pdfjsbot commented Nov 4, 2022

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/7a58297ddac95ba/output.txt

Total script time: 31.08 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 1

Image differences available at: http://54.193.163.58:8877/7a58297ddac95ba/reftest-analyzer.html#web=eq.log

Contributor

@calixteman calixteman left a comment

I noticed this caching opportunity a while ago, but I forgot to act on it.
Thank you.

@Snuffleupagus
Collaborator Author

Hmm, what happened to our regular merge work-flow?
I seem to recall that in the past we had trouble running makeref after landing a PR, if a merge commit wasn't used. (That obviously doesn't matter here, but I figured it'd make sense to flag this anyway.)

no_merge

@marco-c
Contributor

marco-c commented Nov 4, 2022

Sorry, I was playing with the branch protection rules. I've enabled the merge option again.

@Snuffleupagus Snuffleupagus merged commit 26f6f77 into mozilla:master Nov 5, 2022
@Snuffleupagus Snuffleupagus deleted the Glyph-normalizedUnicode branch November 5, 2022 08:18
4 participants