Cache the normalized unicode-value on the Glyph-instance #15657
Conversation
Currently, during text-extraction, we're normalizing and (when necessary) reversing the unicode-values over and over. This seems a little unnecessary, since the result won't change, hence this patch moves that into the `Glyph`-instance and makes it *lazily* initialized.

Taking the `tracemonkey.pdf` document as an example: when extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which means a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is, the more beneficial this should be.

*Please note:* The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.
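To illustrate the idea, here is a minimal sketch of lazily caching a normalized value on a `Glyph` instance. The property name `normalizedUnicode` and the use of `String.prototype.normalize` as a stand-in for the real normalization/RTL-reversal logic are assumptions for this example, not necessarily what the actual patch does:

```js
// Hypothetical sketch, not the exact pdf.js implementation.
class Glyph {
  constructor(unicode) {
    this.unicode = unicode;
  }

  get normalizedUnicode() {
    // Placeholder for the real normalization (and, when necessary,
    // RTL-reversal) logic used during text-extraction.
    const normalized = this.unicode.normalize("NFKC");
    // Shadow the prototype getter with a plain data property on the
    // instance, so every later access is a simple property lookup
    // (the "cache hit" case).
    Object.defineProperty(this, "normalizedUnicode", {
      value: normalized,
      enumerable: true,
    });
    return normalized;
  }
}

// Usage: the first access computes and caches, later accesses reuse the value.
const glyph = new Glyph("ﬁ");
console.log(glyph.normalizedUnicode); // computed once
console.log(glyph.normalizedUnicode); // served from the cached property
```

Since each unique `Glyph` is typically reused many times while extracting a document's text, the normalization cost is paid at most once per distinct glyph.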
/botio test

From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/d16ecec7c3e81f1/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/7a58297ddac95ba/output.txt

From: Bot.io (Linux m4)
Failed. Full output at http://54.241.84.105:8877/d16ecec7c3e81f1/output.txt
Total script time: 25.53 mins
Image differences available at: http://54.241.84.105:8877/d16ecec7c3e81f1/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Full output at http://54.193.163.58:8877/7a58297ddac95ba/output.txt
Total script time: 31.08 mins
Image differences available at: http://54.193.163.58:8877/7a58297ddac95ba/reftest-analyzer.html#web=eq.log
I noticed this opportunity to cache a while ago, but I forgot to do it.
Thank you.
Sorry, I was playing with branch protection rules. I enabled the merge option again.