Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add local caching of "simple" Graphics State (ExtGState) data in PartialEvaluator.{getOperatorList, getTextContent} (issue 2813) #12087

Merged
merged 3 commits into from
Jul 17, 2020

Conversation

Snuffleupagus
Copy link
Collaborator

@Snuffleupagus Snuffleupagus commented Jul 11, 2020

  • Add local caching of "simple" Graphics State (ExtGState) data in PartialEvaluator.getOperatorList (issue 2813)

    This patch will help pathological cases the most, with issue Extrememly slow rendering on this pdf #2813 being a particularily problematic example. While there's only four /ExtGState resources, there's a total 29062 of setGState operators. Even though parsing of a single /ExtGState resource is quite fast, having to re-parse them thousands of times does add up quite significantly.

    For simplicity we'll only cache "simple" /ExtGState resource, since e.g. the general SMask case cannot be easily cached (without re-factoring other code, which may have undesirable effects on general parsing).

    By caching "simple" /ExtGState resource, we thus improve performance by:

    • Not having to fetch/validate/parse the same /ExtGState data over and over.
    • Handling of repeated setGState operators becomes synchronous during the OperatorList building, instead of having to defer to the event-loop/microtask-queue since the /ExtGState parsing is done asynchronously.

    Obviously I had intended to include (standard) benchmark results with this patch, but for reasons I don't understand the test run-time (even with master) of the document in issue Extrememly slow rendering on this pdf #2813 is a lot slower than in the development viewer (making normal benchmarking infeasible).
    However, testing this manually in the development viewer (using pdfBug=Stats) shows a reduction of ~10 % in the rendering time of the PDF document in issue Extrememly slow rendering on this pdf #2813.

  • Add local caching of non-font Graphics State (ExtGState) data in PartialEvaluator.getTextContent

    It turns out that getTextContent suffers from similar problems with repeated GStates as getOperatorList; please see the previous patch.

    While only /ExtGState resources containing Fonts will actually be parsed by PartialEvaluator.getTextContent, we're still forced to fetch/validate repeated /ExtGState resources even though most of them won't affect the textContent (since they mostly contain purely graphical state).

    With these changes we also no longer need to immediately reset the current text-state when encountering a setGState operator, which may thus improve text-selection in some cases.

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/de2b47a8d00ce69/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/5e608eacd2ef0bd/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/de2b47a8d00ce69/output.txt

Total script time: 26.66 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/de2b47a8d00ce69/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/5e608eacd2ef0bd/output.txt

Total script time: 31.29 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/5e608eacd2ef0bd/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/83903137cb25f06/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/30816e7aa97ed50/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/83903137cb25f06/output.txt

Total script time: 26.65 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/83903137cb25f06/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/30816e7aa97ed50/output.txt

Total script time: 30.57 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/30816e7aa97ed50/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Copy link
Collaborator Author

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/b82ead3d1bc23ec/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/b82ead3d1bc23ec/output.txt

Total script time: 3.36 mins

Published

@Snuffleupagus Snuffleupagus marked this pull request as ready for review July 12, 2020 13:18
@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/637a5dd89842027/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/90d6977baa9680d/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/90d6977baa9680d/output.txt

Total script time: 26.63 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/90d6977baa9680d/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/637a5dd89842027/output.txt

Total script time: 29.76 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/637a5dd89842027/reftest-analyzer.html#web=eq.log

…tialEvaluator.getOperatorList` (issue 2813)

This patch will help pathological cases the most, with issue 2813 being a particularily problematic example. While there's only *four* `/ExtGState` resources, there's a total `29062` of `setGState` operators. Even though parsing of a single `/ExtGState` resource is quite fast, having to re-parse them thousands of times does add up quite significantly.

For simplicity we'll only cache "simple" `/ExtGState` resource, since e.g. the general `SMask` case cannot be easily cached (without re-factoring other code, which may have undesirable effects on general parsing).

By caching "simple" `/ExtGState` resource, we thus improve performance by:
 - Not having to fetch/validate/parse the same `/ExtGState` data over and over.
 - Handling of repeated `setGState` operators becomes *synchronous* during the `OperatorList` building, instead of having to defer to the event-loop/microtask-queue since the `/ExtGState` parsing is done asynchronously.

---

Obviously I had intended to include (standard) benchmark results with this patch, but for reasons I don't understand the test run-time (even with `master`) of the document in issue 2813 is *a lot* slower than in the development viewer (making normal benchmarking infeasible).
However, testing this manually in the development viewer (using `pdfBug=Stats`) shows a *reduction* of `~10 %` in the rendering time of the PDF document in issue 2813.
…tialEvaluator.getTextContent`

It turns out that `getTextContent` suffers from *similar* problems with repeated GStates as `getOperatorList`; please see the previous patch.

While only `/ExtGState` resources containing Fonts will actually be *parsed* by `PartialEvaluator.getTextContent`, we're still forced to fetch/validate repeated `/ExtGState` resources even though *most* of them won't affect the textContent (since they mostly contain purely graphical state).

With these changes we also no longer need to immediately reset the current text-state when encountering a `setGState` operator, which may thus improve text-selection in some cases.
@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5cb559f3e609ef5/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/39fc5ad6e7d7c5a/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/5cb559f3e609ef5/output.txt

Total script time: 26.55 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/5cb559f3e609ef5/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/39fc5ad6e7d7c5a/output.txt

Total script time: 30.33 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/39fc5ad6e7d7c5a/reftest-analyzer.html#web=eq.log

Since this method calls `Dict.get` to fetch data, there could thus be `Error`s thrown in corrupt PDF documents when attempting to resolve an indirect object.
To ensure that this won't ever become a problem, we change the method to be `async` such that a rejected Promise would be returned and general OperatorList parsing won't break.
@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/18ab602b53471b8/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/18ab602b53471b8/output.txt

Total script time: 3.29 mins

Published

@timvandermeij timvandermeij merged commit e63d1eb into mozilla:master Jul 17, 2020
@timvandermeij
Copy link
Contributor

Thank you for looking into this!

@Snuffleupagus Snuffleupagus deleted the LocalGStateCache branch July 17, 2020 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants