Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ChunkedStream.makeSubStream to actually check if (some) data exists when the length parameter is undefined #10696

Merged
merged 2 commits into from
Apr 13, 2019

Conversation

Snuffleupagus
Copy link
Collaborator

Note how XRef.fetchUncompressed, which is used a lot for most PDF documents, is calling the makeSubStream method without providing a length argument.
In practice this results in the makeSubStream method, on the ChunkedStream instance, calling the ensureRange method with NaN as the end position, thus resulting in no data being requested despite it possibly being necessary.

This may be quite bad, since in this particular case it will lead to a new ChunkedStream being created and also a new Parser/Lexer instance. Given that it's quite possible that even the very first Parser.getObj call could throw MissingDataException, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives).

You obviously need to be very careful to not have ChunkedStream.makeSubStream accidentally requesting the entire file, hence its this.end property is of no use here, but it should be possible to at least check that the start of the data is present before any potentially expensive parsing occurs.

… exists when the `length` parameter is undefined

Note how `XRef.fetchUncompressed`, which is used *a lot* for most PDF documents, is calling the `makeSubStream` method without providing a `length` argument.
In practice this results in the `makeSubStream` method, on the `ChunkedStream` instance, calling the `ensureRange` method with `NaN` as the end position,  thus resulting in no data being requested despite it possibly being necessary.

This may be quite bad, since in this particular case it will lead to a new `ChunkedStream` being created *and* also a new `Parser`/`Lexer` instance. Given that it's quite possible that even the very first `Parser.getObj` call could throw `MissingDataException`, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives).

You obviously need to be very careful to not have `ChunkedStream.makeSubStream` accidentally requesting the *entire* file, hence its `this.end` property is of no use here, but it should be possible to at least check that the `start` of the data is present before any potentially expensive parsing occurs.
@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a5831309000cd1e/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/3ddeaa2f625efd8/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/a5831309000cd1e/output.txt

Total script time: 18.22 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/3ddeaa2f625efd8/output.txt

Total script time: 25.72 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

This is *similar* to the existing check using in `ChunkedStream.ensureRange`.
@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a8ec9ca6b14ee07/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/916b59049e53914/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/a8ec9ca6b14ee07/output.txt

Total script time: 18.05 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/916b59049e53914/output.txt

Total script time: 25.67 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/916b59049e53914/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/1baf4504b136638/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/1bb16f98e064373/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/1baf4504b136638/output.txt

Total script time: 18.09 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/1bb16f98e064373/output.txt

Total script time: 25.63 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/1bb16f98e064373/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/dec17c1dbfd8e0b/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/df4b9669ae64cf9/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/df4b9669ae64cf9/output.txt

Total script time: 18.11 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/dec17c1dbfd8e0b/output.txt

Total script time: 25.72 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@mozilla mozilla deleted a comment from kaoeeng Mar 29, 2019
@Snuffleupagus
Copy link
Collaborator Author

Looking at the browsertest results it seems that the second patch may have reduced the runtime by approximately 1%, which is great considering just how simple it really is :-)

@@ -98,6 +98,10 @@ class ChunkedStream {
}

ensureByte(pos) {
if (pos < this.progressiveDataLength) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume pos is 0-indexed, so the < check will work. However, why are we using <= then in ensureRange below? I think the end is also a 0-indexed position, so it that wrong below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way that I've understood MissingDataException, and related code, is that begin is inclusive while end is exclusive, which should thus explain things; note the formatting used in

function MissingDataException(begin, end) {
this.begin = begin;
this.end = end;
this.message = `Missing data [${begin}, ${end})`;
and compare with standard interval notation as outlined in https://en.wikipedia.org/wiki/Interval_(mathematics)#Including_or_excluding_endpoints

To be absolutely sure though, I suppose that you might want @yurydelendik to weigh in here!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurydelendik Sorry to bother you, but any chance that you have time to comment here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, given that a ensureByte(pos) call should essentially be equal to a ensureRange(pos, pos + 1) call, that would mean a if (pos + 1 <= this.progressiveDataLength) { check in the latter case which is how I arrived at the condition under discussion here.

@timvandermeij timvandermeij merged commit 2d0c38d into mozilla:master Apr 13, 2019
@timvandermeij
Copy link
Contributor

Thank you!

@Snuffleupagus Snuffleupagus deleted the makeSubStream-ensureByte branch April 13, 2019 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants