Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined #10696

Snuffleupagus · 2019-03-29T16:21:00Z

Note how XRef.fetchUncompressed, which is used a lot for most PDF documents, is calling the makeSubStream method without providing a length argument.
In practice this results in the makeSubStream method, on the ChunkedStream instance, calling the ensureRange method with NaN as the end position, thus resulting in no data being requested despite it possibly being necessary.

This may be quite bad, since in this particular case it will lead to a new ChunkedStream being created and also a new Parser/Lexer instance. Given that it's quite possible that even the very first Parser.getObj call could throw MissingDataException, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives).

You obviously need to be very careful to not have ChunkedStream.makeSubStream accidentally requesting the entire file, hence its this.end property is of no use here, but it should be possible to at least check that the start of the data is present before any potentially expensive parsing occurs.

… exists when the `length` parameter is undefined Note how `XRef.fetchUncompressed`, which is used *a lot* for most PDF documents, is calling the `makeSubStream` method without providing a `length` argument. In practice this results in the `makeSubStream` method, on the `ChunkedStream` instance, calling the `ensureRange` method with `NaN` as the end position, thus resulting in no data being requested despite it possibly being necessary. This may be quite bad, since in this particular case it will lead to a new `ChunkedStream` being created *and* also a new `Parser`/`Lexer` instance. Given that it's quite possible that even the very first `Parser.getObj` call could throw `MissingDataException`, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives). You obviously need to be very careful to not have `ChunkedStream.makeSubStream` accidentally requesting the *entire* file, hence its `this.end` property is of no use here, but it should be possible to at least check that the `start` of the data is present before any potentially expensive parsing occurs.

pdfjsbot · 2019-03-29T16:27:31Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a5831309000cd1e/output.txt

pdfjsbot · 2019-03-29T16:27:31Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/3ddeaa2f625efd8/output.txt

pdfjsbot · 2019-03-29T16:45:45Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/a5831309000cd1e/output.txt

Total script time: 18.22 mins

Font tests: Passed
Unit tests: FAILED
Regression tests: Passed

pdfjsbot · 2019-03-29T16:53:16Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/3ddeaa2f625efd8/output.txt

Total script time: 25.72 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

This is *similar* to the existing check using in `ChunkedStream.ensureRange`.

pdfjsbot · 2019-03-29T19:06:21Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a8ec9ca6b14ee07/output.txt

pdfjsbot · 2019-03-29T19:06:22Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/916b59049e53914/output.txt

pdfjsbot · 2019-03-29T19:24:25Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/a8ec9ca6b14ee07/output.txt

Total script time: 18.05 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2019-03-29T19:32:02Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/916b59049e53914/output.txt

Total script time: 25.67 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/916b59049e53914/reftest-analyzer.html#web=eq.log

pdfjsbot · 2019-03-29T19:55:44Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/1baf4504b136638/output.txt

pdfjsbot · 2019-03-29T19:55:44Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/1bb16f98e064373/output.txt

pdfjsbot · 2019-03-29T20:13:50Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/1baf4504b136638/output.txt

Total script time: 18.09 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2019-03-29T20:21:22Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/1bb16f98e064373/output.txt

Total script time: 25.63 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/1bb16f98e064373/reftest-analyzer.html#web=eq.log

pdfjsbot · 2019-03-29T20:28:07Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/dec17c1dbfd8e0b/output.txt

pdfjsbot · 2019-03-29T20:28:07Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/df4b9669ae64cf9/output.txt

pdfjsbot · 2019-03-29T20:46:14Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/df4b9669ae64cf9/output.txt

Total script time: 18.11 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2019-03-29T20:53:51Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/dec17c1dbfd8e0b/output.txt

Total script time: 25.72 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

Snuffleupagus · 2019-03-30T10:17:13Z

Looking at the browsertest results it seems that the second patch may have reduced the runtime by approximately 1%, which is great considering just how simple it really is :-)

timvandermeij · 2019-03-30T18:44:45Z

src/core/chunked_stream.js

@@ -98,6 +98,10 @@ class ChunkedStream {
  }

  ensureByte(pos) {
+    if (pos < this.progressiveDataLength) {


I assume pos is 0-indexed, so the < check will work. However, why are we using <= then in ensureRange below? I think the end is also a 0-indexed position, so it that wrong below?

The way that I've understood MissingDataException, and related code, is that begin is inclusive while end is exclusive, which should thus explain things; note the formatting used in

pdf.js/src/core/core_utils.js

Lines 32 to 35 in 47f208d

function MissingDataException(begin, end) {

this.begin = begin;

this.end = end;

this.message = `Missing data [${begin}, ${end})`;

and compare with standard interval notation as outlined in https://en.wikipedia.org/wiki/Interval_(mathematics)#Including_or_excluding_endpoints

To be absolutely sure though, I suppose that you might want @yurydelendik to weigh in here!

@yurydelendik Sorry to bother you, but any chance that you have time to comment here?

Also, given that a ensureByte(pos) call should essentially be equal to a ensureRange(pos, pos + 1) call, that would mean a if (pos + 1 <= this.progressiveDataLength) { check in the latter case which is how I arrived at the condition under discussion here.

timvandermeij · 2019-04-13T15:12:32Z

Thank you!

Add a progressiveDataLength fast-path to ChunkedStream.ensureByte

c6ddbd5

This is *similar* to the existing check using in `ChunkedStream.ensureRange`.

mozilla deleted a comment from kaoeeng Mar 29, 2019

timvandermeij added the core label Mar 29, 2019

timvandermeij reviewed Mar 30, 2019

View reviewed changes

Snuffleupagus mentioned this pull request Apr 6, 2019

Avoid dispatching range requests to fetch PDF data that's already loaded with streaming (PR 10675 follow-up) #10694

Merged

timvandermeij approved these changes Apr 13, 2019

View reviewed changes

timvandermeij merged commit 2d0c38d into mozilla:master Apr 13, 2019

Snuffleupagus deleted the makeSubStream-ensureByte branch April 13, 2019 15:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined #10696

Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined #10696

Snuffleupagus commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

Snuffleupagus commented Mar 30, 2019

timvandermeij Mar 30, 2019

Snuffleupagus Mar 30, 2019

Snuffleupagus Apr 4, 2019

Snuffleupagus Apr 12, 2019

timvandermeij commented Apr 13, 2019

	function MissingDataException(begin, end) {
	this.begin = begin;
	this.end = end;
	this.message = `Missing data [${begin}, ${end})`;

Update ChunkedStream.makeSubStream to actually check if (some) data exists when the length parameter is undefined #10696

Update ChunkedStream.makeSubStream to actually check if (some) data exists when the length parameter is undefined #10696

Conversation

Snuffleupagus commented Mar 29, 2019

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Failed

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Success

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Success

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Failed

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Success

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Failed

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Mar 29, 2019

From: Bot.io (Linux m4)

Success

pdfjsbot commented Mar 29, 2019

From: Bot.io (Windows)

Success

Snuffleupagus commented Mar 30, 2019

timvandermeij Mar 30, 2019

Choose a reason for hiding this comment

Snuffleupagus Mar 30, 2019

Choose a reason for hiding this comment

Snuffleupagus Apr 4, 2019

Choose a reason for hiding this comment

Snuffleupagus Apr 12, 2019

Choose a reason for hiding this comment

timvandermeij commented Apr 13, 2019

Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined #10696

Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined #10696