Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-minor] Fix the way to chunk the strings #13257

Merged
merged 1 commit into from
Apr 30, 2021

Conversation

calixteman
Copy link
Contributor

  • Improve chunking in order to fix some bugs where the spaces aren't here:
    • track the last position where a glyph has been drawn;
    • when a new glyph (first glyph in a chunk) is added then compare its position with the last saved one and add a space or break:
      • there are multiple ways to move the glyphs and to avoid to have to deal with all the different possibilities it's a way easier to just compare positions;
      • and so there is now one function (i.e. "compareWithLastPosition") where all the job is done.
  • Add some breaks in order to get lines;
  • Remove the multiple whites spaces:
    • some spaces were filled with several whites spaces and so it makes harder to find some sequences of words using the search tool;
    • other pdf readers replace spaces by one white space.

@calixteman
Copy link
Contributor Author

/botio test

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the bug1130815-text results, it appears that there's now multiple spaces being added in the footers on the pages. (Obviously that particular PDF document is corrupt, but I figured it cannot hurt to mention this.)

@calixteman
Copy link
Contributor Author

Looking at the bug1130815-text results, it appears that there's now multiple spaces being added in the footers on the pages. (Obviously that particular PDF document is corrupt, but I figured it cannot hurt to mention this.)

Fiche technique... is followed by a white space character and then there is an empty space before Mise a jour so the empty space is replaced by a white space thanks to this patch.
When the last char in the chunk is a white space then we could not add a new one for an empty space, wdyt ?

@Snuffleupagus
Copy link
Collaborator

When the last char in the chunk is a white space then we could not add a new one for an empty space, wdyt ?

That sounds like a nice idea, assuming it's reasonably straightforward to implement, since this situation might not be too uncommon in practice.

@mozilla mozilla deleted a comment from pdfjsbot Apr 27, 2021
@mozilla mozilla deleted a comment from pdfjsbot Apr 27, 2021
@mozilla mozilla deleted a comment from pdfjsbot Apr 27, 2021
@mozilla mozilla deleted a comment from pdfjsbot Apr 27, 2021
@timvandermeij
Copy link
Contributor

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/8115b848e9c7c37/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://3.101.106.178:8877/3f14bf2c5073810/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/8115b848e9c7c37/output.txt

Total script time: 25.44 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/8115b848e9c7c37/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://3.101.106.178:8877/3f14bf2c5073810/output.txt

Total script time: 29.06 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/3f14bf2c5073810/reftest-analyzer.html#web=eq.log

Copy link
Contributor

@brendandahl brendandahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working nicely. One minor change needed.

It doesn't have to be done in this PR, but it would be good to add some integration tests that select the text of a page and compare it to what we expect so we don't regress this in the future.

  - Improve chunking in order to fix some bugs where the spaces aren't here:
    * track the last position where a glyph has been drawn;
    * when a new glyph (first glyph in a chunk) is added then compare its position with the last saved one and add a space or break:
      - there are multiple ways to move the glyphs and to avoid to have to deal with all the different possibilities it's a way easier to just compare positions;
      - and so there is now one function (i.e. "compareWithLastPosition") where all the job is done.
  - Add some breaks in order to get lines;
  - Remove the multiple whites spaces:
    * some spaces were filled with several whites spaces and so it makes harder to find some sequences of words using the search tool;
    * other pdf readers replace spaces by one white space.

Update src/core/evaluator.js

Co-authored-by: Jonas Jenwald <[email protected]>
@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.67.70.0:8877/ed8c19356d68485/output.txt

calixteman added a commit to calixteman/pdf.js that referenced this pull request May 23, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request May 23, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request May 23, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request May 29, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request May 29, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
robertknight added a commit to hypothesis/client that referenced this pull request Aug 17, 2021
Fix anchoring of text quotes and positions in PDF.js releases (>=
2.9.359) that include mozilla/pdf.js#13257.

The client's anchoring relies on the text content of pages extracted via
PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent`
of the hidden text layer element. In older PDF.js releases acheiving
this alignment required excluding text items with all-whitespace text,
because PDF.js did not create elements in the text layer for these. In
PDF.js releases after mozilla/pdf.js#13257 this
filtering is no longer needed.

The fix in this commit is to feature-detect whether the active version
of PDF.js includes this change or not and filter or not filter text
items accordingly.

Future changes to PDF.js could cause mismatches between the result of
`PDFPage.getTextContent` and the rendered text layer in other ways, so a
sanity check has been added which logs a console warning if a mismatch
is detected.
robertknight added a commit to hypothesis/client that referenced this pull request Aug 17, 2021
Fix anchoring of text quotes and positions in PDF.js releases (>=
2.9.359) that include mozilla/pdf.js#13257.

The client's anchoring relies on the text content of pages extracted via
PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent`
of the hidden text layer element. In older PDF.js releases acheiving
this alignment required excluding text items with all-whitespace text,
because PDF.js did not create elements in the text layer for these. In
PDF.js releases after mozilla/pdf.js#13257 this
filtering is no longer needed.

The fix in this commit is to feature-detect whether the active version
of PDF.js includes this change or not and filter or not filter text
items accordingly.

Future changes to PDF.js could cause mismatches between the result of
`PDFPage.getTextContent` and the rendered text layer in other ways, so a
sanity check has been added which logs a console warning if a mismatch
is detected.
robertknight added a commit to hypothesis/client that referenced this pull request Aug 18, 2021
Fix anchoring of text quotes and positions in PDF.js releases (>=
2.9.359) that include mozilla/pdf.js#13257.

The client's anchoring relies on the text content of pages extracted via
PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent`
of the hidden text layer element. In older PDF.js releases acheiving
this alignment required excluding text items with all-whitespace text,
because PDF.js did not create elements in the text layer for these. In
PDF.js releases after mozilla/pdf.js#13257 this
filtering is no longer needed.

The fix in this commit is to feature-detect whether the active version
of PDF.js includes this change or not and filter or not filter text
items accordingly.

Future changes to PDF.js could cause mismatches between the result of
`PDFPage.getTextContent` and the rendered text layer in other ways, so a
sanity check has been added which logs a console warning if a mismatch
is detected.
robertknight added a commit to hypothesis/client that referenced this pull request Aug 31, 2021
Fix anchoring of text quotes and positions in PDF.js releases (>=
2.9.359) that include mozilla/pdf.js#13257.

The client's anchoring relies on the text content of pages extracted via
PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent`
of the hidden text layer element. In older PDF.js releases acheiving
this alignment required excluding text items with all-whitespace text,
because PDF.js did not create elements in the text layer for these. In
PDF.js releases after mozilla/pdf.js#13257 this
filtering is no longer needed.

The fix in this commit is to feature-detect whether the active version
of PDF.js includes this change or not and filter or not filter text
items accordingly.

Future changes to PDF.js could cause mismatches between the result of
`PDFPage.getTextContent` and the rendered text layer in other ways, so a
sanity check has been added which logs a console warning if a mismatch
is detected.
robertknight added a commit to hypothesis/client that referenced this pull request Aug 31, 2021
Fix anchoring of text quotes and positions in PDF.js releases (>=
2.9.359) that include mozilla/pdf.js#13257.

The client's anchoring relies on the text content of pages extracted via
PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent`
of the hidden text layer element. In older PDF.js releases acheiving
this alignment required excluding text items with all-whitespace text,
because PDF.js did not create elements in the text layer for these. In
PDF.js releases after mozilla/pdf.js#13257 this
filtering is no longer needed.

The fix in this commit is to feature-detect whether the active version
of PDF.js includes this change or not and filter or not filter text
items accordingly.

Future changes to PDF.js could cause mismatches between the result of
`PDFPage.getTextContent` and the rendered text layer in other ways, so a
sanity check has been added which logs a console warning if a mismatch
is detected.
calixteman added a commit to calixteman/pdf.js that referenced this pull request Oct 3, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request Oct 9, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request Oct 10, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request Oct 10, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request Oct 13, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
calixteman added a commit to calixteman/pdf.js that referenced this pull request Oct 17, 2021
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
bh213 pushed a commit to bh213/pdf.js that referenced this pull request Jun 3, 2022
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
rousek pushed a commit to signosoft/pdf.js that referenced this pull request Aug 10, 2022
  - PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants