-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[api-minor] Fix the way to chunk the strings #13257
Conversation
calixteman
commented
Apr 17, 2021
- Improve chunking in order to fix some bugs where the spaces aren't here:
- track the last position where a glyph has been drawn;
- when a new glyph (first glyph in a chunk) is added then compare its position with the last saved one and add a space or break:
- there are multiple ways to move the glyphs and to avoid to have to deal with all the different possibilities it's a way easier to just compare positions;
- and so there is now one function (i.e. "compareWithLastPosition") where all the job is done.
- Add some breaks in order to get lines;
- Remove the multiple whites spaces:
- some spaces were filled with several whites spaces and so it makes harder to find some sequences of words using the search tool;
- other pdf readers replace spaces by one white space.
/botio test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the bug1130815-text
results, it appears that there's now multiple spaces being added in the footers on the pages. (Obviously that particular PDF document is corrupt, but I figured it cannot hurt to mention this.)
|
That sounds like a nice idea, assuming it's reasonably straightforward to implement, since this situation might not be too uncommon in practice. |
/botio test |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @timvandermeij received. Current queue size: 0 Live output at: http://54.67.70.0:8877/8115b848e9c7c37/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_test from @timvandermeij received. Current queue size: 0 Live output at: http://3.101.106.178:8877/3f14bf2c5073810/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/8115b848e9c7c37/output.txt Total script time: 25.44 mins
Image differences available at: http://54.67.70.0:8877/8115b848e9c7c37/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://3.101.106.178:8877/3f14bf2c5073810/output.txt Total script time: 29.06 mins
Image differences available at: http://3.101.106.178:8877/3f14bf2c5073810/reftest-analyzer.html#web=eq.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working nicely. One minor change needed.
It doesn't have to be done in this PR, but it would be good to add some integration tests that select the text of a page and compare it to what we expect so we don't regress this in the future.
- Improve chunking in order to fix some bugs where the spaces aren't here: * track the last position where a glyph has been drawn; * when a new glyph (first glyph in a chunk) is added then compare its position with the last saved one and add a space or break: - there are multiple ways to move the glyphs and to avoid to have to deal with all the different possibilities it's a way easier to just compare positions; - and so there is now one function (i.e. "compareWithLastPosition") where all the job is done. - Add some breaks in order to get lines; - Remove the multiple whites spaces: * some spaces were filled with several whites spaces and so it makes harder to find some sequences of words using the search tool; * other pdf readers replace spaces by one white space. Update src/core/evaluator.js Co-authored-by: Jonas Jenwald <[email protected]>
/botio test |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @calixteman received. Current queue size: 0 Live output at: http://54.67.70.0:8877/ed8c19356d68485/output.txt |
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
Fix anchoring of text quotes and positions in PDF.js releases (>= 2.9.359) that include mozilla/pdf.js#13257. The client's anchoring relies on the text content of pages extracted via PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent` of the hidden text layer element. In older PDF.js releases acheiving this alignment required excluding text items with all-whitespace text, because PDF.js did not create elements in the text layer for these. In PDF.js releases after mozilla/pdf.js#13257 this filtering is no longer needed. The fix in this commit is to feature-detect whether the active version of PDF.js includes this change or not and filter or not filter text items accordingly. Future changes to PDF.js could cause mismatches between the result of `PDFPage.getTextContent` and the rendered text layer in other ways, so a sanity check has been added which logs a console warning if a mismatch is detected.
Fix anchoring of text quotes and positions in PDF.js releases (>= 2.9.359) that include mozilla/pdf.js#13257. The client's anchoring relies on the text content of pages extracted via PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent` of the hidden text layer element. In older PDF.js releases acheiving this alignment required excluding text items with all-whitespace text, because PDF.js did not create elements in the text layer for these. In PDF.js releases after mozilla/pdf.js#13257 this filtering is no longer needed. The fix in this commit is to feature-detect whether the active version of PDF.js includes this change or not and filter or not filter text items accordingly. Future changes to PDF.js could cause mismatches between the result of `PDFPage.getTextContent` and the rendered text layer in other ways, so a sanity check has been added which logs a console warning if a mismatch is detected.
Fix anchoring of text quotes and positions in PDF.js releases (>= 2.9.359) that include mozilla/pdf.js#13257. The client's anchoring relies on the text content of pages extracted via PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent` of the hidden text layer element. In older PDF.js releases acheiving this alignment required excluding text items with all-whitespace text, because PDF.js did not create elements in the text layer for these. In PDF.js releases after mozilla/pdf.js#13257 this filtering is no longer needed. The fix in this commit is to feature-detect whether the active version of PDF.js includes this change or not and filter or not filter text items accordingly. Future changes to PDF.js could cause mismatches between the result of `PDFPage.getTextContent` and the rendered text layer in other ways, so a sanity check has been added which logs a console warning if a mismatch is detected.
Fix anchoring of text quotes and positions in PDF.js releases (>= 2.9.359) that include mozilla/pdf.js#13257. The client's anchoring relies on the text content of pages extracted via PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent` of the hidden text layer element. In older PDF.js releases acheiving this alignment required excluding text items with all-whitespace text, because PDF.js did not create elements in the text layer for these. In PDF.js releases after mozilla/pdf.js#13257 this filtering is no longer needed. The fix in this commit is to feature-detect whether the active version of PDF.js includes this change or not and filter or not filter text items accordingly. Future changes to PDF.js could cause mismatches between the result of `PDFPage.getTextContent` and the rendered text layer in other ways, so a sanity check has been added which logs a console warning if a mismatch is detected.
Fix anchoring of text quotes and positions in PDF.js releases (>= 2.9.359) that include mozilla/pdf.js#13257. The client's anchoring relies on the text content of pages extracted via PDF.js's text APIs (`PDFPage.getTextContent`) to match the `textContent` of the hidden text layer element. In older PDF.js releases acheiving this alignment required excluding text items with all-whitespace text, because PDF.js did not create elements in the text layer for these. In PDF.js releases after mozilla/pdf.js#13257 this filtering is no longer needed. The fix in this commit is to feature-detect whether the active version of PDF.js includes this change or not and filter or not filter text items accordingly. Future changes to PDF.js could cause mismatches between the result of `PDFPage.getTextContent` and the rendered text layer in other ways, so a sanity check has been added which logs a console warning if a mismatch is detected.
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
- PR mozilla#13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues. - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn; - no space are "drawn": it just moves the cursor but they aren't added in the chunk; - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one. - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font) - it was a pretty good idea in general but it fails with some fonts where space was too big: - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).