Misaligned text selection due to overriding font metrics from dummy invisible font file (used when OCRing) #6863

Closed
rossj opened this issue Jan 13, 2016 · 10 comments

Comments

@rossj
Contributor

rossj commented Jan 13, 2016

Whilst upgrading PDF.js, I noticed a potential regression: text selection / highlighting of hidden text in OCR'd documents appears vertically offset from the text in the background image.

I tracked the issue down to pull request #6601 (merged in commit bb29e13), which overrides the PDF FontDescriptor metrics with metrics from the header of the embedded font file. Below is a link to a test file that displays the issue, along with screenshots using commit bb29e13 and the previous commit 9a830a7.

ocr-highlight.pdf

Commit bb29e13307a813226e08db6db412b505ab9dc781

Commit 9a830a7b624679ace8fcae5b91ad80d70d91ed1f

It seems that in order to create the overlaid invisible text, some OCR software uses a dummy TTF file with dummy metric information and a single glyph, and the PDF FontDescriptor information is expected to be used instead. Tesseract OCR uses this technique. You can see a discussion of the method in their code here.

Tesseract does not seem to change the PDF FontDescriptor information based on the font being OCR'd, so it seems reasonable for them to update the dummy .ttf file to contain the same metrics that are used in the FontDescriptor, which I believe would fix the issue for new PDFs generated by Tesseract. However, this won't help PDFs that are already out in the wild with this discrepancy.

Perhaps it would be possible to detect these dummy fonts and then trust the FontDescriptor information?
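
To make the suggestion concrete, here is a rough Python sketch of what such a detection heuristic could look like. It is not PDF.js code: the function name, the glyph-count limit, and the minimum-extent threshold are all assumptions chosen for illustration, and a real implementation would need to be checked against actual dummy fonts.

```python
def looks_like_dummy_font(num_glyphs, ascent, descent, units_per_em,
                          max_glyphs=2, min_extent=0.01):
    """Guess whether an embedded font is a placeholder with dummy metrics.

    Hypothetical heuristic: the font has almost no glyphs AND its
    ascent-to-descent extent, normalized by unitsPerEm, is implausibly
    small. Both thresholds are assumptions, not values from PDF.js.
    """
    extent = (ascent - descent) / units_per_em
    return num_glyphs <= max_glyphs and extent < min_extent


def choose_ascent(font_ascent, descriptor_ascent, is_dummy):
    """Pick the FontDescriptor value when the font file looks like a dummy."""
    return descriptor_ascent if is_dummy else font_ascent


# Illustrative values: a single-glyph font with near-zero vertical extent.
dummy = looks_like_dummy_font(num_glyphs=1, ascent=1, descent=-1,
                              units_per_em=2048)
# An ordinary text font with many glyphs and realistic metrics.
normal = looks_like_dummy_font(num_glyphs=500, ascent=1900, descent=-500,
                               units_per_em=2048)
```

Requiring both conditions keeps the check conservative, so ordinary subset fonts with few glyphs but sane metrics would still use the font file's values.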

@timvandermeij
Contributor

This might be related to #6509.

/cc @yurydelendik

@jbreiden

@rossj Can you help me determine exactly which font metrics disagree? I am the right person to make an update with respect to Tesseract. Here's the current situation at HEAD in Tesseract.

font-metrics.ttx.txt
pdf-metrics.txt

simple-1.pdf
screenshot

@rlucha

rlucha commented Jun 6, 2016

We have the same problem with our PDFs OCR'ed with Tesseract. Is there any plan to fix this in the future?

If we can help in any way to test the issue please ask.

@rossj
Contributor Author

rossj commented Jun 7, 2016

Sorry for the delayed response @jbreiden; I missed your reply notification.

I believe the discrepancy is in the ascent and descent values. In the PDF metrics, ascent = 500 / 1000 = 0.5 "text space" units and descent = -1 / 1000 = -0.001 text space units. However, pdf.js overrides these values with those from the TTF font metrics, where ascent = 1 / 2048 = 0.00048 text space units and descent = -1 / 2048 = -0.00048 text space units.

I believe setting ascent in the TTF font metrics to be 1024 should fix the discrepancy. In both cases the descent value is essentially 0 (and is probably only not 0 because the PDF spec says it must be negative).
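
The arithmetic above can be checked with a short Python sketch. The divisors (1000 for the PDF FontDescriptor values and 2048 for the TTF units per em) are taken from the figures quoted in this comment; the helper name is made up for illustration.

```python
def to_text_space(value, units_per_em):
    """Normalize a raw font metric to text space units."""
    return value / units_per_em


# Metrics declared in the PDF FontDescriptor (1000 units per em).
pdf_ascent = to_text_space(500, 1000)    # 0.5
pdf_descent = to_text_space(-1, 1000)    # -0.001

# Metrics in the dummy TTF header (2048 units per em).
ttf_ascent = to_text_space(1, 2048)      # roughly 0.00049
ttf_descent = to_text_space(-1, 2048)    # roughly -0.00049

# Proposed fix: a TTF ascent of 1024 matches the PDF ascent of 500.
proposed_ttf_ascent = to_text_space(1024, 2048)
```

With the proposed value, both sources normalize to the same ascent of 0.5, so it should no longer matter which one the viewer trusts for that metric.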

As a potentially separate but related issue, I believe Tesseract's PDF generator is intended to create half-height highlights (see this comment). I think setting the PDF font metric ascent to 1000 (and setting the corresponding ascent value in the TTF metrics to 2048) might produce better full-height highlights.

@jbreiden

jbreiden commented Jun 8, 2016

@rossj Thank you, I raised ascent to 1024 in the TTF and tested in Firefox version 46.0.1. Although there are significant improvements, we're not quite there yet. Any thoughts? (Images below are Before vs After the TTF change)

control

experiment

@jbreiden

jbreiden commented Jun 8, 2016

I get better but not perfect results on Firefox with additional fiddling: specifically, switching some of the various yMax numbers in the font to 2048 and switching the PDF height metrics to 1000, as per your suggestions. My guess is that most PDF viewers primarily look at the metrics from the PDF file itself and get good results, whereas Firefox is listening to the font and still finding some trouble in there. I just don't know what that trouble is.

a

b

c

2.pdf

pdf.ttx.txt

simple-1.pdf

2.tif.zip

@jbreiden

jbreiden commented Jun 10, 2016

Here's another example to play with. I recommend taking a look at the calculation that handles Tz (horizontal scaling), especially when it is doing a horizontal squeeze (i.e. any value less than 100).
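
For reference, here is a minimal Python sketch of how the Tz operator is meant to affect a glyph's horizontal advance, following the text-positioning formula in the PDF specification. The function and parameter names are illustrative only and do not correspond to PDF.js internals.

```python
def horizontal_advance(glyph_width, font_size, tz_percent=100.0,
                       char_spacing=0.0, word_spacing=0.0):
    """Horizontal displacement of one glyph in text space.

    glyph_width is in glyph space (1/1000 em). Tz is given as a
    percentage, so the scaling factor is Th = Tz / 100. The caller is
    expected to pass word_spacing only for the space character.
    """
    th = tz_percent / 100.0
    return (glyph_width / 1000.0 * font_size
            + char_spacing + word_spacing) * th


normal = horizontal_advance(500, 12)                    # Tz = 100
squeezed = horizontal_advance(500, 12, tz_percent=50)   # Tz = 50
```

Note that Tz scales only the horizontal direction; the vertical extent of the text is unaffected, so a selection layer that applies the factor to both axes, or to neither, would drift from the underlying image.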

align.pdf

screenshot

@bryanph

bryanph commented Sep 6, 2018

Any progress on this?

@jbreiden

jbreiden commented Sep 6, 2018

I just tested against Firefox 52.8.1. Looks even worse, especially copy-paste performance. Would love to discuss with a pdf.js programmer.

@timvandermeij
Contributor

Closing since most of these files are fixed now, and this issue contains lots of different files with different problems, making it less actionable. The Tesseract problem is still tracked in another issue. If there are remaining issues, please open a new issue per problem.


5 participants