Misaligned text selection due to overriding font metrics from dummy invisible font file (used when OCRing) #6863

rossj · 2016-01-13T07:41:23Z

Whilst upgrading PDF.js, I noticed a potential regression where the text selection / highlighting of hidden text in OCR'd documents appears vertically offset from the text in the background image.

I tracked the issue down to pull request #6601 (merged in commit bb29e13), which overrides the PDF FontDescriptor metrics with metrics from the header of the embedded font file. Below is a link to a test file that displays the issue, along with screenshots using commit bb29e13 and the previous commit 9a830a7.

ocr-highlight.pdf

Commit bb29e13

Commit 9a830a7

It seems that in order to create the overlaid invisible text, some OCR software uses a dummy TTF file with dummy metric information and a single glyph, and the PDF FontDescriptor information is expected to be used instead. Tesseract OCR uses this technique. You can see a discussion of the method in their code here.

Tesseract does not appear to change the PDF FontDescriptor information based on the font that is being OCR'd, so it seems reasonable for them to update the dummy .ttf file to contain the same metrics that are used in the FontDescriptor, which I believe would fix the issue for new PDFs generated by Tesseract. However, this won't help PDFs that are already out in the wild with this discrepancy.

Perhaps it would be possible to detect these dummy fonts and then trust the FontDescriptor information?
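One way to picture the suggestion above is the following minimal sketch. It is not PDF.js code: the `Metrics` type, the `choose_metrics` function, and the `min_height` threshold are all hypothetical names and values chosen for illustration. The idea is to treat an embedded font whose ascent-to-descent span is close to zero as a dummy and fall back to the FontDescriptor values. The sample numbers are the ones reported later in this thread.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Ascent and descent, both already normalized to text space units."""
    ascent: float
    descent: float


def choose_metrics(descriptor: Metrics, embedded: Metrics,
                   min_height: float = 0.05) -> Metrics:
    """Pick which metrics to trust (hypothetical heuristic).

    If the embedded font's total height is implausibly small while the
    FontDescriptor's height is plausible, assume the embedded font is a
    dummy and keep the FontDescriptor values. The 0.05 threshold is an
    arbitrary assumption, not a value taken from any viewer.
    """
    embedded_height = embedded.ascent - embedded.descent
    descriptor_height = descriptor.ascent - descriptor.descent
    if embedded_height < min_height and descriptor_height >= min_height:
        return descriptor
    return embedded


# Values reported later in this thread for a Tesseract-generated PDF.
descriptor = Metrics(ascent=500 / 1000, descent=-1 / 1000)
embedded = Metrics(ascent=1 / 2048, descent=-1 / 2048)
chosen = choose_metrics(descriptor, embedded)
```

With these inputs the heuristic keeps the FontDescriptor metrics. A real implementation would likely need more signals (for example the glyph count) to avoid misclassifying legitimate fonts.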


timvandermeij · 2016-01-13T12:24:58Z

This might be related to #6509.

/cc @yurydelendik

jbreiden · 2016-02-13T01:02:05Z

@rossj Can you help me determine exactly which font metrics disagree? I am the right person to make an update with respect to Tesseract. Here's the current situation at HEAD in Tesseract.

font-metrics.ttx.txt
pdf-metrics.txt

simple-1.pdf

rlucha · 2016-06-06T11:06:03Z

We have the same problem with our PDFs OCR'ed with Tesseract. Is there any plan to fix this in the future?

If we can help in any way to test the issue please ask.

rossj · 2016-06-07T21:12:25Z

Sorry for the delayed response @jbreiden; I missed your reply notification.

I believe the discrepancy is in the ascent and descent values. In the PDF metrics, ascent = 500 / 1000 = 0.5 "text space" units and descent = -1 / 1000 = -0.001 text space units. However, pdf.js overrides these values with those from the TTF font metrics, where ascent = 1 / 2048 ≈ 0.00049 text space units and descent = -1 / 2048 ≈ -0.00049 text space units.

I believe setting ascent in the TTF font metrics to be 1024 should fix the discrepancy. In both cases the descent value is essentially 0 (and is probably only not 0 because the PDF spec says it must be negative).

As a potentially separate but related issue, I believe Tesseract's PDF generator is intended to create half-height highlights (see this comment). I think setting the PDF font metric ascent to 1000 (and setting the corresponding ascent value in the TTF metrics to 2048) might produce better full-height highlights.
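The arithmetic in this comment can be checked with a short sketch. It assumes, as the figures above imply, that FontDescriptor values are divided by 1000 and TTF values by a unitsPerEm of 2048; the helper name `to_text_space` is made up for illustration.

```python
from fractions import Fraction

PDF_UNITS = 1000          # FontDescriptor metrics are in 1/1000 text space units
TTF_UNITS_PER_EM = 2048   # unitsPerEm implied by the figures in this comment


def to_text_space(value: int, units: int) -> Fraction:
    """Normalize a raw font metric to text space units (exact arithmetic)."""
    return Fraction(value, units)


# Current state: the two sources disagree by roughly three orders of magnitude.
pdf_ascent = to_text_space(500, PDF_UNITS)            # 1/2
ttf_ascent = to_text_space(1, TTF_UNITS_PER_EM)       # 1/2048

# Proposed fix: a TTF ascent of 1024 matches the PDF ascent of 500.
ttf_ascent_fixed = to_text_space(1024, TTF_UNITS_PER_EM)

# Full-height variant: PDF ascent 1000 paired with TTF ascent 2048.
pdf_ascent_full = to_text_space(1000, PDF_UNITS)
ttf_ascent_full = to_text_space(2048, TTF_UNITS_PER_EM)

print(float(pdf_ascent), float(ttf_ascent), float(ttf_ascent_fixed))
```

Using `Fraction` keeps the comparison exact, so the two proposed pairs (500 with 1024, and 1000 with 2048) can be tested for equality without floating-point tolerance.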

jbreiden · 2016-06-08T00:27:20Z

@rossj Thank you, I raised ascent to 1024 in the TTF and tested in Firefox version 46.0.1. Although there are significant improvements, we're not quite there yet. Any thoughts? (Images below are Before vs After the TTF change)

jbreiden · 2016-06-08T00:36:30Z

I get better but not perfect results in Firefox with additional fiddling, specifically by switching some of the various yMax numbers in the font to 2048 and switching the PDF height metrics to 1000, as per your suggestions. My guess is that most PDF viewers primarily look at the metrics from the PDF file itself and get good results, whereas Firefox listens to the font and still finds some trouble in there. I just don't know what that trouble is.

jbreiden · 2016-06-10T20:40:40Z

Here's another example to play with. I recommend taking a look at the calculation that handles Tz, especially when it is doing a horizontal squeeze (i.e. any value less than 100).

align.pdf
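For reference, here is a minimal sketch of how the Tz operator enters the horizontal glyph displacement, following the text-positioning formula in the PDF specification. This is not the PDF.js calculation; the function and parameter names are made up for illustration, and the sample glyph width and font size are arbitrary.

```python
def horizontal_displacement(w0: float, font_size: float,
                            tz_percent: float = 100.0,
                            char_spacing: float = 0.0,
                            word_spacing: float = 0.0,
                            tj_adjust: float = 0.0) -> float:
    """Horizontal advance of one glyph in text space.

    Implements tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th, where
    Th = Tz / 100. The caller should pass a non-zero word_spacing only
    for the space character, since Tw applies only to it.
    """
    th = tz_percent / 100.0
    return ((w0 - tj_adjust / 1000.0) * font_size + char_spacing + word_spacing) * th


# Illustrative values: a glyph half an em wide at font size 12.
normal = horizontal_displacement(w0=0.5, font_size=12.0)
squeezed = horizontal_displacement(w0=0.5, font_size=12.0, tz_percent=50.0)
```

Note that Tz scales character and word spacing as well as the glyph width, so a viewer that applies the squeeze only to the glyph width would drift further along a line of text.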

bryanph · 2018-09-06T20:01:21Z

Any progress on this?

jbreiden · 2018-09-06T21:57:07Z

I just tested against Firefox 52.8.1. Looks even worse, especially copy-paste performance. Would love to discuss with a pdf.js programmer.

timvandermeij · 2021-02-13T14:51:27Z

Closing since most of these files are fixed now, and this issue contains lots of different files with different problems, making it less actionable. The Tesseract problem is still tracked in another issue. If there are remaining issues, please open a new issue per problem.

timvandermeij added the text-selection label Jan 13, 2016

rossj mentioned this issue Feb 12, 2016

Replace pdf.ttf with sharp2.ttf, keep name the same tesseract-ocr/tesseract#220

Merged

jbreiden mentioned this issue Jun 14, 2016

text highlighting quirk on PDF files produced by Tesseract #6509

Open

jbreiden mentioned this issue Jul 19, 2016

Glyphless font in pdf leads to spaces between characters tesseract-ocr/tesseract#373

Closed

Rob--W mentioned this issue Feb 21, 2018

ocr-ed pdfs from tesseract not searchable. #9096

Closed

jbreiden mentioned this issue Mar 2, 2018

Add interword space option to HOCR pdf renderer ocrmypdf/OCRmyPDF#225

Closed

jbreiden3 mentioned this issue Feb 10, 2020

Invisible glyph bounds at wrong positions in PDF tesseract-ocr/tesseract#2879

Closed

timvandermeij closed this as completed Feb 13, 2021

yeus mentioned this issue Jan 27, 2022

Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files pdfminer/pdfminer.six#618

Open
