text highlighting quirk on PDF files produced by Tesseract #6509

jbreiden · 2015-10-06T21:21:11Z

Programs like Tesseract are used to OCR documents. Basically, we take
a photographic image, recognize any symbolic text, and then compose a PDF
consisting of the photograph and an invisible symbolic text layer for copy-paste
and search.

I am the author of the relevant pdf generation code, and similar code in other
programs. We get very good results in many PDF renderers including pdfium
and poppler, but get misaligned highlighting from pdf.js in Firefox.

GitHub is refusing to let me post a simple example PDF here, so I am
providing a URL instead of an attachment. This is a very simple example from
our test suite. I have 100% control over the PDF generation code and
understand everything about it, so if there is any complaint about it let me
know and we'll work it out.

http://leptonica.org/jbreiden/simple-1.pdf

Build identifier: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0

brendandahl · 2015-10-06T23:44:39Z

There seems to be an issue with how the top value for the text overlay is calculated, but nothing seems obviously wrong to me. The relevant code is at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164-L180

jbreiden · 2015-10-07T03:13:07Z

The relevant PDF objects and the embedded glyphless font say that the Hebrew and English words should highlight identically. I suggest tracing what is causing the difference.

jbreiden · 2015-10-07T20:59:56Z

Here you can see that we are placing the Hebrew and English words on the exact same y position.

5 0 obj
<< /Length 197 >>
stream
q 132.686 0 0 47.314 0 0 cm /Im1 Do Q
BT
3 Tr 1 0 0 1 16.457 19.229 Tm /f-0-0 26 Tf 97.582 Tz [ <0061><006C><006F> ] TJ -1 0 0 1 122.4 19.229 Tm 90.212 Tz [ <05D1><05D0><05D7><05E8><200E> ] TJ 
ET
endstream
endobj
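The claim about identical y positions can be checked mechanically. The following is a minimal sketch, not part of Tesseract or pdf.js: it assumes the text portion of the stream above is available as a string and pulls out every `Tm` operand list with a regular expression. The names `CONTENT`, `TM_RE` and `text_matrices` are made up for this illustration.

```python
import re

# Text portion of the content stream quoted above.
CONTENT = """BT
3 Tr 1 0 0 1 16.457 19.229 Tm /f-0-0 26 Tf 97.582 Tz [ <0061><006C><006F> ] TJ -1 0 0 1 122.4 19.229 Tm 90.212 Tz [ <05D1><05D0><05D7><05E8><200E> ] TJ
ET"""

# Six numeric operands followed by the Tm operator: a b c d e f Tm
NUM = r"-?\d+(?:\.\d+)?"
TM_RE = re.compile(r"\s+".join(f"({NUM})" for _ in range(6)) + r"\s+Tm")

def text_matrices(stream):
    """Return every text matrix (a, b, c, d, e, f) set in the stream."""
    return [tuple(float(g) for g in m.groups()) for m in TM_RE.finditer(stream)]

matrices = text_matrices(CONTENT)
for a, b, c, d, e, f in matrices:
    # f is the vertical translation, i.e. the baseline position.
    print(f"x scale {a:+.0f}, origin x {e}, baseline y {f}")
```

Both matrices carry 19.229 as their last operand, so the two words share a baseline; only the x scale and x origin differ.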

Here you can see that the PDF claims there are no descenders.

11 0 obj
<< /Ascent 500 /CapHeight 500 /Descent -1 /Flags 5 /FontBBox [ 0 0 500 500 ] /FontFile2 12 0 R /FontName /GlyphLessFont /ItalicAngle 0 /StemV 80 /Type /FontDescriptor >>
endobj
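For a sense of scale: FontDescriptor metrics are given in glyph space, which for a TrueType font is 1/1000 of a text-space unit. A rough sketch of the vertical extent these values imply, assuming a renderer sized the highlight box from the descriptor alone (the function name `vertical_extent` is hypothetical, and horizontal scaling via `Tz` is ignored because it does not affect height):

```python
def vertical_extent(baseline_y, font_size, ascent, descent):
    """Return (bottom, top) in user space for metrics in 1/1000 glyph units."""
    top = baseline_y + ascent / 1000 * font_size
    bottom = baseline_y + descent / 1000 * font_size
    return bottom, top

# Values taken from the content stream and FontDescriptor above.
bottom, top = vertical_extent(baseline_y=19.229, font_size=26,
                              ascent=500, descent=-1)
print(f"bottom={bottom:.3f} top={top:.3f}")
```

With `/Descent -1` the box would reach only 0.026 units below the baseline, which is why a highlight extending well below it is unexpected.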

And most importantly, we are mapping every single character to the same invisible, empty glyph. Here are links to the code/documentation and to the custom designed font.

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp
https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf

My best guess (without reading pdf.js code) is that the dimension information in the font itself and the relevant PDF objects are being ignored. Instead, a heuristic looks at the Unicode mapping of the characters and says "This might be English, English has descenders, move stuff around!" If that is the case, I'd really like to know what I can do to avoid triggering such heuristics.

brendandahl · 2015-10-08T00:38:28Z

I missed that they should be on the same line before. It appears the issue is with how we calculate the angle for the text. For the Hebrew word there is a negative x scale component, which seems to be causing issues on our side. Looking into how this should be working....

1 0 0 1 16.457 19.229 Tm
-1 0 0 1 122.4 19.229 Tm

jbreiden · 2015-10-08T05:48:02Z

The -1 just means that I am placing characters from right to left (because Hebrew is a right-to-left language). This is not terribly common practice, but it makes sense, especially when working with an invisible glyphless font.

Please note that I'm claiming the problem is with the English word. The highlight region extends way below the baseline, and it should not be doing that. This problem is 100% reproducible, and occurs for every document produced by Tesseract including pure-English.
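The effect of the -1 can be seen by mapping a text-space advance into user space with the standard PDF rule x' = a·x + c·y + e, y' = b·x + d·y + f. This is only an illustrative sketch; `to_user_space` is a made-up helper and the advance of 10 units is an arbitrary value.

```python
def to_user_space(tm, x, y=0.0):
    """Map a text-space point through the text matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = tm
    return (a * x + c * y + e, b * x + d * y + f)

english_tm = (1, 0, 0, 1, 16.457, 19.229)
hebrew_tm = (-1, 0, 0, 1, 122.4, 19.229)

# Advance 10 units along the text-space x axis (arbitrary illustrative value).
en_x, en_y = to_user_space(english_tm, 10)
he_x, he_y = to_user_space(hebrew_tm, 10)
print(en_x, en_y)  # x grows to the right of 16.457
print(he_x, he_y)  # x shrinks to the left of 122.4
```

The y coordinate is untouched in both cases; the negative x scale only reverses the direction in which glyphs advance.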

jbreiden · 2015-10-08T05:55:21Z

I see this amazing image after a successful copy-paste operation. There is ghostlike, white-on-gray symbolic text overlaid on the image. I have no idea what it means, or where the font is coming from. It certainly is not the font embedded in the PDF, because that one is glyphless. The English word is too low, and the Hebrew word has each character rotated 180 degrees. Maybe this provides a clue.

brendandahl · 2015-10-08T16:19:48Z

To enable text selection we create an invisible DOM overlay, so that is what you're seeing. The overlay doesn't use the embedded font. PDF.js tries to line up the text layer with the underlying canvas, but as we see above this doesn't always work correctly.

jbreiden · 2015-10-08T18:47:18Z

Just in case it helps, this is a dump of the font embedded in the PDF, using ttx.

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="\x00\x01\x00\x00" ttLibVersion="2.5">

  <GlyphOrder>
    <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
    <GlyphID id="0" name=".notdef"/>
    <GlyphID id="1" name=".null"/>
  </GlyphOrder>

  <head>
    <!-- Most of this table will be recalculated by the compiler -->
    <tableVersion value="1.0"/>
    <fontRevision value="1.0"/>
    <checkSumAdjustment value="0xa737b34c"/>
    <magicNumber value="0x5f0f3cf5"/>
    <flags value="00000100 00000111"/>
    <unitsPerEm value="256"/>
    <created value="Thu May 15 23:21:18 2014"/>
    <modified value="Thu May 15 23:21:18 2014"/>
    <xMin value="0"/>
    <yMin value="-32768"/>
    <xMax value="0"/>
    <yMax value="1"/>
    <macStyle value="00000000 00000000"/>
    <lowestRecPPEM value="16"/>
    <fontDirectionHint value="2"/>
    <indexToLocFormat value="0"/>
    <glyphDataFormat value="0"/>
  </head>

  <hhea>
    <tableVersion value="1.0"/>
    <ascent value="1"/>
    <descent value="-1"/>
    <lineGap value="0"/>
    <advanceWidthMax value="0"/>
    <minLeftSideBearing value="0"/>
    <minRightSideBearing value="0"/>
    <xMaxExtent value="0"/>
    <caretSlopeRise value="1"/>
    <caretSlopeRun value="0"/>
    <caretOffset value="0"/>
    <reserved0 value="0"/>
    <reserved1 value="0"/>
    <reserved2 value="0"/>
    <reserved3 value="0"/>
    <metricDataFormat value="0"/>
    <numberOfHMetrics value="2"/>
  </hhea>

  <maxp>
    <!-- Most of this table will be recalculated by the compiler -->
    <tableVersion value="0x10000"/>
    <numGlyphs value="2"/>
    <maxPoints value="0"/>
    <maxContours value="0"/>
    <maxCompositePoints value="0"/>
    <maxCompositeContours value="0"/>
    <maxZones value="1"/>
    <maxTwilightPoints value="0"/>
    <maxStorage value="0"/>
    <maxFunctionDefs value="0"/>
    <maxInstructionDefs value="0"/>
    <maxStackElements value="0"/>
    <maxSizeOfInstructions value="0"/>
    <maxComponentElements value="0"/>
    <maxComponentDepth value="0"/>
  </maxp>

  <OS_2>
    <!-- The fields 'usFirstCharIndex' and 'usLastCharIndex'
         will be recalculated by the compiler -->
    <version value="3"/>
    <xAvgCharWidth value="0"/>
    <usWeightClass value="400"/>
    <usWidthClass value="5"/>
    <fsType value="00000000 00000000"/>
    <ySubscriptXSize value="0"/>
    <ySubscriptYSize value="0"/>
    <ySubscriptXOffset value="0"/>
    <ySubscriptYOffset value="0"/>
    <ySuperscriptXSize value="0"/>
    <ySuperscriptYSize value="0"/>
    <ySuperscriptXOffset value="0"/>
    <ySuperscriptYOffset value="0"/>
    <yStrikeoutSize value="0"/>
    <yStrikeoutPosition value="0"/>
    <sFamilyClass value="0"/>
    <panose>
      <bFamilyType value="5"/>
      <bSerifStyle value="0"/>
      <bWeight value="1"/>
      <bProportion value="0"/>
      <bContrast value="1"/>
      <bStrokeVariation value="0"/>
      <bArmStyle value="0"/>
      <bLetterForm value="0"/>
      <bMidline value="0"/>
      <bXHeight value="0"/>
    </panose>
    <ulUnicodeRange1 value="00000000 00000000 00000000 00000000"/>
    <ulUnicodeRange2 value="00000000 00000000 00000000 00000000"/>
    <ulUnicodeRange3 value="00000000 00000000 00000000 00000000"/>
    <ulUnicodeRange4 value="00000000 00000000 00000000 00000000"/>
    <achVendID value="GOOG"/>
    <fsSelection value="00000000 01000000"/>
    <usFirstCharIndex value="65535"/>
    <usLastCharIndex value="0"/>
    <sTypoAscender value="1"/>
    <sTypoDescender value="-1"/>
    <sTypoLineGap value="0"/>
    <usWinAscent value="1"/>
    <usWinDescent value="1"/>
    <ulCodePageRange1 value="10000000 00000000 00000000 00000000"/>
    <ulCodePageRange2 value="00000000 00000000 00000000 00000000"/>
    <sxHeight value="0"/>
    <sCapHeight value="0"/>
    <usDefaultChar value="0"/>
    <usBreakChar value="1"/>
    <usMaxContext value="0"/>
  </OS_2>

  <hmtx>
    <mtx name=".notdef" width="0" lsb="0"/>
    <mtx name=".null" width="0" lsb="0"/>
  </hmtx>

  <cmap>
    <tableVersion version="0"/>
    <cmap_format_6 platformID="1" platEncID="0" language="0">
      <map code="0x0" name=".notdef"/>
    </cmap_format_6>
    <cmap_format_6 platformID="3" platEncID="0" language="0">
      <map code="0x0" name=".notdef"/><!-- ???? -->
    </cmap_format_6>
  </cmap>

  <loca>
    <!-- The 'loca' table will be calculated by the compiler -->
  </loca>

  <glyf>

    <!-- The xMin, yMin, xMax and yMax values
         will be recalculated by the compiler. -->

    <TTGlyph name=".notdef"/><!-- contains no outline data -->

    <TTGlyph name=".null"/><!-- contains no outline data -->

  </glyf>

  <name>
    <namerecord nameID="5" platformID="0" platEncID="3" langID="0x0">
      Version 1.0
    </namerecord>
    <namerecord nameID="5" platformID="1" platEncID="0" langID="0x0" unicode="True">
      Version 1.0
    </namerecord>
    <namerecord nameID="5" platformID="3" platEncID="1" langID="0x409">
      Version 1.0
    </namerecord>
  </name>

  <post>
    <formatType value="1.0"/>
    <italicAngle value="0.0"/>
    <underlinePosition value="0"/>
    <underlineThickness value="0"/>
    <isFixedPitch value="1"/>
    <minMemType42 value="0"/>
    <maxMemType42 value="0"/>
    <minMemType1 value="0"/>
    <maxMemType1 value="0"/>
  </post>

</ttFont>
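The vertical metrics in this dump can be put on the same scale as the FontDescriptor shown earlier by normalising both to em units. A small sketch using only the standard library; `TTX_EXCERPT` is a trimmed copy of the dump above that keeps just the three values being read, and `em_metrics` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the ttx dump: only the values read below are kept.
TTX_EXCERPT = """<ttFont>
  <head><unitsPerEm value="256"/></head>
  <hhea><ascent value="1"/><descent value="-1"/></hhea>
</ttFont>"""

def em_metrics(ttx_xml):
    """Return (ascent, descent) from the hhea table as fractions of an em."""
    root = ET.fromstring(ttx_xml)
    upem = int(root.find("head/unitsPerEm").get("value"))
    ascent = int(root.find("hhea/ascent").get("value"))
    descent = int(root.find("hhea/descent").get("value"))
    return ascent / upem, descent / upem

font_ascent, font_descent = em_metrics(TTX_EXCERPT)
# FontDescriptor values (/Ascent 500, /Descent -1) are in 1/1000 em.
pdf_ascent, pdf_descent = 500 / 1000, -1 / 1000
print(font_ascent, font_descent)
print(pdf_ascent, pdf_descent)
```

The font file declares an ascent of 1/256 em while the FontDescriptor declares 0.5 em, so a renderer gets different box heights depending on which source it trusts.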

yurydelendik · 2015-10-09T18:09:16Z

See https://github.com/mozilla/pdf.js/wiki/Debugging-PDF.js for how to enable the debugging tools. PDF.js uses the browser's fonts to render the text layer, and the text layer looks different on Mac OS X, probably due to the metrics of the browser's fonts.

The font you posted above is somewhat unrelated; however, the metrics in it do not match the metrics in the PDF (check http://brendandahl.github.io/pdf.js.utils/browser/).

Checking the angle value at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L174, it looks like it is reporting an unexpected -π value for the [-1,0,0,1] transform. I think you would expect 0 there. That causes the ascender value to be used during the top coordinate calculation.
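Why a mirrored matrix can look like a half-turn: a minimal sketch, assuming the angle is derived as atan2(b, a) from the matrix [a b c d e f]. This is an illustration in Python, not the pdf.js code, and `unmirrored_angle` is a hypothetical alternative rather than the fix that was adopted.

```python
import math

def text_angle(tm):
    """Angle of the transformed x axis, taken as atan2(b, a)."""
    a, b = tm[0], tm[1]
    return math.atan2(b, a)

def unmirrored_angle(tm):
    """Hypothetical variant: treat a negative determinant as a horizontal
    mirror and measure the rotation of the un-mirrored x axis instead."""
    a, b, c, d = tm[:4]
    if a * d - b * c < 0:
        return math.atan2(-b, -a)
    return math.atan2(b, a)

english_tm = (1, 0, 0, 1, 16.457, 19.229)
hebrew_tm = (-1, 0, 0, 1, 122.4, 19.229)

print(text_angle(english_tm))       # no rotation
print(text_angle(hebrew_tm))        # magnitude pi: reads as a 180 degree turn
print(unmirrored_angle(hebrew_tm))  # mirror handled separately, no rotation
```

Whether atan2 returns +π or -π here depends on the sign of the zero in b (a negative zero gives -π), but either way the pure mirror is reported as a half-turn even though the baseline is unchanged.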

jbreiden · 2015-10-09T20:04:05Z

Problem occurs without any Hebrew involved.

1 0 0 1 16 18 Tm /f-0-0 25 Tf 98.666 Tz [ <0061><006C><006F> ] TJ

jbreiden · 2015-10-22T21:47:02Z

FYI, millions of digitized books are affected.

rlucha · 2016-06-06T11:03:34Z

We have the same problem with our PDFs OCR'ed with Tesseract. Is there any plan to fix this in the future?

jbreiden · 2016-06-14T00:24:44Z

Duplicate of #6863

timvandermeij · 2021-02-13T15:35:24Z

This changed after PR #12896 in the sense that the "alo" part of the original PDF file is now correct, but the Hebrew part is unfortunately not yet correct.

timvandermeij added the text-selection label Oct 6, 2015

timvandermeij mentioned this issue Jan 13, 2016

Misaligned text selection due to overriding font metrics from dummy invisible font file (used when OCRing) #6863

Closed

jbreiden mentioned this issue Feb 13, 2016

Replace pdf.ttf with sharp2.ttf, keep name the same tesseract-ocr/tesseract#220

Merged

Rob--W mentioned this issue Feb 21, 2018

ocr-ed pdfs from tesseract not searchable. #9096

Closed

jbreiden mentioned this issue Mar 2, 2018

Add interword space option to HOCR pdf renderer ocrmypdf/OCRmyPDF#225

Closed

jbreiden3 mentioned this issue Feb 10, 2020

Invisible glyph bounds at wrong positions in PDF tesseract-ocr/tesseract#2879

Closed
