text highlighting quirk on PDF files produced by Tesseract #6509
Comments
There seems to be an issue with how the top value for the text overlay is calculated, but nothing seems obviously wrong to me. The relevant code is at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164-L180 |
The relevant PDF objects and the embedded glyphless font say that the Hebrew and English words should highlight identically. I suggest tracing what is causing the difference. |
Here you can see that we are placing the Hebrew and English words on the exact same y position.
Here you can see that the PDF claims there are no descenders.
And most importantly, we are mapping every single character to the same invisible, empty glyph. Here are links to the code/documentation and to the custom-designed font. https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp My best guess (without reading the pdf.js code) is that the dimension information in the font itself and in the relevant PDF objects is being ignored. Instead, a heuristic looks at the Unicode mapping of the characters and says "This might be English, English has descenders, move stuff around!" If that is the case, I'd really like to know what I can do to avoid triggering such heuristics. |
I missed before that they should be on the same line. It appears the issue is with how we calculate the angle for the text. For the Hebrew word there is a negative x scale component, which seems to be causing issues on our side. Looking into how this should be working....
|
The -1 just means that I am placing characters from right to left (because Hebrew is a right-to-left language). This is not terribly common practice, but it makes sense, especially when working with an invisible glyphless font. Please note that I'm claiming the problem is with the English word. The highlight region extends way below the baseline, and it should not be doing that. This problem is 100% reproducible, and occurs for every document produced by Tesseract, including pure-English ones. |
I see this amazing image after a successful copy-paste operation. There is a ghostlike, white-on-gray symbolic text overlaid on the image. I have no idea what it means, or where the font is coming from. It certainly is not the font embedded in the PDF, because that one is glyphless. The English word is too low, and the Hebrew word has each character rotated 180 degrees. Maybe this provides a clue. |
To enable text selection we create an invisible DOM overlay, so that is what you're seeing. The overlay doesn't use the embedded font. PDF.js tries to line up the text layer with the underlying canvas, but as we see above this doesn't always work correctly. |
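To illustrate the point about the -1 x scale, here is a minimal sketch of how a PDF-style matrix `[a, b, c, d, e, f]` maps a point. The function name `applyTextMatrix` and the sample numbers (origin at 100, 50 and an advance of 10 units) are hypothetical, chosen only for illustration; they are not taken from Tesseract's output or from pdf.js.

```javascript
// Hypothetical helper: apply a PDF-style matrix [a, b, c, d, e, f]
// to a point (x, y), using x' = a*x + c*y + e and y' = b*x + d*y + f.
function applyTextMatrix([a, b, c, d, e, f], [x, y]) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// Assumed example values: a mirrored matrix with its origin at (100, 50).
const mirroredMatrix = [-1, 0, 0, 1, 100, 50];

// Advancing 10 units along the text direction moves left, not right,
// while the y coordinate (the baseline) stays where it was.
const origin = applyTextMatrix(mirroredMatrix, [0, 0]);
const advanced = applyTextMatrix(mirroredMatrix, [10, 0]);
console.log(origin, advanced);
```

Under these assumptions the x coordinate decreases as the text advances, but the baseline y is untouched, which is consistent with the claim that the negative scale only changes the direction in which characters are placed.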
Just in case it helps, this is a dump of the font embedded in the PDF, using ttx.
<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="\x00\x01\x00\x00" ttLibVersion="2.5">
<GlyphOrder>
<!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
<GlyphID id="0" name=".notdef"/>
<GlyphID id="1" name=".null"/>
</GlyphOrder>
<head>
<!-- Most of this table will be recalculated by the compiler -->
<tableVersion value="1.0"/>
<fontRevision value="1.0"/>
<checkSumAdjustment value="0xa737b34c"/>
<magicNumber value="0x5f0f3cf5"/>
<flags value="00000100 00000111"/>
<unitsPerEm value="256"/>
<created value="Thu May 15 23:21:18 2014"/>
<modified value="Thu May 15 23:21:18 2014"/>
<xMin value="0"/>
<yMin value="-32768"/>
<xMax value="0"/>
<yMax value="1"/>
<macStyle value="00000000 00000000"/>
<lowestRecPPEM value="16"/>
<fontDirectionHint value="2"/>
<indexToLocFormat value="0"/>
<glyphDataFormat value="0"/>
</head>
<hhea>
<tableVersion value="1.0"/>
<ascent value="1"/>
<descent value="-1"/>
<lineGap value="0"/>
<advanceWidthMax value="0"/>
<minLeftSideBearing value="0"/>
<minRightSideBearing value="0"/>
<xMaxExtent value="0"/>
<caretSlopeRise value="1"/>
<caretSlopeRun value="0"/>
<caretOffset value="0"/>
<reserved0 value="0"/>
<reserved1 value="0"/>
<reserved2 value="0"/>
<reserved3 value="0"/>
<metricDataFormat value="0"/>
<numberOfHMetrics value="2"/>
</hhea>
<maxp>
<!-- Most of this table will be recalculated by the compiler -->
<tableVersion value="0x10000"/>
<numGlyphs value="2"/>
<maxPoints value="0"/>
<maxContours value="0"/>
<maxCompositePoints value="0"/>
<maxCompositeContours value="0"/>
<maxZones value="1"/>
<maxTwilightPoints value="0"/>
<maxStorage value="0"/>
<maxFunctionDefs value="0"/>
<maxInstructionDefs value="0"/>
<maxStackElements value="0"/>
<maxSizeOfInstructions value="0"/>
<maxComponentElements value="0"/>
<maxComponentDepth value="0"/>
</maxp>
<OS_2>
<!-- The fields 'usFirstCharIndex' and 'usLastCharIndex'
will be recalculated by the compiler -->
<version value="3"/>
<xAvgCharWidth value="0"/>
<usWeightClass value="400"/>
<usWidthClass value="5"/>
<fsType value="00000000 00000000"/>
<ySubscriptXSize value="0"/>
<ySubscriptYSize value="0"/>
<ySubscriptXOffset value="0"/>
<ySubscriptYOffset value="0"/>
<ySuperscriptXSize value="0"/>
<ySuperscriptYSize value="0"/>
<ySuperscriptXOffset value="0"/>
<ySuperscriptYOffset value="0"/>
<yStrikeoutSize value="0"/>
<yStrikeoutPosition value="0"/>
<sFamilyClass value="0"/>
<panose>
<bFamilyType value="5"/>
<bSerifStyle value="0"/>
<bWeight value="1"/>
<bProportion value="0"/>
<bContrast value="1"/>
<bStrokeVariation value="0"/>
<bArmStyle value="0"/>
<bLetterForm value="0"/>
<bMidline value="0"/>
<bXHeight value="0"/>
</panose>
<ulUnicodeRange1 value="00000000 00000000 00000000 00000000"/>
<ulUnicodeRange2 value="00000000 00000000 00000000 00000000"/>
<ulUnicodeRange3 value="00000000 00000000 00000000 00000000"/>
<ulUnicodeRange4 value="00000000 00000000 00000000 00000000"/>
<achVendID value="GOOG"/>
<fsSelection value="00000000 01000000"/>
<usFirstCharIndex value="65535"/>
<usLastCharIndex value="0"/>
<sTypoAscender value="1"/>
<sTypoDescender value="-1"/>
<sTypoLineGap value="0"/>
<usWinAscent value="1"/>
<usWinDescent value="1"/>
<ulCodePageRange1 value="10000000 00000000 00000000 00000000"/>
<ulCodePageRange2 value="00000000 00000000 00000000 00000000"/>
<sxHeight value="0"/>
<sCapHeight value="0"/>
<usDefaultChar value="0"/>
<usBreakChar value="1"/>
<usMaxContext value="0"/>
</OS_2>
<hmtx>
<mtx name=".notdef" width="0" lsb="0"/>
<mtx name=".null" width="0" lsb="0"/>
</hmtx>
<cmap>
<tableVersion version="0"/>
<cmap_format_6 platformID="1" platEncID="0" language="0">
<map code="0x0" name=".notdef"/>
</cmap_format_6>
<cmap_format_6 platformID="3" platEncID="0" language="0">
<map code="0x0" name=".notdef"/><!-- ???? -->
</cmap_format_6>
</cmap>
<loca>
<!-- The 'loca' table will be calculated by the compiler -->
</loca>
<glyf>
<!-- The xMin, yMin, xMax and yMax values
will be recalculated by the compiler. -->
<TTGlyph name=".notdef"/><!-- contains no outline data -->
<TTGlyph name=".null"/><!-- contains no outline data -->
</glyf>
<name>
<namerecord nameID="5" platformID="0" platEncID="3" langID="0x0">
Version 1.0
</namerecord>
<namerecord nameID="5" platformID="1" platEncID="0" langID="0x0" unicode="True">
Version 1.0
</namerecord>
<namerecord nameID="5" platformID="3" platEncID="1" langID="0x409">
Version 1.0
</namerecord>
</name>
<post>
<formatType value="1.0"/>
<italicAngle value="0.0"/>
<underlinePosition value="0"/>
<underlineThickness value="0"/>
<isFixedPitch value="1"/>
<minMemType42 value="0"/>
<maxMemType42 value="0"/>
<minMemType1 value="0"/>
<maxMemType1 value="0"/>
</post>
</ttFont> |
See https://github.com/mozilla/pdf.js/wiki/Debugging-PDF.js for how to enable the debugging tools. PDF.js uses the browser's font to render the text layer, and the text layer on Mac OS X looks different, probably due to the metrics of the browser's fonts. The font you posted above is somewhat unrelated; however, the metrics in it do not match the metrics in the PDFs (check http://brendandahl.github.io/pdf.js.utils/browser/). Checking the angle value at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L174, it looks like it is reporting an unexpected -π value for the [-1,0,0,1] transform. I think you would expect 0 there. That causes the ascender value to be used during the top coordinate calculation. |
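The angle problem described above can be reproduced in isolation. The sketch below is not the pdf.js implementation; `naiveAngle` and `mirrorAwareAngle` are hypothetical helpers, and treating a negative determinant as a horizontal flip is an assumption made for this example (a mirrored matrix can also be read as a vertical flip plus a different rotation).

```javascript
// Hypothetical helper: read a rotation angle off the first column
// of a transform [a, b, c, d], as a plain atan2 would.
function naiveAngle([a, b]) {
  return Math.atan2(b, a);
}

// Hypothetical alternative: a negative determinant means the matrix
// contains a reflection. Assuming the reflection is a horizontal flip
// (right-to-left placement), undo it before measuring the rotation.
function mirrorAwareAngle([a, b, c, d]) {
  const mirrored = a * d - b * c < 0;
  return mirrored ? Math.atan2(-b, -a) : Math.atan2(b, a);
}

const rtl = [-1, 0, 0, 1]; // the transform discussed in this thread
const ltr = [1, 0, 0, 1];  // ordinary left-to-right text

// The naive version reports a half-turn for the mirrored matrix,
// while the mirror-aware version reports no rotation for either.
console.log(Math.abs(naiveAngle(rtl)) === Math.PI);
console.log(Math.abs(mirrorAwareAngle(rtl)), mirrorAwareAngle(ltr));
```

The point of the sketch is only that `[-1,0,0,1]` is a reflection rather than a 180 degree rotation, so an angle derived from `atan2` alone cannot tell the two apart.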
FYI, millions of digitized books are affected. |
We have the same problem with our PDFs OCR'ed with Tesseract. Is there any plan to fix this in the future? |
Duplicate of #6863 |
This changed after PR #12896 in the sense that the |
Programs like Tesseract are used to OCR documents. Basically, we take
a photographic image, recognize any symbolic text, and then compose a PDF
consisting of the photograph and an invisible symbolic text layer for copy-paste
and search.
I am the author of the relevant pdf generation code, and similar code in other
programs. We get very good results in many PDF renderers including pdfium
and poppler, but get misaligned highlighting from pdf.js in Firefox.
GitHub is refusing to let me post a simple example PDF here, so I am
providing a URL instead of attachment. This is a very simple example from
our test suite. I have 100% control over the PDF generation code and
understand everything about it, so if there is any complaint about it let me
know and we'll work it out.
http://leptonica.org/jbreiden/simple-1.pdf
Build identifier: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0