Some programs can't find OCR text in Tesseract's PDFs (3.04) #182

jbarlow83 · 2016-01-04T20:42:24Z

While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.

pdftotext produces empty output.
Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched.
PyPDF2 extractText also produces an empty string as text.


amitdo · 2016-01-04T21:13:16Z

See #170.

jbarlow83 · 2016-01-04T21:25:41Z

#170 might be related, but the files I checked did not have tilted or skewed text.

Input file:

Output:
linn.pdf

tesseract version
tesseract 3.04.00 leptonica-1.72 libjpeg 8d : libpng 1.6.19 : libtiff 4.0.6 : zlib 1.2.5

amitdo · 2016-01-04T22:34:29Z

Chromium's pdf reader output (cut&paste):

182-chromium.txt

amitdo · 2016-01-04T23:06:43Z

I ran Tesseract (latest commit from the repo) on your jpg image.

tesseract i182.jpg i182 -l eng txt pdf hocr

Evince's output (cut&paste):

182-evince.txt

Evince is based on Poppler.

amitdo · 2016-01-04T23:33:14Z

Here are the output files...

i182.pdf
i182.txt
i182-hocr.zip

jbarlow83 · 2016-01-04T23:34:46Z

Chrome's PDF reader works for me.

I have poppler 0.39.0 installed (homebrew/OS X/El Capitan).

I believe I found the reason. It appears that the readers that struggle with it do not support Tesseract's usage of hexadecimal code points rather than literal characters in the output stream.

The PostScript content stream for this page as generated by Tesseract for the first word, "The" appears as follows:

 Tz [ <0054><0068><0065> ] TJ

where <0054> = U+0054 = T, <0068> = U+0068 = h, etc. I have run into other situations where this hexadecimal notation causes parsing difficulties for some PDF readers.
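To make the notation concrete, here is a minimal Python sketch that decodes a TJ operand written this way. It assumes, as the comment above does, that each four-digit group is a UTF-16 big-endian code unit; a real PDF reader would map the character codes through the font's encoding or ToUnicode CMap instead. The function name `decode_hex_tj` is made up for illustration.

```python
import re


def decode_hex_tj(operand: str) -> str:
    """Decode the <...> hex strings of a TJ operand as UTF-16BE text.

    Assumption: the codes are plain Unicode code units, which is how the
    comment above reads them. This is not a general PDF text extractor.
    """
    groups = re.findall(r"<([0-9A-Fa-f]+)>", operand)
    raw = b"".join(bytes.fromhex(group) for group in groups)
    return raw.decode("utf-16-be")


if __name__ == "__main__":
    print(decode_hex_tj("[ <0054><0068><0065> ] TJ"))  # prints: The
```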

Acrobat generates the equivalent segment with ASCII literals.

[...omitted...] Tm
(The )Tj

Longer excerpts for comparison:

Tesseract

BT    
3 Tr 1 0 0 1 211.68 744 Tm /f-0-0 21 Tf 117.334 Tz [ <0054><0068><0065> ] TJ

Acrobat

BT
0.196 0.184 0.188 rg
/T1_0 1 Tf
-0.035 Tc 3 Tr 23.4905 0 0 23.7001 211.43 744.24 Tm
(The )Tj
ET

amitdo · 2016-01-04T23:45:33Z

Did you see my 2 last comments?
The latest commit from the repo produces better pdf results than version 3.04.

jbarlow83 · 2016-01-05T00:00:43Z

Yes. Preview and poppler are still incapable of reading your i182.pdf. I observed no difference.

My comparison didn't address how Acrobat handles Unicode, and Unicode literals cannot appear in PostScript, so I checked how this is done. When Acrobat encodes a Unicode string, it uses UTF-16 big-endian code units in hexadecimal, like this:

... Tm
<4E8B5F97771F5BF9770B89C152A066F4591A5C11>Tj

That string encodes 10 characters, all in the Basic Multilingual Plane (U+FFFF or below), which are these:
事得真对看见加更多少
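The correspondence between the hex string and the ten characters can be reproduced with a short Python sketch. It goes in the encoding direction (text to hex string) and also lists the code points. Both helper names are hypothetical, and the sketch only covers the plain UTF-16BE case described in this comment, not fonts with custom encodings.

```python
from typing import List


def to_pdf_hex_string(text: str) -> str:
    """Encode text as a PDF hex string of UTF-16 big-endian code units."""
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


def code_points(text: str) -> List[str]:
    """List the Unicode code point of each character, e.g. 'U+4E8B'."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    sample = "事得真对看见加更多少"
    print(to_pdf_hex_string(sample))
    print(code_points(sample))
```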

So it appears that Tesseract's method of encoding text strings is nonstandard. I checked the PDF 1.7 reference manual, and couldn't find an example matching Tesseract's output syntax.

amitdo · 2016-01-05T00:16:01Z

My libpoppler version is 0.24.5. Ubuntu 14.04.

pdftotext i182.pdf i182t.txt

Here is the pdftotext output:
i182t.txt

amitdo · 2016-01-05T00:28:23Z

cc: @jbreiden
jbreiden wrote Tesseract's pdf renderer code.

jbarlow83 · 2016-01-05T00:41:36Z

Okay, for some reason pdftotext will not output to stdout but will produce a valid text file for the files we've been working on. My quick guess is that pdftotext suppresses its stdout if high ASCII characters are present, which tesseract finds here (some n-dashes and smart quotes). Both poppler 0.24.5 and 0.34 behave as expected when asked to save to a file, so the text stream is accessible to pdftotext. In short, poppler is working fine for me.

That said, OS X Preview and parsers like PyPDF2 still struggle with how Tesseract encodes text, as far as I can tell.

I checked that reportlab also encodes text strings in the manner of Acrobat, and Preview has no problems with PDFs produced by Tesseract -> hOCR -> reportlab PDF. This is an example of such a file:

linn_hocr_unc.pdf

amitdo · 2016-01-05T01:28:21Z

From https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/XllxjvK5HtU

Jeff Breidenbach 7/17/15
PROBLEM #2: PDF
I was looking at a PDF problem report and noticed that Tesseract PDF output
is no longer validating. (It fails qpdf --check). As the author of the pdf module,
I'm biased, but producing corrupt data is a disaster and I think we need to cut
a new release once it is figured out. Most PDF viewers will recover and silently
ignore, but this is no good at all. I wonder what happened.

amitdo · 2016-01-05T01:55:20Z

Try this to output to stdout:

pdftotext i182.pdf -
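For scripted checks, the same command can be wrapped in Python. This is only a sketch: it assumes `pdftotext` is on the PATH and that its output is UTF-8, and the helper names are invented here. Without the trailing `-`, pdftotext writes to a `.txt` file next to the PDF rather than to stdout, which is why the earlier runs appeared to print nothing.

```python
import subprocess
from typing import List


def pdftotext_stdout_command(pdf_path: str) -> List[str]:
    """Build the command line; the trailing '-' sends the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def looks_like_text(output: str) -> bool:
    """True if the output has anything besides whitespace and form feeds."""
    return bool(output.strip())


def has_text_layer(pdf_path: str) -> bool:
    """Run pdftotext (assumed to be installed) and report whether it found text."""
    result = subprocess.run(
        pdftotext_stdout_command(pdf_path), capture_output=True, check=True
    )
    return looks_like_text(result.stdout.decode("utf-8", errors="replace"))
```

Calling `has_text_layer("i182.pdf")` would then return `True` for a file whose text layer pdftotext can read.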

Jeff mentioned qpdf.
Links:
http://qpdf.sourceforge.net
https://github.com/qpdf/qpdf

jbarlow83 · 2016-01-05T02:38:51Z

qpdf says it's okay, but it doesn't check everything.

jbreiden · 2016-01-05T03:42:52Z

I might not have time to take a look until Wednesday. Validators of various flavors include jhove, jhove-pdf-a, pdfbox, ITextRUPS, and http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx. (Note that Tesseract PDFs are not expected to be PDF/A compliant.) I did compatibility testing with Apple's Preview at design time, but I don't test against it regularly. Never tried PyPDF2. If I had to guess right now, I'd suspect it might be the invisible font improvement that was written for better ghostscript compatibility. Unlikely to be the hex encoding.

https://code.google.com/p/tesseract-ocr/issues/detail?id=1434
http://bugs.ghostscript.com/show_bug.cgi?id=695869

jbreiden · 2016-01-05T04:52:50Z

Looking at issue 181, it's looking more and more like Preview is unhappy with the revised glyphless font, possibly due to the zero advance width. Will try to borrow a Mac and play with it, hopefully on Wednesday.

jbarlow83 · 2016-01-05T08:12:01Z

@jbreiden I agree that the glyphless font issue seems more probable.

Aside: I wouldn't trust JHOVE for PDF validation. For JHOVE to approve is better than not approving, but its analysis is rudimentary, and in my experience it produces more false positives and negatives than useful diagnostics.

jbreiden · 2016-01-06T01:18:43Z

I produced this PDF using Tesseract, then borrowed a laptop running Mac OS X version 10.10.5 and was able to both search and copy-paste from Preview (Although the copy-paste highlighting was kind of weird). My testing copy of Tesseract is not completely synchronized with GitHub, so if needed we can investigate that. How does this PDF perform for you on Preview, @jbarlow83 ?

2.pdf

There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince. I don't notice any compatibility differences at all, but mentioning in case someone wants to play with it. Have not checked compatibility with Ghostscript.

https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true

Finally, this was my test image (I was actually using TIFF but GitHub doesn't let me attach that)

jbarlow83 · 2016-01-06T01:29:19Z

Doesn't work in Preview on OS X 10.11.2 (highlights properly, but no copy-paste or search). I have access to two other OS X machines and will check those later today.

I checked with my iPhone too. Both Chrome iOS (PDFium?) via the Gmail app and Safari struggle to highlight text (they only allow highlighting a single character) and cannot copy.

jbreiden · 2016-01-06T04:03:40Z

This one uses the alternate font that has an advance width.

alternate.pdf

jbarlow83 · 2016-01-06T06:47:48Z

alternate works on OS X Preview and my iPhone.

I did notice that spaces are sometimes missing in OS X's copy and paste text, while pdftotext shows the spaces, so perhaps it's not 100% but clearly this was the main issue.

components of the relative motions of the fixed , stars with respect to the earth on the colour of thelightreachingusfromthem. Thelattereffect manifests itself in a slight displacement of the spectral lines of the light transmitted to us from

a fixed star, as compared with the position of the same spectral lines when they are produced by a terrestrial source of light (Doppler principle). The experimental arguments in favour of the Maxwell-Lorentz theory, which are at the;same time arguments in favour of the theory of rela- tivity,aretoonumeroustobesetforthhere. In reality they limit the theoretical possibilities to such an extent, that no other theory than that of Maxwell and Lorentz has been able to hold its ownwhentestedbyexperience.

But there are two classes of experimental facts hitherto obtained which can be represented in the Maxwell-Lorentz theory only by the introduction of an auxiliary hypothesis, which in itself—i.e. without making use of the theory of relativity— appears extraneous.

Itisknownthatcathoderaysandtheso-called B—rays emitted by radioactive substances consist of negatively electrified particles (electrons) of verysmallinertiaandlargevelocity. By examin- ing the deflection of these rays under the influence of electric and magnetic fields, we can study the

law of motion of these particles very exactly.

behdad · 2016-01-06T10:37:32Z

Output:
linn.pdf

For me, pdftotext outputs no text, but Evince, which also uses Poppler, correctly selects and extracts text.

amitdo · 2016-01-06T11:43:54Z

@behdad, try this:

pdftotext linn.pdf -

behdad · 2016-01-06T11:55:46Z

@behdad, try this:
pdftotext linn.pdf -

Hah. My bad. Thanks :)

jbreiden · 2016-01-06T17:48:53Z

I got my hands on an iPad running iOS 9.2 and reproduced the problem. On iOS/Safari I cannot search 2.pdf (Ken Sharp's font) but can search with alternate.pdf (Behdad's font). Took me quite a while to figure out how to make the search controls work.

So for your immediate problem, go ahead and substitute in Behdad's font into tessdata/pdf.ttf and you should be okay. We won't do that officially without a whole bunch more compatibility testing and reports, including the harder languages (Cherokee, vertical Japanese, Arabic) and additional renderers including Ghostscript and Firefox. Compatibility reports are appreciated.

https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true
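The substitution described above amounts to copying the replacement font over tessdata/pdf.ttf. A minimal Python sketch follows; the function name and the `pdf.ttf.orig` backup name are choices made here for illustration, and the location of the tessdata directory depends on the installation.

```python
import shutil
from pathlib import Path


def swap_pdf_font(tessdata_dir, replacement_font) -> Path:
    """Replace tessdata/pdf.ttf with another font, keeping a one-time backup.

    The file must keep the name pdf.ttf. The backup name pdf.ttf.orig is
    an arbitrary choice for this sketch.
    """
    tessdata = Path(tessdata_dir)
    target = tessdata / "pdf.ttf"
    backup = tessdata / "pdf.ttf.orig"
    # Only back up the first time, so a repeated run keeps the true original.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(replacement_font, target)
    return backup
```

For example, `swap_pdf_font("/path/to/tessdata", "tofu.ttf")`, where the tessdata path is a placeholder for wherever the local installation keeps it.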

Regarding the words running together on the Apple PDF renderer, that's not new. Apple PDF seems to do a worse job than everyone else at deciding word boundaries, and I've seen them screw up plenty of regular born-digital PDF files in the same way. Of course the root cause is the PDF spec itself, which does not explicitly define the concept of a word boundary. So I can't help you, but at least it isn't a regression. It's possible that Apple will get their act together a little better on this some day, but I have no reason to believe that it is on their radar.
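Because the spec leaves word boundaries undefined, an extractor has to infer spaces from glyph positions. The Python sketch below shows one plausible heuristic, a gap threshold relative to average glyph width. It is purely illustrative: it is not how Apple's renderer, Poppler, or any other real viewer is known to work, and the `Glyph` layout and `gap_ratio` default are assumptions.

```python
from typing import List, Tuple

# (character, x_start, x_end) in page units; a hypothetical simplified layout
Glyph = Tuple[str, float, float]


def join_with_inferred_spaces(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Join glyphs left to right, inserting a space where the gap is wide.

    A gap counts as a word break when it exceeds gap_ratio times the
    average glyph width on the line.
    """
    if not glyphs:
        return ""
    average_width = sum(end - start for _, start, end in glyphs) / len(glyphs)
    pieces = [glyphs[0][0]]
    for previous, current in zip(glyphs, glyphs[1:]):
        gap = current[1] - previous[2]
        if gap > gap_ratio * average_width:
            pieces.append(" ")
        pieces.append(current[0])
    return "".join(pieces)
```

With a stricter threshold the same input loses its spaces, which is the kind of run-together output pasted earlier in the thread; two tools with different thresholds can disagree on the same PDF.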

amitdo · 2016-01-08T09:32:38Z

There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince.

It looks terrible :(

behdad · 2016-01-08T11:28:19Z

My font has a huge advance width, because it was designed for another purpose. Someone should create one with an advance width of 1024 instead of my 20480.

jbreiden · 2016-01-08T19:23:36Z

The PDF is keeping the advance width under control for Behdad's font. We're probably seeing something else. It's kind of a cute zebra pattern. You get a black underline, and black boxes in all word gaps and in some letter gaps. (Obviously evince is doing a really bad job, but this is much worse than with Ken Sharp's font, which highlights as a solid black bar.) A little hard for me to investigate, since my copy of ttx is not cooperating.

P.S. The font advance width should probably be 512 to match what we specify in the PDF. But again, I don't expect that to change anything for evince.
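Changing the advance width of every glyph is a small edit to the font's horizontal metrics. The Python sketch below keeps the core step as a pure function over a name-to-(advance, left side bearing) mapping, which is the shape fontTools exposes as `font["hmtx"].metrics`. The `rewrite_font` wrapper assumes the third-party fontTools package is installed and has not been run against the fonts in this thread; both function names are hypothetical.

```python
from typing import Dict, Tuple

# glyph name -> (advance width, left side bearing)
Metrics = Dict[str, Tuple[int, int]]


def with_advance_width(metrics: Metrics, advance: int) -> Metrics:
    """Return new metrics with every advance width set to `advance`."""
    return {name: (advance, lsb) for name, (_, lsb) in metrics.items()}


def rewrite_font(src: str, dst: str, advance: int = 512) -> None:
    """Write a copy of the font at src with a uniform advance width.

    Assumption: fontTools is available. It is imported lazily so the pure
    helper above works without it.
    """
    from fontTools.ttLib import TTFont

    font = TTFont(src)
    hmtx = font["hmtx"]
    hmtx.metrics = with_advance_width(hmtx.metrics, advance)
    font.save(dst)
```

The default of 512 follows the value suggested in the P.S. above; whether it changes anything for evince is untested here.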

amitdo · 2016-01-08T20:38:26Z

If you search for a phrase in evince, the highlighting looks more normal.
Strange!

iikka-v · 2016-02-01T19:22:04Z

Yes, utterly. I ran the tofu.ttf and the old pdf.ttf through Apple's font validator. Both produced errors, but tofu.ttf only one, whereas the old pdf.ttf had additional "name table usability" errors. Please post the above font files (or diffs) and I'll run them through the validator as well. Perhaps this will give some insight into the issue.

jbreiden · 2016-02-01T20:09:45Z

Fonts as per request. I do not know if my modification tool (ttx) corrupts anything along the way. So far the experiments suggest that Apple software requires a contour, and a contour cosmetically messes with evince.

pdf.ttf - currently shipping font, by Ken Sharp
sharp.ttf - with advance width added

tofu.ttf - alternate font from behdad
behdad.ttf - with advance width reduced
behdad2.ttf - with contour removed

fonts.zip

iikka-v · 2016-02-01T20:18:12Z

Thanks. Here's the verbose error report as given by Apple's ftxvalidator (there's not really a version for 10.11, so some of this might be inaccurate). All report fatal errors, and most errors are beyond my (admittedly limited) expertise on the subject. I hope they make more sense to you.

Uploading ftxvalidator_report.txt…

jbreiden · 2016-02-01T20:19:41Z

Can you please edit that report and make it an attachment or something? The giant wall of text makes this bug harder to read.

behdad · 2016-02-02T03:53:46Z

Partially blocked by fonttools/fonttools#497

Fixed now.

jbreiden · 2016-02-05T21:48:07Z

For completeness, here is Ken Sharp's font with a contour added in.

FONT
sharp2.zip

PDF
sharp2.pdf

At this point, sharp2.ttf and behdad.ttf are the only fonts compatible with Apple Preview. They both come at the cost of highlight aesthetics with evince. I think Preview is incorrect to require a contour for the glyph, and I think evince is incorrect to consider a contour when highlighting an invisible font. I do not have any reason so far to prefer one over the other, and I do not yet have compatibility test results from ghostscript, firefox, Microsoft Edge, etc.

jbreiden · 2016-02-05T22:51:06Z

I have filed a bug with Apple. This is not publicly visible and I do not know what the response will be. Noting it here simply for future reference. radr://24533090

jbreiden · 2016-02-09T20:13:14Z

In progress testing compatibility with candidates "sharp2" and "behdad" including getting some assistance with ghostscript. So far no user visible differences between them, and the former is the smaller change. Is there general consensus to work around the Apple compatibility problem, at the expense of Evince highlight aesthetics?

jbarlow83 · 2016-02-10T09:52:50Z

@jbreiden I agree. OS X Preview is installed on ~10% of all desktop computers. Evince is just one of many PDF viewers for Linux users.

bekirserifoglu · 2016-02-10T11:58:13Z

@jbarlow83 and @jbreiden This bug also affects the Amazon Kindles. As an avid user of Amazon Kindle and Tesseract, I feel crippled now. And don't forget that all those pdfs generated with Tesseract won't work with Kindle either around the world.

jbreiden · 2016-02-10T18:30:48Z

@bekirserifoglu - can you please confirm that both proposed workarounds found in previous comments (sharp2.pdf, behdad.pdf) solve the problem on Kindle?

bekirserifoglu · 2016-02-10T19:05:27Z

@jbreiden I can confirm that both sharp and tofu fonts work great with Kindle Voyage and Preview on OS X. Feel free to mention me if you need any more testing.

jbreiden · 2016-02-10T20:26:53Z

@bekirserifoglu - Is the failure case on Kindle broken search and broken copy-paste? Or is it even worse than that?

bekirserifoglu · 2016-02-10T20:29:11Z

@jbreiden Kindle just treats the pdf as a non-ocr'ed pdf. It is worse than OS X preview.

jbreiden · 2016-02-11T23:30:23Z

@theraysmith

Okay, I've decided. We're going to use the sharp2 font.

For various embarrassing reasons, I'd appreciate some help. Could someone please download this zip file, extract sharp2.ttf, and use it to replace pdf.ttf in the repository? The resulting file should still be called pdf.ttf. I apologize for not doing this myself and promise to get my act together with respect to GitHub pull requests in the future.

sharp2.zip

https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf

As discussed at length in issue tesseract-ocr#182, the existing pdf.ttf causes difficulties for certain PDF viewers, in part because the old file had zero advance width. With testing, sharp2.ttf seems to be the best available compromise, although it's not perfect and causes some visual difficulties in Evince. It does seem to fix Kindle and OS X Preview.

jbarlow83 · 2016-02-11T23:48:53Z

Done. PR #220.


iikka-v · 2016-02-13T08:54:32Z

Thank you for fixing this.

amitdo · 2016-02-13T09:08:23Z

This was a short discussion... :-)


jbarlow83 mentioned this issue Jan 4, 2016

OCRmyPDF fails to detect text on pages created by Tesseract 3.04 ocrmypdf/OCRmyPDF#26

Closed

jbarlow83 mentioned this issue Feb 11, 2016

Replace pdf.ttf with sharp2.ttf, keep name the same #220

Merged

zdenop closed this as completed Feb 12, 2016

zuphilip mentioned this issue May 17, 2016

Check hocr-pdf for possible update ocropus/hocr-tools#7

Open

amitdo added bug PDF labels May 27, 2016

amitdo mentioned this issue Jul 21, 2018

White glyphs when selecting ocr-text in Evince ocrmypdf/OCRmyPDF#249

Closed
