Some programs can't find OCR text in Tesseract's PDFs (3.04) #182

Closed
jbarlow83 opened this issue Jan 4, 2016 · 56 comments

Comments

@jbarlow83

While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.

pdftotext produces empty output.
Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched.
PyPDF2 extractText also produces an empty string as text.

@amitdo
Collaborator

amitdo commented Jan 4, 2016

See #170.

@jbarlow83
Author

#170 might be related, but the files I checked did not have tilted or skewed text.

Input file:
linnsequencer

Output:
linn.pdf

tesseract version
tesseract 3.04.00 leptonica-1.72 libjpeg 8d : libpng 1.6.19 : libtiff 4.0.6 : zlib 1.2.5

@amitdo
Collaborator

amitdo commented Jan 4, 2016

Chromium's pdf reader output (cut&paste):

182-chromium.txt

@amitdo
Collaborator

amitdo commented Jan 4, 2016

I ran Tesseract (latest commit from the repo) with your jpg image.

tesseract i182.jpg i182 -l eng txt pdf hocr

Evince's output (cut&paste):

182-evince.txt

Evince is based on Poppler.

@amitdo
Collaborator

amitdo commented Jan 4, 2016

Here are the output files...

i182.pdf
i182.txt
i182-hocr.zip

@jbarlow83
Author

Chrome's PDF reader works for me.

I have poppler 0.39.0 installed (homebrew/OS X/El Capitan).

I believe I found the reason. It appears that the readers that struggle with it do not support Tesseract's usage of hexadecimal code points rather than literal characters in the output stream.

The PDF content stream (which uses PostScript-like syntax) that Tesseract generates for this page encodes the first word, "The", as follows:

 Tz [ <0054><0068><0065> ] TJ  

where <0054> = U+0054 = T, <0068> = U+0068 = h, etc. I have run into other situations where this hexadecimal notation causes parsing difficulties for some PDF readers.
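
To illustrate the notation described above, here is a minimal Python sketch that decodes the operand of a TJ operator written as four-digit hex groups. It assumes each group is a two-byte code equal to the Unicode code point, as in the excerpt. In a general PDF the codes pass through the font's encoding or ToUnicode map, so this is not a general text extractor. The function name `decode_hex_tj` is hypothetical.

```python
import re

def decode_hex_tj(operand):
    """Decode '<0054><0068>' style groups, assuming code == Unicode code point."""
    chars = []
    for group in re.findall(r"<([0-9A-Fa-f]+)>", operand):
        # One hex string may hold several 2-byte codes; split into 4-digit chunks.
        for i in range(0, len(group), 4):
            chars.append(chr(int(group[i:i + 4], 16)))
    return "".join(chars)

print(decode_hex_tj("[ <0054><0068><0065> ] TJ"))  # The
```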

Acrobat generates the equivalent segment with ASCII literals.

[...omitted...] Tm
(The )Tj

Longer excerpts for comparison:

Tesseract

BT    
3 Tr 1 0 0 1 211.68 744 Tm /f-0-0 21 Tf 117.334 Tz [ <0054><0068><0065> ] TJ  

Acrobat

BT
0.196 0.184 0.188 rg
/T1_0 1 Tf
-0.035 Tc 3 Tr 23.4905 0 0 23.7001 211.43 744.24 Tm
(The )Tj
ET

@amitdo
Collaborator

amitdo commented Jan 4, 2016

Did you see my 2 last comments?
The latest commit from the repo produces better pdf results than version 3.04.

@jbarlow83
Author

Yes. Preview and poppler are still incapable of reading your i182.pdf. I observed no difference.

My comparison didn't address how Acrobat handles Unicode, and Unicode literals cannot appear in PostScript, so I checked how this is done. When Acrobat encodes a Unicode string, it uses UTF-16 big-endian code points in hexadecimal, like this:

... Tm
<4E8B5F97771F5BF9770B89C152A066F4591A5C11>Tj

That string encodes 10 characters, all in the Basic Multilingual Plane (so each takes a single 16-bit code unit), which are these:
事得真对看见加更多少
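
The hex string above can be checked with a few lines of Python. This is a sketch that treats the string as UTF-16 big-endian, as described in the comment. The helper `to_hex_string` is a hypothetical name, and it only builds the operand text, not a complete content stream.

```python
hex_string = "4E8B5F97771F5BF9770B89C152A066F4591A5C11"

# Interpret the hex digits as raw bytes, then as UTF-16 big-endian text.
text = bytes.fromhex(hex_string).decode("utf-16-be")
print(text)       # 事得真对看见加更多少
print(len(text))  # 10

def to_hex_string(s):
    """Build the <...> hex string operand for a Tj operator from a Python string."""
    return "<" + s.encode("utf-16-be").hex().upper() + ">"

# Round trip: encoding the decoded text gives back the original operand.
print(to_hex_string(text) == "<" + hex_string + ">")  # True
```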

So it appears that Tesseract's method of encoding text strings is nonstandard. I checked the PDF 1.7 reference manual, and couldn't find an example matching Tesseract's output syntax.

@amitdo
Collaborator

amitdo commented Jan 5, 2016

My libpoppler version is 0.24.5. Ubuntu 14.04.

pdftotext i182.pdf i182t.txt

Here is the pdftotext output:
i182t.txt

@amitdo
Collaborator

amitdo commented Jan 5, 2016

cc: @jbreiden
jbreiden wrote Tesseract's pdf renderer code.

@jbarlow83
Author

Okay, for some reason pdftotext will not output to stdout but will produce a valid text file for the files we've been working on. My quick guess is that pdftotext suppresses its stdout if high ASCII characters are present, which tesseract finds here (some n-dashes and smart quotes). Both poppler 0.24.5 and 0.34 behave as expected when asked to save to a file, so the text stream is accessible to pdftotext. In short, poppler is working fine for me.

That said, OS X Preview and parsers like PyPDF2 still struggle with how Tesseract encodes text, as far as I can tell.

I checked that reportlab also encodes text strings in the manner of Acrobat, and Preview has no problems with PDFs produced by Tesseract -> hOCR -> reportlab PDF. This is an example of such a file:

linn_hocr_unc.pdf

@amitdo
Collaborator

amitdo commented Jan 5, 2016

From https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/XllxjvK5HtU

Jeff Breidenbach 7/17/15
PROBLEM #2: PDF
I was looking at a PDF problem report and noticed that Tesseract PDF output
is no longer validating. (It fails qpdf --check). As the author of the pdf module,
I'm biased, but producing corrupt data is a disaster and I think we need to cut
a new release once it is figured out. Most PDF viewers will recover and silently
ignore, but this is no good at all. I wonder what happened.

@amitdo
Collaborator

amitdo commented Jan 5, 2016

Try this to output to stdout:

pdftotext i182.pdf -
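
For scripted checks, the same invocation can be wrapped in Python. This is a sketch that assumes poppler's `pdftotext` binary is installed and on PATH. The helper names are hypothetical. The trailing `-` is what sends the text to stdout; without an output argument, pdftotext writes a `.txt` file next to the input instead.

```python
import subprocess

def pdftotext_command(pdf_path):
    # "-" as the output file name sends the extracted text to stdout.
    return ["pdftotext", pdf_path, "-"]

def extract_text(pdf_path):
    # Assumes the pdftotext binary from poppler is installed and on PATH.
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    # Decode leniently so stray bytes do not abort the check.
    return result.stdout.decode("utf-8", errors="replace")

# Example: print(extract_text("i182.pdf"))
```

An empty string from `extract_text` would then indicate that no text layer was found.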

Jeff mentioned qpdf.
Links:
http://qpdf.sourceforge.net
https://github.com/qpdf/qpdf

@jbarlow83
Author

qpdf says it's okay, but it doesn't check everything.

@jbreiden
Contributor

jbreiden commented Jan 5, 2016

I might not have time to take a look until Wednesday. Validators of various flavors include jhove, jhove-pdf-a, pdfbox, ITextRUPS, and http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx. (Note that Tesseract PDFs are not expected to be PDF/A compliant.) I did compatibility testing with Apple's Preview at design time, but I don't test against it regularly. I have never tried PyPDF2. If I had to guess right now, I'd suspect the invisible font improvement that was written for better ghostscript compatibility. It is unlikely to be the hex encoding.

https://code.google.com/p/tesseract-ocr/issues/detail?id=1434
http://bugs.ghostscript.com/show_bug.cgi?id=695869

@jbreiden
Contributor

jbreiden commented Jan 5, 2016

Looking at issue 181, it's looking more and more like Preview is unhappy with the revised glyphless font, possibly due to the zero advance width. Will try to borrow a Mac and play with it, hopefully on Wednesday.

@jbarlow83
Author

@jbreiden I agree that the glyphless font issue seems more probable.

Aside: I wouldn't trust JHOVE for PDF validation. JHOVE approving a file is better than it not approving, but its analysis is rudimentary, and in my experience it produces more false positives and negatives than useful diagnostics.

@jbreiden
Contributor

jbreiden commented Jan 6, 2016

I produced this PDF using Tesseract, then borrowed a laptop running Mac OS X version 10.10.5 and was able to both search and copy-paste from Preview (Although the copy-paste highlighting was kind of weird). My testing copy of Tesseract is not completely synchronized with GitHub, so if needed we can investigate that. How does this PDF perform for you on Preview, @jbarlow83 ?

2.pdf

There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince. I don't notice any compatibility differences at all, but mentioning in case someone wants to play with it. Have not checked compatibility with Ghostscript.

https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true

Finally, this was my test image (I was actually using TIFF but GitHub doesn't let me attach that)

relativity

@jbarlow83
Author

Doesn't work in Preview on OS X 10.11.2 (it highlights properly, but there is no copy-paste or search). I have access to two other OS X machines and will check those later today.

I checked on my iPhone too. Both Chrome for iOS (PDFium?) via the Gmail app and Safari struggle to highlight text (they only allow highlighting a single character) and cannot copy it.

@jbreiden
Contributor

jbreiden commented Jan 6, 2016

This one uses the alternate font that has an advance width.

alternate.pdf

@jbarlow83
Author

alternate works on OS X Preview and my iPhone.

I did notice that spaces are sometimes missing in OS X's copy and paste text, while pdftotext shows the spaces, so perhaps it's not 100% but clearly this was the main issue.

components of the relative motions of the fixed , stars with respect to the earth on the colour of thelightreachingusfromthem. Thelattereffect manifests itself in a slight displacement of the spectral lines of the light transmitted to us from

a fixed star, as compared with the position of the same spectral lines when they are produced by a terrestrial source of light (Doppler principle). The experimental arguments in favour of the Maxwell-Lorentz theory, which are at the;same time arguments in favour of the theory of rela- tivity,aretoonumeroustobesetforthhere. In reality they limit the theoretical possibilities to such an extent, that no other theory than that of Maxwell and Lorentz has been able to hold its ownwhentestedbyexperience.

But there are two classes of experimental facts hitherto obtained which can be represented in the Maxwell-Lorentz theory only by the introduction of an auxiliary hypothesis, which in itself—i.e. without making use of the theory of relativity— appears extraneous.

Itisknownthatcathoderaysandtheso-called B—rays emitted by radioactive substances consist of negatively electrified particles (electrons) of verysmallinertiaandlargevelocity. By examin- ing the deflection of these rays under the influence of electric and magnetic fields, we can study the

law of motion of these particles very exactly.

@behdad

behdad commented Jan 6, 2016

Output:
linn.pdf

For me, pdftotext outputs no text, but Evince, which also uses Poppler, correctly selects and extracts text.

@amitdo
Collaborator

amitdo commented Jan 6, 2016

@behdad, try this:

pdftotext linn.pdf -

@behdad

behdad commented Jan 6, 2016

@behdad, try this:

pdftotext linn.pdf -

Hah. My bad. Thanks :)

@jbreiden
Contributor

jbreiden commented Jan 6, 2016

I got my hands on an iPad running iOS 9.2 and reproduced the problem. On iOS/Safari I cannot search 2.pdf (Ken Sharp's font) but can search alternate.pdf (Behdad's font). It took me quite a while to figure out how to make the search controls work.

So for your immediate problem, go ahead and substitute Behdad's font for tessdata/pdf.ttf and you should be okay. We won't do that officially without a whole bunch more compatibility testing and reports, including the harder languages (Cherokee, vertical Japanese, Arabic) and additional renderers including Ghostscript and Firefox. Compatibility reports are appreciated.

https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true

Regarding the words running together on the Apple PDF renderer, that's not new. Apple PDF seems to do a worse job than everyone else at deciding word boundaries, and I've seen them screw up plenty of regular born-digital PDF files in the same way. Of course the root cause is the PDF spec itself, which does not explicitly define the concept of a word boundary. So I can't help you, but at least it isn't a regression. It's possible that Apple will get their act together a little better on this some day, but I have no reason to believe that it is on their radar.
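
Because the spec leaves word boundaries undefined, each extractor has to infer spaces from glyph positions. The following is a minimal sketch of one such heuristic. The threshold, the tuple layout, and the function name are assumptions for illustration only, and this is not a description of what Apple's or poppler's code actually does.

```python
def join_glyphs(glyphs, font_size, gap_ratio=0.25):
    """Join glyphs on one line into text, guessing where the spaces go.

    glyphs: list of (char, x_start, x_end) tuples, ordered left to right.
    """
    out = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        # Insert a space when the horizontal gap exceeds a fraction of the font size.
        if prev_end is not None and x_start - prev_end > gap_ratio * font_size:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)

glyphs = [("o", 0, 5), ("f", 5, 8), ("t", 12, 15), ("h", 15, 20), ("e", 20, 25)]
print(join_glyphs(glyphs, font_size=10))                 # of the
print(join_glyphs(glyphs, font_size=10, gap_ratio=0.5))  # ofthe
```

The same glyph positions yield different text depending on the threshold, which is one way two viewers can disagree about where the spaces are.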

@amitdo
Collaborator

amitdo commented Jan 8, 2016

There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince.

It looks terrible :(

@behdad

behdad commented Jan 8, 2016

My font has a huge advance width, because it was designed for another purpose. Someone should create one with an advance width of 1024 instead of my 20480.
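
To check which advance widths a candidate font actually declares, the `hhea` and `hmtx` tables can be read directly. The sketch below uses only the Python standard library and handles plain TrueType/OpenType files (not collections). The function name is hypothetical, and a tool such as ttx would normally be the more convenient choice.

```python
import struct

def read_advance_widths(font_bytes):
    """Return the advance widths listed in the font's hmtx table."""
    # Offset table: sfntVersion (4 bytes), numTables (2), then 6 bytes of search hints.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    tables = {}
    for i in range(num_tables):
        # Each table record: tag, checksum, offset, length (16 bytes).
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        tables[tag.decode("ascii")] = (offset, length)

    # numberOfHMetrics is the last uint16 of the 36-byte hhea table.
    hhea_offset, _ = tables["hhea"]
    num_h_metrics = struct.unpack_from(">H", font_bytes, hhea_offset + 34)[0]

    # hmtx starts with numberOfHMetrics pairs of (advanceWidth, leftSideBearing).
    hmtx_offset, _ = tables["hmtx"]
    return [
        struct.unpack_from(">Hh", font_bytes, hmtx_offset + 4 * i)[0]
        for i in range(num_h_metrics)
    ]

# Example (path as used in this thread; adjust to the local checkout):
# with open("tessdata/pdf.ttf", "rb") as f:
#     print(read_advance_widths(f.read()))
```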

@jbreiden
Contributor

jbreiden commented Jan 8, 2016

The PDF is keeping the advance width under control for Behdad's font. We're probably seeing something else. It's kind of a cute zebra pattern. You get a black underline, and black boxes in all word gaps and in some letter gaps. (Obviously evince is doing a really bad job, but this is much worse than with Ken Sharp's font, which highlights as a solid black bar.) A little hard for me to investigate, since my copy of ttx is not cooperating.

P.S. The font advance width should probably be 512 to match what we specify in the PDF. But again, I don't expect that to change anything for evince.

evince

@amitdo
Collaborator

amitdo commented Jan 8, 2016

If you search for a phrase in evince, the highlighting looks more normal.
Strange!

@iikka-v

iikka-v commented Feb 1, 2016

Yes, utterly. I ran tofu.ttf and the old pdf.ttf through Apple's font validator. Both produced errors, but tofu.ttf only one, whereas the old pdf.ttf had additional "name table usability" errors. Please post the above font files (or diffs) and I'll run them through the validator as well. Perhaps this will give some insight into the issue.

@jbreiden
Contributor

jbreiden commented Feb 1, 2016

Fonts as per request. I do not know if my modification tool (ttx) corrupts anything along the way. So far the experiments suggest that Apple software requires a contour, and a contour cosmetically messes with evince.

pdf.ttf - currently shipping font, by Ken Sharp
sharp.ttf - with advance width added

tofu.ttf - alternate font from behdad
behdad.ttf - with advance width reduced
behdad2.ttf - with contour removed

fonts.zip

@iikka-v

iikka-v commented Feb 1, 2016

Thanks. Here's the verbose error report as given by Apple's ftxvalidator (there's not really a version for 10.11, so some of this might be inaccurate). All fonts report fatal errors, and most of the errors are beyond my (admittedly limited) expertise on the subject. I hope they make more sense to you.

Uploading ftxvalidator_report.txt…

@jbreiden
Contributor

jbreiden commented Feb 1, 2016

Can you please edit that report and make it an attachment or something? The giant wall of text makes this bug harder to read.

@behdad

behdad commented Feb 2, 2016

Partially blocked by fonttools/fonttools#497

Fixed now.

@jbreiden
Contributor

jbreiden commented Feb 5, 2016

For completeness, here is Ken Sharp's font with a contour added in.

FONT
sharp2.zip

PDF
sharp2.pdf

At this point, sharp2.ttf and behdad.ttf are the only fonts compatible with Apple Preview. They both come at the cost of highlight aesthetics in evince. I think Preview is incorrect to require a contour for the glyph, and I think evince is incorrect to consider a contour when highlighting an invisible font. I do not have any reason so far to prefer one over the other, and I do not yet have compatibility test results from ghostscript, firefox, Microsoft Edge, etc.

@jbreiden
Contributor

jbreiden commented Feb 5, 2016

I have filed a bug with Apple. This is not publicly visible and I do not know what the response will be. Noting it here simply for future reference. radr://24533090

@jbreiden
Contributor

jbreiden commented Feb 9, 2016

In progress testing compatibility with candidates "sharp2" and "behdad" including getting some assistance with ghostscript. So far no user visible differences between them, and the former is the smaller change. Is there general consensus to work around the Apple compatibility problem, at the expense of Evince highlight aesthetics?

@jbarlow83
Author

@jbreiden I agree. OS X Preview is installed on ~10% of all desktop computers. Evince is just one of many PDF viewers for Linux users.

@bekirserifoglu

@jbarlow83 and @jbreiden This bug also affects Amazon Kindles. As an avid user of both Amazon Kindle and Tesseract, I feel crippled now. And don't forget that PDFs generated with Tesseract anywhere in the world won't work on a Kindle either.

@jbreiden
Contributor

@bekirserifoglu - can you please confirm that both proposed workarounds found in previous comments (sharp2.pdf, behdad.pdf) solve the problem on Kindle?

@bekirserifoglu

@jbreiden I can confirm that both the sharp and tofu fonts work great with Kindle Voyage and Preview on OS X. Feel free to mention me if you need any more testing.

@jbreiden
Contributor

@bekirserifoglu - Is the failure case on Kindle broken search and broken copy-paste? Or is it even worse than that?

@bekirserifoglu

@jbreiden Kindle just treats the pdf as a non-ocr'ed pdf. It is worse than OS X preview.

@jbreiden
Contributor

@theraysmith

Okay, I've decided. We're going to use the sharp2 font.

For various embarrassing reasons, I'd appreciate some help. Could someone please download this zip file, extract sharp2.ttf, and use it to replace pdf.ttf in the repository? The resulting file should still be called pdf.ttf. I apologize for not doing this myself and promise to get my act together with respect to GitHub pull requests in the future.

sharp2.zip

https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
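
The replacement step requested above can be scripted. This is a minimal sketch using the Python standard library. It assumes the archive contains a member named `sharp2.ttf` at its top level, and the function name is hypothetical.

```python
import zipfile
from pathlib import Path

def replace_font(zip_path, member, dest):
    """Extract `member` from the zip archive and write it over `dest`.

    Returns the number of bytes written.
    """
    with zipfile.ZipFile(zip_path) as archive:
        data = archive.read(member)
    # The destination keeps its own name (pdf.ttf); only the contents change.
    Path(dest).write_bytes(data)
    return len(data)

# Example (run from the root of a tesseract checkout; member name is an assumption):
# replace_font("sharp2.zip", "sharp2.ttf", "tessdata/pdf.ttf")
```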

jbarlow83 pushed a commit to jbarlow83/tesseract that referenced this issue Feb 11, 2016
As discussed at length in issue tesseract-ocr#182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.

With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince.  It does
seem to fix Kindle and OS X Preview.
@jbarlow83
Author

Done. PR #220.

zdenop pushed a commit that referenced this issue Feb 12, 2016
As discussed at length in issue #182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.

With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince.  It does
seem to fix Kindle and OS X Preview.
@zdenop zdenop closed this as completed Feb 12, 2016
@iikka-v

iikka-v commented Feb 13, 2016

Thank you for fixing this.

@amitdo
Collaborator

amitdo commented Feb 13, 2016

This was a short discussion... :-)

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
As discussed at length in issue tesseract-ocr#182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.

With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince.  It does
seem to fix Kindle and OS X Preview.