Some programs can't find OCR text in Tesseract's PDFs (3.04) #182
Comments
See #170. |
Chromium's pdf reader output (cut&paste): |
I run Tesseract (latest commit from the repo) with your jpg image.
Evince's output (cut&paste): Evince is based on Poppler. |
Here are the output files... |
Chrome's PDF reader works for me. I have poppler 0.39.0 installed (homebrew/OS X/El Capitan).
I believe I found the reason. It appears that the readers that struggle with it do not support Tesseract's usage of hexadecimal code points rather than literal characters in the output stream. The PostScript content stream for this page as generated by Tesseract for the first word, "The", appears as follows:
Tz [ <0054><0068><0065> ] TJ
where <0054> = U+0054 = T, <0068> = U+0068 = h, etc. I have run into other situations where this hexadecimal notation causes parsing difficulties for some PDF readers. Acrobat generates the equivalent segment with ASCII literals.
[...omitted...] Tm
(The )Tj
Longer excerpts for comparison:
Tesseract
BT
3 Tr 1 0 0 1 211.68 744 Tm /f-0-0 21 Tf 117.334 Tz [ <0054><0068><0065> ] TJ
Acrobat
BT
0.196 0.184 0.188 rg
/T1_0 1 Tf
-0.035 Tc 3 Tr 23.4905 0 0 23.7001 211.43 744.24 Tm
(The )Tj
ET |
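To make the two notations easier to compare, here is a minimal Python sketch that rewrites a literal word into the hexadecimal form quoted above. It assumes, as the comment does, that each four-digit group is simply the Unicode code point of one character; in a real PDF the mapping from codes to characters goes through the font's encoding and ToUnicode data. The helper name `to_hex_tj` is hypothetical and is not part of Tesseract.

```python
def to_hex_tj(text: str) -> str:
    """Render text as a TJ array of 4-digit hex strings, one per character.

    Assumption: every character is in the Basic Multilingual Plane, so one
    16-bit group per character is enough.
    """
    groups = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError(f"U+{code:X} does not fit in one 16-bit group")
        groups.append(f"<{code:04X}>")
    return "[ " + "".join(groups) + " ] TJ"


print(to_hex_tj("The"))
```

For the word "The" this yields the same operand as the Tesseract excerpt, which shows that both streams carry the same text and differ only in notation.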
Did you see my 2 last comments? |
Yes. Preview and poppler are still incapable of reading your i182.pdf. I observed no difference. My comparison didn't address how Acrobat handles Unicode, and Unicode literals cannot appear in PostScript, so I checked how this is done. When Acrobat encodes a Unicode string it uses UTF-16 big-endian code points in hexadecimal, like this:
... Tm
<4E8B5F97771F5BF9770B89C152A066F4591A5C11>Tj
That string encodes 10 characters, all below U+7FFF. So it appears that Tesseract's method of encoding text strings is nonstandard. I checked the PDF 1.7 reference manual, and couldn't find an example matching Tesseract's output syntax. |
My libpoppler version is 0.24.5. Ubuntu 14.04.
Here is the pdftotext output: |
cc: @jbreiden |
Okay, for some reason pdftotext will not output to stdout but will produce a valid text file for the files we've been working on. My quick guess is that pdftotext suppresses its stdout if high ASCII characters are present, which tesseract finds here (some n-dashes and smart quotes). Both poppler 0.24.5 and 0.34 behave as expected when asked to save to a file, so the text stream is accessible to pdftotext. In short, poppler is working fine for me. That said, OS X Preview and parsers like PyPDF2 still struggle with how Tesseract encodes text, as far as I can tell. I checked that reportlab also encodes text strings in the manner of Acrobat, and Preview has no problems with PDFs produced by Tesseract -> hOCR -> reportlab PDF. This is an example of such a file: |
From https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/XllxjvK5HtU
|
Try this to output to stdout:
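One way to get the text on stdout, sketched in Python: poppler's pdftotext treats `-` as the output file name for standard output. The wrapper names below are hypothetical, and the sketch assumes `pdftotext` is on the PATH and that the output is UTF-8.

```python
import subprocess


def pdftotext_stdout_cmd(pdf_path):
    """Build the pdftotext command line; '-' as output file means stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return whatever it writes to stdout."""
    result = subprocess.run(
        pdftotext_stdout_cmd(pdf_path), capture_output=True, check=True
    )
    # Assumption: UTF-8 output; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("i182.pdf")` and checking for an empty result is a quick way to tell whether poppler sees a text layer at all.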
Jeff mentioned qpdf. |
qpdf says it's okay, but it doesn't check everything.
|
I might not have time to take a look until Wednesday. Validators of various flavors include jhove, jhove-pdf-a, pdfbox, ITextRUPS, and http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx. (Note that Tesseract PDFs are not expected to be PDF/A compliant.) I did compatibility testing with Apple's Preview at design time, but I don't test against it regularly. Never tried PyPDF2. If I had to guess right now, I'd suspect it might be the invisible font improvement that was written for better ghostscript compatibility. Unlikely to be the hex encoding. https://code.google.com/p/tesseract-ocr/issues/detail?id=1434 |
Looking at issue 181, it's looking more and more like Preview is unhappy with the revised glyphless font, possibly due to the zero advance width. Will try to borrow a Mac and play with it, hopefully on Wednesday. |
@jbreiden I agree that the glyphless font issue seems more probable. Aside: I wouldn't trust JHOVE for PDF validation. For JHOVE to approve is better than not approving, but its analysis is rudimentary, and in my experience it produces more false positives and negatives than useful diagnostics. |
I produced this PDF using Tesseract, then borrowed a laptop running Mac OS X version 10.10.5 and was able to both search and copy-paste from Preview (Although the copy-paste highlighting was kind of weird). My testing copy of Tesseract is not completely synchronized with GitHub, so if needed we can investigate that. How does this PDF perform for you on Preview, @jbarlow83 ? There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince. I don't notice any compatibility differences at all, but mentioning in case someone wants to play with it. Have not checked compatibility with Ghostscript. https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true Finally, this was my test image (I was actually using TIFF but GitHub doesn't let me attach that) |
Doesn't work in Preview OS X 10.11.2 (highlights properly, but no copy-paste or search). I have access to two other OS X machines and will check those later today. I checked with my iPhone too. Both Chrome iOS (PDFium?) via Gmail app and Safari struggle to highlight text (they only allow highlighting a single character) and cannot copy. |
This one uses the alternate font that has an advance width. |
The alternate works on OS X Preview and my iPhone. I did notice that spaces are sometimes missing in OS X's copy-and-paste text, while pdftotext shows the spaces, so perhaps it's not 100% but clearly this was the main issue.
|
For me, pdftotext outputs no text, but Evince, which also uses Poppler, correctly selects and extracts text. |
@behdad, try this:
|
Hah. My bad. Thanks :) |
I got my hands on an iPad running iOS 9.2 and reproduced the problem. On iOS/Safari I cannot search 2.pdf (Ken Sharp's font) but can search with alternate.pdf (Behdad's font). Took me quite a while to figure out how to make the search controls work. So for your immediate problem, go ahead and substitute Behdad's font for tessdata/pdf.ttf and you should be okay. We won't do that officially without a whole bunch more compatibility testing and reports, including the harder languages (Cherokee, vertical Japanese, Arabic) and additional renderers including Ghostscript and Firefox. Compatibility reports are appreciated. https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true Regarding the words running together on the Apple PDF renderer, that's not new. Apple PDF seems to do a worse job than everyone else at deciding word boundaries, and I've seen them screw up plenty of regular born-digital PDF files in the same way. Of course the root cause is the PDF spec itself, which does not explicitly define the concept of a word boundary. So I can't help you, but at least it isn't a regression. It's possible that Apple will get their act together a little better on this some day, but I have no reason to believe that it is on their radar. |
It looks terrible :( |
My font has a huge advance width, because it was designed for another purpose. Someone should create one with an advance width of 1024 instead of my 20480. |
The PDF is keeping the advance width under control for Behdad's font. We're probably seeing something else. It's kind of a cute zebra pattern. You get a black underline, and black boxes in all word gaps and in some letter gaps. (Obviously evince is doing a really bad job, but this is much worse than with Ken Sharp's font, which highlights as a solid black bar.) A little hard for me to investigate, since my copy of ttx is not cooperating. P.S. The font advance width should probably be 512 to match what we specify in the PDF. But again, I don't expect that to change anything for evince. |
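For anyone whose ttx is also not cooperating, the advance widths can be read straight from the font file with the standard library. This is a minimal sketch, assuming a plain single-font TrueType file with `hhea` and `hmtx` tables (not a collection), and it does no validation; the function name `read_advance_widths` is hypothetical.

```python
import struct


def read_advance_widths(font_bytes):
    """Return the advance widths stored in the hmtx table of a TrueType font."""
    # Offset table: sfntVersion (4 bytes), numTables (2), then 6 bytes we skip.
    num_tables = struct.unpack(">H", font_bytes[4:6])[0]
    tables = {}
    for i in range(num_tables):
        rec = 12 + 16 * i  # each table record is 16 bytes
        tag, _checksum, offset, length = struct.unpack(
            ">4sIII", font_bytes[rec:rec + 16]
        )
        tables[tag.decode("ascii")] = (offset, length)

    # numberOfHMetrics is the last uint16 of the 36-byte hhea table.
    hhea_off, _ = tables["hhea"]
    num_h_metrics = struct.unpack(">H", font_bytes[hhea_off + 34:hhea_off + 36])[0]

    # hmtx starts with numberOfHMetrics pairs of (advanceWidth, leftSideBearing).
    hmtx_off, _ = tables["hmtx"]
    widths = []
    for i in range(num_h_metrics):
        start = hmtx_off + 4 * i
        advance, _lsb = struct.unpack(">Hh", font_bytes[start:start + 4])
        widths.append(advance)
    return widths


# Example use (path taken from this thread, adjust as needed):
# with open("tessdata/pdf.ttf", "rb") as f:
#     print(read_advance_widths(f.read()))
```

Running it on each candidate font should show directly whether a glyph has a zero advance width or a very large one.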
If you search for a phrase in evince, the highlighting looks more normal. |
Yes, utterly. I ran the tofu.ttf and the old pdf.ttf through Apple's font validator. Both produced errors, but tofu.ttf only one, whereas the old pdf.ttf had additional "name table usability" errors. Please post the above font files (or diffs) and I'll run them through the validator as well. Perhaps this will give some insight into the issue. |
Fonts as per request. I do not know if my modification tool (ttx) corrupts anything along the way. So far the experiments suggest that Apple software requires a contour, and a contour cosmetically messes with evince. pdf.ttf - currently shipping font, by Ken Sharp tofu.ttf - alternate font from behdad |
Thanks. Here's the verbose error report as given by Apple's ftxvalidator (there's not really a version for 10.11, so some of this might be inaccurate). All report fatal errors and most errors are beyond my (admittedly limited) expertise on the subject. I hope they make more sense to you. |
Can you please edit that report and make it an attachment or something? The giant wall of text makes this bug harder to read. |
Fixed now. |
For completeness, here is Ken Sharp's font with a contour added in. FONT PDF At this point, sharp2.ttf and behdad.ttf are the only fonts compatible with Apple Preview. They both come at the cost of highlight aesthetics with evince. I think Preview is incorrect to require a contour for the glyph, and I think evince is incorrect to consider a contour when highlighting an invisible font. I do not have any reason so far to prefer one over the other, and I do not yet have compatibility test results from ghostscript, firefox, Microsoft Edge, etc. |
I have filed a bug with Apple. This is not publicly visible and I do not know what the response will be. Noting it here simply for future reference. radr://24533090 |
In progress testing compatibility with candidates "sharp2" and "behdad" including getting some assistance with ghostscript. So far no user visible differences between them, and the former is the smaller change. Is there general consensus to work around the Apple compatibility problem, at the expense of Evince highlight aesthetics? |
@jbreiden I agree. OS X Preview is installed on ~10% of all desktop computers. Evince is just one of many PDF viewers for Linux users. |
@jbarlow83 and @jbreiden This bug also affects Amazon Kindles. As an avid user of Amazon Kindle and Tesseract, I feel crippled now. And don't forget that all those PDFs generated with Tesseract around the world won't work with Kindle either. |
@bekirserifoglu - can you please confirm that both proposed workarounds found in previous comments (sharp2.pdf, behdad.pdf) solve the problem on Kindle? |
@jbreiden I can confirm that both sharp and tofu fonts work great with Kindle Voyage and Preview on OS X. Feel free to mention me if you need any more testing. |
@bekirserifoglu - Is the failure case on Kindle broken search and broken copy-paste? Or is it even worse than that? |
@jbreiden Kindle just treats the pdf as a non-ocr'ed pdf. It is worse than OS X preview. |
Okay, I've decided. We're going to use the sharp2 font. For various embarrassing reasons, I'd appreciate some help. Could someone please download this zip file, extract sharp2.ttf, and use it to replace pdf.ttf in the repository? The resulting file should still be called pdf.ttf. I apologize for not doing this myself and promise to get my act together with respect to GitHub pull requests in the future. https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf |
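The replacement step described above can be sketched in Python with the standard library. The archive path is not given in this thread, so `zip_path` is a placeholder supplied by the caller, and the function name `install_font_from_zip` is hypothetical; the member name sharp2.ttf and the destination tessdata/pdf.ttf come from the comment.

```python
import zipfile
from pathlib import Path


def install_font_from_zip(zip_path, member="sharp2.ttf", dest="tessdata/pdf.ttf"):
    """Extract one font from a zip archive and save it under the name pdf.ttf."""
    dest_path = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the base name in case the archive nests the font in a folder.
        name = next(
            (n for n in archive.namelist() if Path(n).name == member), None
        )
        if name is None:
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        data = archive.read(name)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Overwrites any existing file so the result is still called pdf.ttf.
    dest_path.write_bytes(data)
    return dest_path
```

Run from the repository root, it leaves the new font at tessdata/pdf.ttf, ready to commit.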
As discussed at length in issue tesseract-ocr#182, the existing pdf.ttf causes difficulties for certain PDF viewers, in part because the old file had zero advance width. With testing, sharp2.ttf seems to be the best available compromise, although it's not perfect and causes some visual difficulties in Evince. It does seem to fix Kindle and OS X Preview.
Done. PR #220. |
Thank you for fixing this. |
This was a short discussion... :-) |
While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.
pdftotext produces empty output.
Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched.
PyPDF2 extractText also produces an empty string as text.