Debian Buster: ocrmypdf outdated #46

Closed
joergmschulz opened this issue Jan 27, 2021 · 6 comments · Fixed by #47
Labels
documentation Improvements or additions to documentation

Comments

@joergmschulz
Contributor

joergmschulz commented Jan 27, 2021

This wonderful tool may not be usable on Debian Buster, which ships ocrmypdf 8.0.1. That version issues warnings like:

WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
When this warning occurs, the PDF is left untouched and no text layer is added.

Version 9.8.1 from Alpine works perfectly with the same input file.
Nextcloud log:

OCR for file /joerg.schulz/files/FDS Bau - Sanierung Haus Sonnenblick/Bauherr/Dokumentationen/Projektantrag-SoftwareAG2014.pdf not possible. Message: OCRmyPDF exited abnormally with exit-code 0. Message: WARNING - 4: [tesseract] lots of diacritics - possibly poor OCR WARNING - 2: [tesseract] lots of diacritics - possibly poor OCR
@R0Wi
Contributor

R0Wi commented Jan 28, 2021

Hi @joergmschulz, sorry to hear that you're experiencing issues with OCRmyPDF. First of all, we are not willing to replace the tool under the hood. We tried other tools and packages earlier to achieve the same result, which led to even more problems. Generally speaking, we found OCRmyPDF to be the best tool for what this app does with PDF files.

In your case I see the following options:

  1. Try installing a newer version of OCRmyPDF from outside the regular package source. You could try the Python installation mentioned here for Ubuntu 18.04.
  2. In our app we could simply ignore warnings in general. Although that is quite dangerous in my opinion, you could check whether you can process the mentioned file "by hand" by invoking ocrmypdf input.pdf output.pdf on the command line. If it only outputs some warnings but the output is generated properly, we could think about more fault-tolerant handling inside the app. Please give us some feedback on this, or attach the mentioned PDF file if possible.
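
To make the second option more concrete, here is a minimal Python sketch of what "fault-tolerant handling" could look like: success is judged by the exit code alone, and warning lines on stderr are collected for logging instead of being treated as a failure. This is only an illustration, not the app's actual code. The names `run_ocr` and `OcrResult` are hypothetical, and matching on the substring "WARNING" is an assumption based on the log output quoted in this issue.

```python
import subprocess
from typing import List, NamedTuple


class OcrResult(NamedTuple):
    ok: bool             # True when the process exited with code 0
    warnings: List[str]  # stderr lines that contain "WARNING"
    stderr: str          # full stderr output, kept for logging


def run_ocr(command: List[str]) -> OcrResult:
    """Run an OCR command and judge success by exit code, not by stderr."""
    proc = subprocess.run(command, capture_output=True, text=True)
    # Assumption: warning lines look like the ones quoted in this issue,
    # e.g. "WARNING -    2: [tesseract] lots of diacritics - possibly poor OCR".
    warnings = [
        line.strip()
        for line in proc.stderr.splitlines()
        if "WARNING" in line
    ]
    return OcrResult(ok=proc.returncode == 0, warnings=warnings, stderr=proc.stderr)


# Example call (requires ocrmypdf on the PATH):
#   result = run_ocr(["ocrmypdf", "input.pdf", "output.pdf"])
#   if result.ok:
#       log result.warnings and keep output.pdf
#   else:
#       report result.stderr as an error
```

The trade-off is the one mentioned above: a run that exits with 0 but produces poor OCR would be accepted silently, so the collected warnings should at least end up in the log.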

@bahnwaerter anything to add on this?

Btw, I'm also using Debian Buster with OCRmyPDF 8.0.1 installed and have not seen similar errors, so it might also be related to the PDF files you want to process.

@joergmschulz
Contributor Author

joergmschulz commented Jan 28, 2021

Attaching a file here doesn't work, but it is available at: https://cloud.faudin.de/s/e2AbYxXR9njcZRC , password dddjjj

messages:

ocrmypdf --force-ocr /data/nc/joerg.schulz/files/Documents/Pferdezüchter\ Jens.pdf /tmp/ppp.pdf 
   INFO - Optimize ratio: 1.14 savings: 12.1%
   INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ ocrmypdf --version
8.0.1+dfsg
www-data@c:~$ ocrmypdf  /data/nc/joerg.schulz/files/Documents/Projektantrag_SoftwareAG_2014.pdf /tmp/ppp.pdf 
WARNING -    4: [tesseract] lots of diacritics - possibly poor OCR
WARNING -    2: [tesseract] lots of diacritics - possibly poor OCR
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
www-data@c:~$ php -f /var/www/nc/cron.php

@joergmschulz
Contributor Author

joergmschulz commented Jan 28, 2021

Confirming: when I install OCRmyPDF as documented by @R0Wi above, everything works perfectly. Maybe that should go into the README?
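
If it helps for the README or a pre-flight check, here is a small Python sketch that parses the output of `ocrmypdf --version` (for example `8.0.1+dfsg`, as shown above) and compares it against a known-good version. The threshold 9.8.1 is an assumption: it is simply the Alpine version reported as working in this issue, not a documented minimum. The function names are hypothetical.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from output such as '8.0.1+dfsg'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


# Assumption: 9.8.1 is used only because it was reported as working here.
KNOWN_GOOD = (9, 8, 1)


def is_older_than_known_good(version_output: str) -> bool:
    """Return True when the reported version is below KNOWN_GOOD."""
    return parse_version(version_output) < KNOWN_GOOD
```

With the values from this thread, `is_older_than_known_good("8.0.1+dfsg")` is true for the Debian Buster package, while `"9.8.1"` is not flagged.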

@R0Wi
Contributor

R0Wi commented Jan 28, 2021

Glad to hear that everything works as expected now. I'll leave this open until we've added the info to the README :-)

@R0Wi R0Wi added the documentation Improvements or additions to documentation label Jan 28, 2021
@joergmschulz
Contributor Author

see #47

@R0Wi R0Wi linked a pull request Jan 28, 2021 that will close this issue
@R0Wi
Contributor

R0Wi commented Jan 28, 2021

Thanks for the PR @joergmschulz !
