tesseract 4.00.00alpha image ocr correct, final PDF not #178
Comments
Please make sure you have ocrmypdf 5.3 and try the following
ocrmypdf's |
Hi there, sorry for the delay.
The sandwich renderer as well as pdfsandwich produce mirrored text, though there are some differences as to what happens when you highlight/select text in a PDF reader (Evince/Document Viewer in my case).
P.S. How can I easily add the tesseract-heb package to your docker.tess4 image, i.e. without rebuilding? I would typically attach to a running container, install some additional packages, and save/clone. Your image, of course, exits immediately...
How does this one look to you? It's possible that the PDF readers cannot display mirrored text or Unicode correctly. It could also be that Ghostscript messes up the right-to-left text. Try disabling it
pdftotext seems to display the text correctly in my terminal:
To add a new language package you could do the following:
That gets you a terminal inside the Docker container. Now
to make a copy of all tessdata on your local machine at Then add any compatible Hebrew language files. Run docker by adding the modified volume to the list
so it will replace the Docker image's tessdata with files from a local directory. You cannot mix tess3/4 files; you have to download the appropriate files for tess4.
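The pdftotext check mentioned in this comment can be scripted if you want to test several output files. The following is only a rough sketch, not part of ocrmypdf: the helper names `extract_text` and `hebrew_fraction` are made up for illustration, and it assumes `pdftotext` (from poppler-utils) is on your PATH. It reports what share of the letters in the PDF's text layer fall in the Unicode Hebrew block, which should be close to 1.0 for a mostly Hebrew page and close to 0.0 if the text layer holds only placeholder glyphs.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return the text layer of a PDF by calling pdftotext.

    Assumes the pdftotext binary is installed; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def hebrew_fraction(text: str) -> float:
    """Fraction of alphabetic characters in the Hebrew block (U+0590 to U+05FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if "\u0590" <= ch <= "\u05FF")
    return hebrew / len(letters)
```

For example, `hebrew_fraction(extract_text("output.pdf"))` (with `output.pdf` standing in for your own file) gives a quick number to compare between renderers. Note that this only checks which characters are present; it does not detect whether the words are stored in mirrored order.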
Closing due to lack of response/inability to reproduce.
I'm testing a simple single page in which the text is mostly in Hebrew. I can see that the hOCR working files contain OCR'd text (with some mistakes, but mostly OK), but the PDF produced does not seem to contain any text (just default Unicode glyphs). What could be the problem?
See the attached screenshots (the highlighted word is shown and copied as a dummy Unicode string).
The 2nd image is the hOCR file created for the 1st image.
Invocation: