OCRing images written in Hebrew with diacritics is completely not working #4119
Comments
Does it work with other output formats like |
Please have a look at issue #238. Is that the same problem, related to RTL script?
It does not work with txt output either. It is just harder to tell which output corresponds to which source text, but even on close inspection it's fairly obvious that none of the words are correct. I have not tried older versions of tesseract, but I have tried the legacy engine, and it gives different but also incorrect results.
I can't tell at this point, because however you read it, the letters are just wrong.
You didn't attach the txt output. The training data used for training Hebrew with nikud marks had many issues, so it's not surprising that OCRing Hebrew images with nikud gives you bad results.
I am by no means a Hebrew expert and I have no issue reading it. This is not something unusual for uncommon or rare Hebrew books.
Current Behavior
Running tesseract on a Hebrew scan:
tesseract --oem 1 -l heb image00041.jpg image00041.jpg pdf
Copy text from the resulting PDF file and observe that the copied text is nothing like the original.
Tried with the default models installed from the Arch repos and with the tessdata_best model.
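Copying from a PDF viewer adds its own layer of RTL reordering, so it can help to inspect the recognized text directly. The following is a minimal sketch, not part of the original report: it assumes tesseract is on PATH and uses `stdout` as the output base so the text is returned without going through a PDF. The helper names `build_command`, `describe`, and `ocr_text` are hypothetical.

```python
import subprocess
import unicodedata


def build_command(image, lang="heb", oem=1):
    """Return a tesseract invocation that writes recognized text to stdout."""
    return ["tesseract", image, "stdout", "--oem", str(oem), "-l", lang]


def describe(text):
    """Return (character, code point, Unicode name) for each non-space character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isspace()
    ]


def ocr_text(image, lang="heb", oem=1):
    """Run tesseract (assumed to be installed and on PATH) and return its text."""
    result = subprocess.run(
        build_command(image, lang, oem),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Example usage (requires tesseract and the image file):
#   for row in describe(ocr_text("image00041.jpg")):
#       print(*row)
```

Listing code points this way shows whether the recognized letters themselves are wrong or only their display order, which is hard to judge from a viewer.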
Expected Behavior
OCR text should match original.
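To put a number on "match", one option is to compare the OCR output with a hand-typed reference transcription, once as is and once with the nikud removed. This is only a sketch under assumptions: the reference text must be supplied by the user, the helpers `strip_nikud` and `similarity` are hypothetical names, and `difflib` similarity is a rough stand-in for a proper character error rate.

```python
import difflib
import unicodedata


def strip_nikud(text):
    """Remove Hebrew points and cantillation marks (nonspacing marks in U+0591..U+05C7)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch
        for ch in decomposed
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )


def similarity(ocr, reference, ignore_nikud=False):
    """Return a 0..1 similarity ratio between OCR output and a reference text."""
    if ignore_nikud:
        ocr, reference = strip_nikud(ocr), strip_nikud(reference)
    return difflib.SequenceMatcher(None, ocr, reference, autojunk=False).ratio()
```

If the score is low even with `ignore_nikud=True`, the base letters are being misrecognized, not just the diacritics.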
Suggested Fix
No response
tesseract -v
tesseract 5.3.2
leptonica-1.83.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.1 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.1 zlib/1.2.13 liblzma/5.4.3 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.2.1 OpenSSL/3.1.2 zlib/1.2.13 brotli/1.0.9 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.4) libssh2/1.11.0 nghttp2/1.55.1
Operating System
No response
Other Operating System
Manjaro
uname -a
Linux Maxwell-Main 6.3.13-2-MANJARO #1 SMP PREEMPT_DYNAMIC Sun Jul 16 16:48:53 UTC 2023 x86_64 GNU/Linux
Compiler
N/A
CPU
AMD Ryzen Threadripper 2950X
Virtualization / Containers
No response
Other Information
image00041.jpg.pdf
![image00041](https://private-user-images.githubusercontent.com/1660330/261514371-1a5262b8-da40-4e0c-88ae-1e18716d2f6d.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyNzQyNzMsIm5iZiI6MTczOTI3Mzk3MywicGF0aCI6Ii8xNjYwMzMwLzI2MTUxNDM3MS0xYTUyNjJiOC1kYTQwLTRlMGMtODhhZS0xZTE4NzE2ZDJmNmQuanBnP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTFUMTEzOTMzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MmJjYTdlNjFkNTdhMWY1MGUxZTIwMGEyOTI4MDJiNWZmZmU3ZmQ1NjVhNjIyZTBhNTA0YjNlYmVmYzlmOWUxMyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.haKSPRWy8bWJWnh_J8iMXBnncESzy7ll-zYdoLnktY4)