psm 3 and psm 6 skip different parts of text based on font size #538
Comments
The result is similar with the best traineddata and current code, though there are differences between best/Devanagari, best/hin and best/san. psm 3 treats smaller text as 'diacritics' (JPG at 600 dpi).
With the same image at 300 dpi, fewer blobs are recognized as diacritics.
See the related hOCR bbox issue, with differences in processing between psm 3 and psm 6, posted in the forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/W19nmACyri8/Ipi5a-6hBQAJ
I built Tesseract 4 Beta from the git repository on Debian 64. Here is the generated hOCR file. The bbox for the word PLAINTIFF is set to 0,0,3400,4400.
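A quick way to spot this symptom in an hOCR file is to list every word whose bbox equals the page bbox. The following is only a sketch using the Python standard library. It assumes the usual hOCR class names (`ocr_page`, `ocrx_word`) and a `bbox x0 y0 x1 y1` entry in each `title` attribute. The `SAMPLE` markup is made up for illustration, and it assumes the page itself is 3400x4400, which the comment above does not state explicitly.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 0 0 3400 4400; x_wconf 90"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collect the page bbox and every (word text, bbox) pair from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.page_bbox = None
        self.words = []        # list of (text, bbox) tuples
        self._current = None   # bbox of the ocrx_word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            self.page_bbox = bbox
        elif "ocrx_word" in classes:
            self._current = bbox
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # A word span ends at its closing </span>; inner tags such as <strong> are ignored.
        if tag == "span" and self._current is not None:
            self.words.append(("".join(self._text).strip(), self._current))
            self._current = None


def full_page_words(hocr):
    """Return the words whose bbox is identical to the page bbox."""
    parser = HocrBoxes()
    parser.feed(hocr)
    parser.close()
    return [text for text, bbox in parser.words if bbox == parser.page_bbox]


# Illustrative markup only, not the actual file from the report.
SAMPLE = """
<div class='ocr_page' id='page_1' title='image "in.png"; bbox 0 0 3400 4400; ppageno 0'>
 <span class='ocr_line' title='bbox 300 500 1200 560'>
  <span class='ocrx_word' id='word_1_1' title='bbox 0 0 3400 4400; x_wconf 90'>PLAINTIFF</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 700 500 900 560; x_wconf 92'>vs</span>
 </span>
</div>
"""

if __name__ == "__main__":
    print(full_page_words(SAMPLE))
```

Running the same check on the hOCR output for psm 3 and psm 6 should show whether the oversized bbox appears in only one of the two modes.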
Still same result...
In an image with Hindi text in various fonts, some of it at a very large size:
psm 3 - recognizes the text at large font sizes
psm 6 - recognizes the text at smaller font sizes
The input image and output files are attached.
![sample6](https://cloud.githubusercontent.com/assets/5095331/20925445/04033a88-bbdd-11e6-861b-100e33bf5177.jpg)
sample6-psm3.txt
sample6-psm6.txt
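To reproduce the comparison, something like the sketch below could run both page segmentation modes and list the lines that only one of them produced. It assumes the `tesseract` binary is on the PATH and accepts `--psm` (Tesseract 4 syntax), and that the `hin` model is the one wanted here; the report also mentions `Devanagari` and `san`. The helper names `tesseract_command`, `unique_lines` and `compare_psm` are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, out_base, psm, lang="hin"):
    """Build the CLI call: tesseract <image> <outputbase> --psm N -l LANG."""
    return ["tesseract", str(image), str(out_base), "--psm", str(psm), "-l", lang]


def unique_lines(text_a, text_b):
    """Return (lines only in text_a, lines only in text_b), ignoring blank lines."""
    lines_a = [line.strip() for line in text_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in text_b.splitlines() if line.strip()]
    set_a, set_b = set(lines_a), set(lines_b)
    only_a = [line for line in lines_a if line not in set_b]
    only_b = [line for line in lines_b if line not in set_a]
    return only_a, only_b


def compare_psm(image, lang="hin"):
    """Run psm 3 and psm 6 on one image and compare the recognized lines."""
    image = Path(image)
    texts = {}
    for psm in (3, 6):
        # Output base such as sample6-psm3; tesseract appends .txt itself.
        base = image.with_name(f"{image.stem}-psm{psm}")
        subprocess.run(tesseract_command(image, base, psm, lang), check=True)
        texts[psm] = Path(str(base) + ".txt").read_text(encoding="utf-8")
    return unique_lines(texts[3], texts[6])


if __name__ == "__main__":
    only_psm3, only_psm6 = compare_psm("sample6.jpg")
    print("only in psm 3:", *only_psm3, sep="\n")
    print("only in psm 6:", *only_psm6, sep="\n")
```

An exact line match is a crude comparison: a line that both modes find but recognize slightly differently will show up on both sides, so the output is a starting point for inspection rather than a precise diff.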