Latin.traineddata(best) - Words missing in OCR #1080
Comments
Known problem for all langs/scripts |
In your case the problem is probably in the layout analysis stage. |
I've another example:
tesseract test/1502442621178.png test/1502442621178 -l Latin
Here a word in the middle of the sentence is skipped. |
Try cutting the non-text areas with gimp and retest. |
Only the red areas are processed by tesseract, they are written to separate png-files. (I've uploaded the separate files as well and the resulting output-files) |
Which word? Might be related to #681 in this case. |
Pöllan. The ö exists in the unicharset. Thank you for the link to #681. |
Try to debug with stopper_debug_level=2 https://github.com/tesseract-ocr/tesseract/blob/3ec11bd37a56/ccmain/linerec.cpp#L293 |
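A minimal sketch of one way to pass that variable on the command line, assuming the standard `-c VAR=VALUE` option of the tesseract CLI. The helper names `build_debug_command` and `run_with_debug` are hypothetical, and reading the debug lines from stderr is an assumption about where the build writes them.

```python
import subprocess


def build_debug_command(image, output_base, lang="Latin", debug_level=2):
    """Build a tesseract argv list that sets stopper_debug_level via -c."""
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "-c", f"stopper_debug_level={debug_level}",
    ]


def run_with_debug(image, output_base, lang="Latin"):
    """Run tesseract and return its captured stderr text.

    Assumption: the "Best choice certainty=..." lines are written to stderr.
    """
    cmd = build_debug_command(image, output_base, lang)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stderr
```

Calling `run_with_debug("test/1502442621178.png", "test/1502442621178")` needs a tesseract binary on the PATH and the Latin traineddata installed.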
Best choice certainty=-2.96366, space=-0.206939, scaled=-20.7456, final=-20.7456 |
Best choice certainty=-0.106016, space=-2.97242, scaled=-20.8069, final=-20.8069 |
So it does recognize them, but still decides to drop them... |
'Pöllan' is dropped because it's not in the dictionary and the 'ö' has low certainty. |
'USA' shares the same low space certainty with '/NORDKOREA' but escapes from punishment because it's in the dictionary. |
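The decision described above can be sketched roughly as follows. The scale factor of 7 can be checked against the two debug lines (the smaller of certainty and space, times 7, gives the logged `scaled` value). The two thresholds are assumptions recalled from `linerec.cpp` and are not stated in this thread, so treat this as an illustration and not as the actual implementation.

```python
# Scale factor: consistent with the logged values, e.g. -2.96366 * 7 = -20.74562.
CERTAINTY_SCALE = 7.0
# Assumption: minimum scaled certainty for a word to be kept unconditionally.
MIN_CERTAINTY = -20.0
# Assumption: more lenient limit that applies only to dictionary words.
WORST_DICT_CERTAINTY = -25.0


def scaled_certainty(best_choice_certainty, space_certainty):
    """Scale the worse of the two certainties, as in the debug output."""
    return min(best_choice_certainty, space_certainty) * CERTAINTY_SCALE


def is_kept(best_choice_certainty, space_certainty, in_dictionary):
    """Return True if the word would survive under the assumed thresholds."""
    score = scaled_certainty(best_choice_certainty, space_certainty)
    if score >= MIN_CERTAINTY:
        return True
    return in_dictionary and score >= WORST_DICT_CERTAINTY


# The two logged words, neither of which is in the dictionary:
print(is_kept(-2.96366, -0.206939, in_dictionary=False))
print(is_kept(-0.106016, -2.97242, in_dictionary=False))
# A dictionary word with the same low space certainty:
print(is_kept(-0.106016, -2.97242, in_dictionary=True))
```

Under these assumed thresholds both logged words fall just below the unconditional limit, while a dictionary word with the same scores stays above the more lenient one.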
@TheSeiko Please try with the changes suggested in #681 (comment) to see if you get improved recognition of these words without impacting others. |
After applying my patch, the words that were dropped before are recognized in the final text output. All other words are recognized the same as before. |
Thank you to both of you, your help is much appreciated! I'm on holiday until the end of next week. After that I'll try to compile a Windows version with the changes you suggested and test it. |
Did you try my suggestion? |
I'm still on it. |
I've been able to compile it now and am starting a test run against 50k frames. |
looks good |
I tried it with the 64-bit build as well. There I get some errors, but I don't think the fix is the cause:
E:\Tesseract-OCR4.0ab1>tesseract test/1502442621178.png stdout --oem 1 -l Latin
x86 output:
21-Jähriger geriet mit seinem Der Beifahrer des 21-Jährigen
I'll try to find the error and keep you updated. |
That one is related to image processing. Seems like a bug on (Windows?) 64 bit environment. Please open a new issue for that. |
@zdenop Please close this issue. Words in https://user-images.githubusercontent.com/30631253/29272525-b9762536-8100-11e7-8d82-0961dd49663b.png mentioned above in #1080 (comment) are not being dropped after patch from @amitdo was merged. |
Fixed in #1264. |
Environment
Current Behavior:
Tesseract skips words when doing OCR
OCR-Result: USA
Expected Behavior:
OCR-Result: USA/NORDKOREA
Suggested Fix:
not to skip words
Today I saw several times that tesseract skips words, sometimes in the middle of a paragraph.
e.g. tesseract test/1502433849760_1.png test/1502433849760 -l Latin
1502433849760.txt