LSTM: Words dropped during recognition #681
I would suggest that if the best match for a word comes out blank, it be replaced by a string such as @@@@@ so that it is easy to identify missing text and correct the OCR output. |
Other similar issues: |
Is it possibly related to |
I think the cause of all of these is the precision-recall tradeoff that takes place in linerec.cpp. The underlying question is: if there is a word that is almost certainly incorrect, would it be better to have it with the error, or to have it disappear? Historically Tesseract has not dropped words in this way, but when I tried it, I found that almost every word dropped was incorrect anyway, so the word-level precision was improved without hurting recall. I might disable the word dropping altogether, or maybe it would work better for deleting garbage if it required most of the characters to be bad instead of just the worst. While I am looking at this though, I am not convinced that the unicharset and/or compression are applied correctly to Kannada, which might explain its rather stubborn refusal to improve in accuracy. |
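A hedged sketch of the alternative floated above (drop a word only when most of its characters are bad, not when its single worst character is bad). The function name, thresholds, and scoring model are illustrative assumptions, not Tesseract's actual linerec.cpp logic:

```python
def should_drop_word(char_certainties, bad_threshold=-25.0, bad_fraction=0.5):
    """Drop a word only if at least `bad_fraction` of its characters score
    below `bad_threshold` (log-certainties, 0 is best).  This models the
    'most of the characters are bad' variant discussed above; names and
    values are illustrative, not taken from Tesseract."""
    if not char_certainties:
        return True  # an empty hypothesis has nothing worth keeping
    bad = sum(1 for c in char_certainties if c < bad_threshold)
    return bad / len(char_certainties) >= bad_fraction
```

Under this rule, a word with one badly degraded character (e.g. a smudged hyphen) survives, while a word that is mostly noise is still removed.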
I do not think it should disappear. However, if the word is almost certainly incorrect, then it should be marked in some easy way for users to fix the OCRed txt. Request feedback from others too - @zdenop @jbreiden @amitdo @stweil etc. |
I find that the words which are getting dropped are also the same ones which are not being picked up by Tesseract when using 'makebox'. I had posted a sample with Devanagari in another thread. (#664 (comment)) Here is a Kannada sample: |
Hi, yeah, @Shreeshrii is correct. The words are not being picked up. Is it a problem with segmentation? |
For |
It also might be useful to add a few (1-4) alternative words for each word when using the hOCR format. |
Please see
https://pdfs.semanticscholar.org/dc3e/f1e05b4b629de5db721efb156d82556ff362.pdf
The ISRI Analytic Tools for OCR Evaluation
A tilde (~) in an OCR-generated text file is treated as a reject character. A circumflex (^) is interpreted as a suspect marker and serves to mark the following character as suspect. For example, in Ne^vada, the v is marked as suspect. The value of these special characters is assessed when computing marked character efficiency.
I thought that this may be a standard in OCR evaluation and hence had suggested a marker.
Regardless, I do not think that incorrect words should just disappear.
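For illustration, a minimal marker pass in the ISRI style quoted above. The `(char, confidence)` input shape and the 0.5 cutoff are assumptions for the sketch, not part of the ISRI tools:

```python
def mark_suspects(chars, cutoff=0.5):
    """chars: list of (char, confidence) pairs, with char=None for a reject.
    Emits '~' for rejects and prefixes '^' to suspect characters, following
    the ISRI conventions described above."""
    out = []
    for ch, conf in chars:
        if ch is None:
            out.append('~')        # reject character: no usable hypothesis
        elif conf < cutoff:
            out.append('^' + ch)   # circumflex marks the next char as suspect
        else:
            out.append(ch)
    return ''.join(out)
```

Applied to "Nevada" with a low-confidence "v", this yields "Ne^vada", matching the ISRI example.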
|
@theraysmith Are there any specific issues that you have noticed which I can check with native Kannada speakers? |
Some feedback regarding Kannada recognition from MNS Rao
|
👍 to Shree and MNS Rao ! |
https://shreeshrii.github.io/tess4eval_kannada/ (CER 7.74). Images and gt are in |
@theraysmith What page segmentation mode do you use for the testing/accuracy reports? I am getting better results for Kannada with --psm 6 compared to --psm 3 (default) or --psm 4.
3 - Fully automatic page segmentation, but no OSD. (Default)
4 - Assume a single column of text of variable sizes.
6 - Assume a single uniform block of text. |
The tests are done with .uzn files. https://github.com/tesseract-ocr/tesseract/blob/a1c22fb/ccmain/pagesegmain.cpp#L111 |
Are there UNLV test files for Indian languages? |
|
If you are asking about the original UNLV dataset, the answer is 'No'. It's possible that someone prepared such files as part of an Indic dataset. |
@theraysmith Please see page 18 onwards for Kannada-specific info in the following pdf |
I made a discovery yesterday that the web-derived text corpus for Kannada is missing ZWNJ, which AFAICT is an *essential* Unicode character in Kannada. The same applies to other Indic languages, although the use varies, and some use ZWJ as well. I'm still working on this and investigating where they are lost. The implications for fixing it could be higher accuracy in several languages, although I don't know by how much, as I haven't measured the frequency of ZWNJ in my test sets.
…On Fri, Feb 17, 2017 at 12:45 AM, Shreeshrii wrote:
More Kannada OCR related papers:
http://mile.ee.iisc.ernet.in/mile/publications/softCopy/DocumentAnalysis/Madhav_SPCOM2014.pdf
http://mile.ee.iisc.ernet.in/mile/publications/softCopy/DocumentAnalysis/Nethra_ICFHR2010_Data.pdf
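Measuring how often ZWNJ/ZWJ actually occur in a corpus or test set is straightforward; a minimal sketch (the function name is mine):

```python
from collections import Counter

# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER
ZWNJ, ZWJ = '\u200c', '\u200d'

def joiner_stats(lines):
    """Count ZWNJ and ZWJ occurrences across an iterable of corpus lines."""
    counts = Counter(ZWNJ=0, ZWJ=0)
    for line in lines:
        counts['ZWNJ'] += line.count(ZWNJ)
        counts['ZWJ'] += line.count(ZWJ)
    return counts
```

A web-derived corpus that reports zero for both, in a language known to use them, is a strong hint that a cleanup step stripped them.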
|
Yes, both ZWJ and ZWNJ are important for Indic languages. Please see http://unicode.org/faq/indic.html. If the sequence U+0924, U+094D is not followed by another consonant letter (such as "na"), it is always displayed as a full ta glyph combined with the virama glyph ("dev-ta-virama"). |
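To make the FAQ example concrete, a small sketch using the code points quoted above (the variable names are mine):

```python
import unicodedata

# Code points from the Unicode FAQ example above:
TA, VIRAMA, ZWNJ, NA = '\u0924', '\u094d', '\u200c', '\u0928'

# Without ZWNJ, ta + virama + na is normally shaped as a conjunct;
# inserting ZWNJ requests a full ta glyph with a visible virama instead.
conjunct = TA + VIRAMA + NA
explicit = TA + VIRAMA + ZWNJ + NA

# The two sequences are distinct text, and NFC normalization preserves ZWNJ,
# so a corpus that loses it has changed the content, not just a rendering hint.
assert conjunct != explicit
assert unicodedata.normalize('NFC', explicit) == explicit
```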
There are at times multiple ways of typing a character; some of the web text may look OK but may not be correct. There probably needs to be a normalisation step before training. Please see the Kannada chapter in the Unicode Standard (Table 12-28, Kannada Vowel Letters): vowel letters are encoded atomically in Unicode, even if they can be analyzed as multiple parts. |
Related - #604 |
Related - Marathi recognition of repha (Sanskrit loan words) and eyelash ra. I found that when using Marathi traineddata, words which used half ra (repha) were not being recognized correctly; it could be related to the ZWJ and ZWNJ problem, e.g. पूर्वक सूर्य धर्म सर्व कार्य वर्ग. Since Unicode has evolved over time, there may be legacy representations still around in the webtext. Please see issue 7 listed on http://www.baraha.com/help/kb/unicode_issues.htm which has examples of different Unicode encodings being used. |
Common normalization in the above includes
In case the LSTM training data build uses something similar on the webtext, you may want to disable that. |
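One way ZWNJ can be lost silently is a cleanup pass that strips all Unicode format (Cf) characters, a common step in web-text scrubbers. This is a hedged sketch of the failure mode, not the actual training-text pipeline:

```python
import unicodedata

ZWNJ = '\u200c'  # general category Cf (format)

def strip_format_chars(text):
    """A typical over-aggressive cleanup step: remove every Cf character.
    ZWNJ and ZWJ are both category Cf, so they vanish along with genuine
    debris such as directional marks."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

sample = '\u0924\u094d' + ZWNJ + '\u0928'   # ta + virama + ZWNJ + na
cleaned = strip_format_chars(sample)         # ZWNJ is gone
```

After cleaning, the sequence collapses to the conjunct form, so any such step in the training-text build would need to whitelist ZWNJ/ZWJ rather than be disabled wholesale.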
See #664 for details about dropped words during Devanagari recognition. |
@theraysmith Are there any config values I can change so that words are not dropped? |
Ok, I changed some constants and now words are dropped only rarely. @theraysmith Is this the right approach? It would be good if these values could be changed via config variables rather than needing a recompile to test different values. I changed the following:
Line 32 in a1c22fb: changed from -20.0f to -50.0f
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/linerec.cpp#L36-40: changed from 5.0f to 0.0f
changed from -25.0f to -99.0f
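The direction of these changes can be illustrated with a toy model (purely illustrative; the actual checks in linerec.cpp are more involved than a single threshold comparison):

```python
def kept(word_certainty, threshold):
    """Toy model: a word survives only if its certainty clears the threshold."""
    return word_certainty >= threshold

uncertain_word = -30.0  # e.g. a word containing one smudged character

# Lowering the cutoff (e.g. -20.0f -> -50.0f above) makes dropping rarer:
assert not kept(uncertain_word, -20.0)  # stricter cutoff: word dropped
assert kept(uncertain_word, -50.0)      # relaxed cutoff: word kept
```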
|
In addition to training for unknown fonts by using scanned box/tiff pairs, they would also be useful for 'printing conventions' which may not be in concert with current Unicode conventions. See the attached image, where the anusvara is printed before the reph, whereas the valid rendering would display the anusvara later. Training by using only synthetic images and valid cases will not cover such cases. Currently these words get dropped during recognition. |
Do the changes so far address the missing text / dropped words issue? Should I test these or wait for new models? |
I don't think new models can help with this issue. |
This feature drops perfect words! I was also hit by this issue. Two examples from one page of an old (1929) Hebrew newspaper (GT images attached):
In both examples there are two words separated by a hyphen. The hyphen looks unclear, and thus Tesseract replaces it with another character. Other than this character, all the characters in the two words are recognized well. |
This block of code, https://github.com/tesseract-ocr/tesseract/blob/3ec11bd37a56/ccmain/linerec.cpp#L293, should be replaced with: |
|
Tested on a few pages. Seems to be working well. |
@amitdo Have you created a PR for this? |
No, I haven't. |
@zc813 Please share a sample image for testing. |
Missing lines are probably due to errors in page layout analysis. I suggest you open a new issue and include these samples there. Thanks. |
If PSM 11 is giving you better results, use that. Or, if using the API, try to OCR one line at a time. |
Please see attached OCR evaluation reports. The words highlighted in green in the ground truth are being dropped during recognition.
kan-words-dropped.zip