LSTM: Words dropped during recognition #681
I would suggest that if the best match for a word comes out blank, it be replaced by a string such as @@@@@ so that it is easy to identify missing text and correct the OCR output. |
Other similar issues: |
Is it possibly related to |
I think the cause of all of these is the precision-recall tradeoff that takes place in linerec.cpp. The underlying question is: if there is a word that is almost certainly incorrect, would it be better to have it with the error, or to have it disappear? Historically Tesseract has not dropped words in this way, but when I tried it, I found that almost every word dropped was incorrect anyway, so the word-level precision was improved without hurting recall. I might disable the word dropping altogether, or maybe it would work better for deleting garbage if it required most of the characters to be bad instead of just the worst. While I am looking at this though, I am not convinced that the unicharset and/or compression are applied correctly to Kannada, which might explain its rather stubborn refusal to improve in accuracy. |
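A hedged sketch of the alternative floated above (drop a word only when most of its characters are bad, not when its single worst character is bad). The function name, thresholds, and scoring model are illustrative assumptions, not Tesseract's actual linerec.cpp logic:

```python
def should_drop_word(char_certainties, bad_threshold=-25.0, bad_fraction=0.5):
    """Drop a word only if at least `bad_fraction` of its characters score
    below `bad_threshold` (log-certainties, 0 is best).  This models the
    'most of the characters are bad' variant discussed above; names and
    values are illustrative, not taken from Tesseract."""
    if not char_certainties:
        return True  # an empty hypothesis has nothing worth keeping
    bad = sum(1 for c in char_certainties if c < bad_threshold)
    return bad / len(char_certainties) >= bad_fraction
```

Under this rule, a word with one badly degraded character (e.g. a smudged hyphen) survives, while a word that is mostly noise is still removed.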
I do not think it should disappear. However, if the word is almost certainly incorrect, then it should be marked in some easy way for users to fix the OCRed txt. Request feedback from others too - @zdenop @jbreiden @amitdo @stweil etc. |
I find that the words which are getting dropped are also the same ones which are not being picked up by Tesseract when using 'makebox'. I had posted a sample with Devanagari in another thread. (#664 (comment)) Here is a Kannada sample: |
Hi, yeah, @Shreeshrii is correct. The words are not being picked up. Is it a problem with segmentation? |
For |
It also might be useful to add a few (1-4) alternative words for each word when using the hOCR format. |
Please see
https://pdfs.semanticscholar.org/dc3e/f1e05b4b629de5db721efb156d82556ff362.pdf
The ISRI Analytic Tools for OCR Evaluation
A tilde (~) in an OCR-generated text file is treated as a reject character. A circumflex (^) is interpreted as a suspect marker and serves to mark the following character as suspect. For example, in Ne^vada, the v is marked as suspect. The value of these special characters is assessed when computing marked character efficiency.
I thought that this may be a standard in OCR evaluation and hence had suggested a marker.
Regardless, I do not think that incorrect words should just disappear.
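For illustration, a minimal marker pass in the ISRI style quoted above. The `(char, confidence)` input shape and the 0.5 cutoff are assumptions for the sketch, not part of the ISRI tools:

```python
def mark_suspects(chars, cutoff=0.5):
    """chars: list of (char, confidence) pairs, with char=None for a reject.
    Emits '~' for rejects and prefixes '^' to suspect characters, following
    the ISRI conventions described above."""
    out = []
    for ch, conf in chars:
        if ch is None:
            out.append('~')        # reject character: no usable hypothesis
        elif conf < cutoff:
            out.append('^' + ch)   # circumflex marks the next char as suspect
        else:
            out.append(ch)
    return ''.join(out)
```

Applied to "Nevada" with a low-confidence "v", this yields "Ne^vada", matching the ISRI example.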
|
@theraysmith Are there any specific issues that you have noticed which I can check with native Kannada speakers? |
Some feedback regarding Kannada recognition from MNS Rao
|
👍 to Shree and MNS Rao ! |
https://shreeshrii.github.io/tess4eval_kannada/ (CER 7.74). Images and gt are in |
@theraysmith What page segmentation mode do you use for the testing/accuracy reports? I am getting better results for Kannada with --psm 6 compared to --psm 3 (default) or --psm 4.
3 - Fully automatic page segmentation, but no OSD. (Default)
4 - Assume a single column of text of variable sizes.
6 - Assume a single uniform block of text. |
The tests are done with .uzn files. https://github.com/tesseract-ocr/tesseract/blob/a1c22fb/ccmain/pagesegmain.cpp#L111 |
Are there UNLV test files for Indian languages? |
|
If you are asking about the original UNLV dataset, the answer is 'No'. It's possible that someone prepared such files as part of an Indic dataset. |
@theraysmith Please see page 18 onwards for Kannada-specific info in the following pdf |
I made a discovery yesterday that the web-derived text corpus for Kannada is missing ZWNJ, which AFAICT is an *essential* Unicode character in Kannada. The same applies to other Indic languages, although the use varies, and some use ZWJ as well. I'm still working on this and investigating where they are lost. The implications for fixing it could be higher accuracy in several languages, although I don't know by how much, as I haven't measured the frequency of ZWNJ in my test sets.
…On Fri, Feb 17, 2017 at 12:45 AM, Shreeshrii wrote:
More Kannada OCR related papers:
http://mile.ee.iisc.ernet.in/mile/publications/softCopy/DocumentAnalysis/Madhav_SPCOM2014.pdf
http://mile.ee.iisc.ernet.in/mile/publications/softCopy/DocumentAnalysis/Nethra_ICFHR2010_Data.pdf
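Measuring how often ZWNJ/ZWJ actually occur in a corpus or test set is straightforward; a minimal sketch (the function name is mine):

```python
from collections import Counter

# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER
ZWNJ, ZWJ = '\u200c', '\u200d'

def joiner_stats(lines):
    """Count ZWNJ and ZWJ occurrences across an iterable of corpus lines."""
    counts = Counter(ZWNJ=0, ZWJ=0)
    for line in lines:
        counts['ZWNJ'] += line.count(ZWNJ)
        counts['ZWJ'] += line.count(ZWJ)
    return counts
```

A web-derived corpus that reports zero for both, in a language known to use them, is a strong hint that a cleanup step stripped them.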
|
Yes, both ZWJ and ZWNJ are important for Indic languages. Please see http://unicode.org/faq/indic.html. If the sequence U+0924, U+094D is not followed by another consonant letter (such as "na"), it is always displayed as a full ta glyph combined with the virama glyph ("dev-ta-virama"). |
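To make the FAQ example concrete, a small sketch using the code points quoted above (the variable names are mine):

```python
import unicodedata

# Code points from the Unicode FAQ example above:
TA, VIRAMA, ZWNJ, NA = '\u0924', '\u094d', '\u200c', '\u0928'

# Without ZWNJ, ta + virama + na is normally shaped as a conjunct;
# inserting ZWNJ requests a full ta glyph with a visible virama instead.
conjunct = TA + VIRAMA + NA
explicit = TA + VIRAMA + ZWNJ + NA

# The two sequences are distinct text, and NFC normalization preserves ZWNJ,
# so a corpus that loses it has changed the content, not just a rendering hint.
assert conjunct != explicit
assert unicodedata.normalize('NFC', explicit) == explicit
```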
There are at times multiple ways of typing a character; some of the web text may look OK but may not be correct. There probably needs to be a normalisation step before training. Please see the Kannada chapter in the Unicode Standard (Table 12-28, Kannada Vowel Letters): vowel letters are encoded atomically in Unicode, even if they can be analyzed as multiple parts. |
Related - #604 |
Related - Marathi recognition of repha (Sanskrit loan words) and eyelash ra. I found that when using Marathi traineddata, words which used half ra (repha) were not being recognized correctly; it could be related to the ZWJ and ZWNJ problem, e.g. पूर्वक सूर्य धर्म सर्व कार्य वर्ग. Since Unicode has evolved over time, there may be legacy representations still around in the webtext. Please see issue 7 listed on http://www.baraha.com/help/kb/unicode_issues.htm which has examples of different Unicode encodings being used. |
Common normalization in the above includes
In case the LSTM training data build uses something similar on the webtext, you may want to disable that. |
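One way ZWNJ can be lost silently is a cleanup pass that strips all Unicode format (Cf) characters, a common step in web-text scrubbers. This is a hedged sketch of the failure mode, not the actual training-text pipeline:

```python
import unicodedata

ZWNJ = '\u200c'  # general category Cf (format)

def strip_format_chars(text):
    """A typical over-aggressive cleanup step: remove every Cf character.
    ZWNJ and ZWJ are both category Cf, so they vanish along with genuine
    debris such as directional marks."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

sample = '\u0924\u094d' + ZWNJ + '\u0928'   # ta + virama + ZWNJ + na
cleaned = strip_format_chars(sample)         # ZWNJ is gone
```

After cleaning, the sequence collapses to the conjunct form, so any such step in the training-text build would need to whitelist ZWNJ/ZWJ rather than be disabled wholesale.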
See #664 for details about dropped words during Devanagari recognition. |
@theraysmith Are there any config values I can change so that words are not dropped? |
Ok, I changed some constants and now words are dropped only rarely. @theraysmith Is this the right approach? It would be good if these values could be changed via config variables rather than needing a recompile to test different values. I changed the following:
Line 32 in a1c22fb: changed from -20.0f to -50.0f
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/linerec.cpp#L36-40: changed from 5.0f to 0.0f
changed from -25.0f to -99.0f
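The direction of these changes can be illustrated with a toy model (purely illustrative; the actual checks in linerec.cpp are more involved than a single threshold comparison):

```python
def kept(word_certainty, threshold):
    """Toy model: a word survives only if its certainty clears the threshold."""
    return word_certainty >= threshold

uncertain_word = -30.0  # e.g. a word containing one smudged character

# Lowering the cutoff (e.g. -20.0f -> -50.0f above) makes dropping rarer:
assert not kept(uncertain_word, -20.0)  # stricter cutoff: word dropped
assert kept(uncertain_word, -50.0)      # relaxed cutoff: word kept
```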
|
In addition to training for unknown fonts by using scanned box/tiff pairs, they would also be useful for 'printing conventions' which may not be in concert with current Unicode conventions. See the attached image, where the anusvara is printed before the reph, whereas the valid rendering would display the anusvara later. Training by using only synthetic images and valid cases will not cover such cases. Currently these words get dropped during recognition. |
Do the changes so far address the missing text / dropped words issue? Should I test these or wait for new models? |
I don't think new models can help with this issue. |
This feature drops perfect words! I was also hit by this issue. Two examples from one page of an old (1929) Hebrew newspaper (GT images attached):
In both examples there are two words separated by a hyphen. The hyphen looks unclear, and thus Tesseract replaces it with another character. Other than this character, all the characters in the two words are recognized well. |
This block of code, https://github.com/tesseract-ocr/tesseract/blob/3ec11bd37a56/ccmain/linerec.cpp#L293, should be replaced with: |
|
Tested on a few pages. Seems to be working well. |
@amitdo Have you created a PR for this? |
No, I haven't. |
@zc813 Please share a sample image for testing. |
Missing lines are probably due to errors in page layout analysis. I suggest you open a new issue and include these samples there. Thanks. |
If PSM 11 is giving you better results, use that. Or, if using the API, try to OCR one line at a time. |
Please see attached OCR evaluation reports. The words highlighted in green in the ground truth are being dropped during recognition.
kan-words-dropped.zip