LSTM: Words dropped during recognition #681

Closed
Shreeshrii opened this issue Jan 26, 2017 · 40 comments · Fixed by #1264

@Shreeshrii
Collaborator

Shreeshrii commented Jan 26, 2017

Please see attached OCR evaluation reports. The words highlighted in green in the ground truth are being dropped during recognition.

kan-words-dropped.zip

@Shreeshrii
Collaborator Author

Processing word with lang kan at:Bounding box=(163,2011)->(1231,2056)
Trying word using lang kan, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : ಪ್ರೆಸ್, : R=3.81155, C=-2.00782, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM	NORM
str	ಪ್	ರೆ	ಸ್	,
state:	1 	1 	1 	1 
C	-0.137	-0.113	-0.287	-0.108
Best choice: accepted=1, adaptable=0, done=1 : Lang result :        : R=60, C=-1, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM	NORM	NORM	NORM
str	 	 	 	 	 	 
state:	1 	1 	1 	1 	1 	1 
C	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000

I would suggest that if the best match for a word turns out to be blank, it should be replaced by a placeholder string such as @@@@@ so that missing text is easy to identify and correct in the OCR output.

@Shreeshrii
Collaborator Author

Other similar issues:

#673

#664

#633 (comment)

@Shreeshrii
Collaborator Author

@theraysmith

Is it possibly related to --strip_unrenderable_words during training? I have noticed that the images created by text2image have a blank space instead of the missing word when the font is not able to render it.

@theraysmith
Contributor

I think the cause of all of these is the precision-recall tradeoff that takes place in linerec.cpp
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/linerec.cpp#L313

The underlying question is, if there is a word that is almost certainly incorrect, would it be better to have it with the error, or have it disappear?

Historically Tesseract has not dropped words in such a way, but when I tried it, I found that almost every word dropped was incorrect anyway, so the word-level precision was improved without hurting recall.
Character-level recall OTOH is reduced by this word-dropping, since the rest of the characters in a word are usually correct.

I might disable the word dropping altogether, or maybe it would work better for deleting garbage if a word were dropped only when most of its characters are bad, instead of just the single worst one.

While I am looking at this though, I am not convinced that the unicharset and/or compression are applied correctly to Kannada, which might explain its rather stubborn refusal to improve in accuracy.
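
A minimal sketch of that alternative heuristic (not Tesseract code; the helper name and threshold are hypothetical): drop a word only when most of its per-character certainties are bad, rather than when a single worst character crosses the threshold.

    // Sketch only: char_certainties would come from the per-character "C" values
    // shown in the debug output above; bad_char_certainty is a hypothetical cutoff.
    #include <vector>

    bool ShouldDropWord(const std::vector<float>& char_certainties,
                        float bad_char_certainty = -2.0f) {
      if (char_certainties.empty()) return true;  // nothing was recognized at all
      int bad = 0;
      for (float c : char_certainties) {
        if (c < bad_char_certainty) ++bad;        // count clearly bad characters
      }
      // Drop only if more than half of the characters are bad.
      return 2 * bad > static_cast<int>(char_certainties.size());
    }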

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 28, 2017

The underlying question is, if there is a word that is almost certainly incorrect, would it be better to have it with the error, or have it disappear?

I do not think it should disappear. However, if the word is almost certainly incorrect, then it should be marked in some easy way for users to fix the OCRed text.

Request feedback from others too - @zdenop @jbreiden @amitdo @stweil etc.

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 28, 2017

I think the cause of all of these is the precision-recall tradeoff that takes place in linerec.cpp

I find that the words which are getting dropped are also the same ones that are not being picked up by Tesseract when using 'makebox'. I had posted a sample with Devanagari in another thread (#664 (comment)).

Here is a Kannada sample:

[Attached images: kan box missing, kan recognition missing]

@harinath141

Hi,

Yes, @Shreeshrii is correct, the words are not being picked up. Is it a problem with segmentation?
@theraysmith Even if the recognition is wrong, the word should be displayed with some alternate character, e.g. "-" or anything else...

@stweil
Member

stweil commented Feb 7, 2017

The underlying question is, if there is a word that is almost certainly incorrect, would it be better to have it with the error, or have it disappear?

I do not think it should disappear. However, if the word is almost certainly incorrect, then it should be marked in some easy way for users to fix the OCRed text.

For hOCR output, the answer is simple: output the word and add the information on the certainty as additional information (I think this is already done). For other formats which don't support such additional information, a special textual mark might be considered, but I consider that a very special case (so no special handling would be fine for me, too).
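
For reference, Tesseract's hOCR output already records a per-word confidence in the x_wconf property of each word span; a typical word looks roughly like this (coordinates and value illustrative):

    <span class='ocrx_word' id='word_1_7' title='bbox 163 2011 1231 2056; x_wconf 31'>ಪ್ರೆಸ್,</span>

Downstream tools can then flag low-confidence words for review instead of having them silently dropped.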

@amitdo
Collaborator

amitdo commented Feb 7, 2017

It might also be useful to add a few (1-4) alternative words for each word when using the hOCR format.

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 7, 2017 via email

@Shreeshrii
Collaborator Author

I am not convinced that the unicharset and/or compression are applied correctly to Kannada, which might explain its rather stubborn refusal to improve in accuracy.

@theraysmith Are there any specific issues that you have noticed which I can check with native Kannada speakers?

@Shreeshrii
Collaborator Author

Some feedback regarding Kannada recognition from MNS Rao

I request you to analyse where special efforts are required to improve the program.

Essentially, the Kannada script has many problems from an OCR point of view:

1. Many characters differ only very slightly, making recognition difficult.
eg: ಅ ಆ; ಉ ಊ; ಎ ಏ ಐ; ಒ ಓ ಔ; ಅಂ ಅಃ
     ಡ ಢ; ದ ಧ ಥ; ರ ಠ ಝ; ಪ ಫ ಘ ; ಬ ಭ ; ವ ಮ ; ೦ ಂ ; ಕ೯ರ್ಕ; ೬ ಕ್ಮ

2. Inconsistencies in guNita formations.
eg: ವು ಪು

3. The vottu of ತ ನ ಮ ಯ ರ ಲ does not resemble the main character.

4. ಯ ಝ ಮ can lead to wrong recognition because the OCR process splits parts of the character.

5. ೕ appears as a component in three different situations: ಕೀ, ಕೇ, ಕೋ

I would like to know if the input from my side is helpful to improve the program.

Regards.
MNS Rao

@amitdo
Collaborator

amitdo commented Feb 10, 2017

👍 to Shree and MNS Rao !
Hope that Ray can make something out of this feedback.

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 11, 2017

https://shreeshrii.github.io/tess4eval_kannada/
has OCR eval reports for 4.0.0-alpha kan.traineddata

CER 7.74
WER 9.26
WER (order independent) 5.70

Images and gt are in
https://github.com/Shreeshrii/tess4eval_kannada

@Shreeshrii
Collaborator Author

@theraysmith What page segmentation mode do you use for the testing/accuracy reports?

I am getting better results for Kannada with --psm 6 compared to --psm 3 (default) or --psm 4.

3 Fully automatic page segmentation, but no OSD. (Default)

4 Assume a single column of text of variable sizes.

6 Assume a single uniform block of text.
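
For anyone reproducing these numbers, the mode is selected with the --psm flag; an illustrative invocation (file names are placeholders):

    tesseract page.tif output --psm 6 -l kan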

@amitdo
Collaborator

amitdo commented Feb 12, 2017

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 12, 2017 via email

@amitdo
Collaborator

amitdo commented Feb 12, 2017

If you are asking about the original UNLV dataset, the answer is 'No'.

It's possible that someone prepared such files as part of an Indic dataset.

@Shreeshrii
Collaborator Author

@theraysmith Please see page 18 onwards for Kannada-specific info in the following PDF:

http://tdil-dc.in/tdildcMain/articles/644564990964Kannada%20Script%20Grammar%20TDIL%20Version_Ver1.0.pdf

@theraysmith
Contributor

theraysmith commented Feb 17, 2017 via email

@Shreeshrii
Collaborator Author

Yes, both ZWJ and ZWNJ are important for Indic languages. Please see

http://unicode.org/faq/indic.html

If the sequence U+0924, U+094D is not followed by another consonant letter (such as "na"), it is always displayed as a full ta glyph combined with the virama glyph "dev-ta-virama".
Unicode provides a way to force the display engine to show a half letter form. To do this, an invisible character called ZERO WIDTH JOINER should be inserted after the virama:
U+0924 त DEVANAGARI LETTER TA
U+094D ् DEVANAGARI SIGN VIRAMA (= halant)
U+200D ZERO WIDTH JOINER
U+0928 न DEVANAGARI LETTER NA
This sequence is always displayed as a half ta glyph followed by a full na glyph "dev-half-ta-na". Even if the consonant "na" is not present, the sequence U+0924, U+094D, U+200D is displayed as a half ta glyph "dev-half-ta".
Unicode also provides a way to force the display engine to show the virama glyph. To do this, an invisible character called ZERO WIDTH NON-JOINER should be inserted after the virama:
U+0924 त DEVANAGARI LETTER TA
U+094D ् DEVANAGARI SIGN VIRAMA (= halant)
U+200C ZERO WIDTH NON-JOINER
U+0928 न DEVANAGARI LETTER NA
This sequence is always displayed as a full ta glyph combined with a virama glyph and followed by a full na glyph "dev-full-ta-virama-full-na".
For more detailed information, see Chapter 12, South Asian Scripts-I in The Unicode Standard. For related issues, see "Where is My Character?" [MC]

@Shreeshrii
Collaborator Author

There are at times multiple ways of typing a character; some of the web text may look OK but may not be correct. There probably needs to be a normalisation step before training.

Please see kannada chapter in
http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf

Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. Table 12-28 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.

Table 12-28. Kannada Vowel Letters

For	Use	Do Not Use
ಊ	0C8A	<0C89, 0CBE>
ಔ	0C94	<0C92, 0CCC>
ೠ	0CE0	<0C8B, 0CBE>

@Shreeshrii
Collaborator Author

Related - #604

@Shreeshrii
Collaborator Author

Related - Marathi recognition of repha (Sanskrit loan words) and eyelash ra

@theraysmith

I found that, when using the Marathi traineddata, words which used half ra (repha) were not being recognized correctly; it could be related to the ZWJ and ZWNJ problem.

eg. पूर्वक सूर्य धर्म सर्व कार्य वर्ग

Since Unicode has evolved over time, there may be legacy representations still around in the webtext.

Please see issue 7 listed on http://www.baraha.com/help/kb/unicode_issues.htm which has examples of different Unicode encodings being used.

@Shreeshrii
Collaborator Author

@theraysmith

https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/src/indicnlp/normalize/indic_normalize.py

Common normalization steps in the above include:

  • ZERO_WIDTH_NON_JOINER and ZERO_WIDTH_JOINER removal

If the LSTM training data build uses something similar on the webtext, you may want to disable that.
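
To make the concern concrete, here is a rough illustration (not code from indic_nlp_library or Tesseract) of what such a normalization step does; stripping U+200C/U+200D collapses the ta + virama + ZWJ and ta + virama + ZWNJ sequences quoted earlier into the same string, so a model trained on normalized text never sees the distinction:

    // Hypothetical sketch of an aggressive normalizer: removes ZERO WIDTH
    // NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D) from a UTF-32 string.
    #include <string>

    std::u32string StripJoiners(const std::u32string& text) {
      std::u32string out;
      out.reserve(text.size());
      for (char32_t c : text) {
        if (c != U'\u200C' && c != U'\u200D') out.push_back(c);
      }
      return out;
    }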

Shreeshrii changed the title from "LSTM: Words dropped during Kannada recognition" to "LSTM: Words dropped during recognition" on May 11, 2017
@Shreeshrii
Collaborator Author

See #664 for details about dropped words during Devanagari recognition.

@Shreeshrii
Collaborator Author

I think the cause of all of these is the precision-recall tradeoff that takes place in linerec.cpp
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/linerec.cpp#L313

@theraysmith Are there any config values I can change so that words are not dropped?

@Shreeshrii
Collaborator Author

Shreeshrii commented May 22, 2017

OK, I changed some constants and now words are dropped only rarely.

@theraysmith Is this the right approach? It would be good if these values could be changed via config variables rather than needing a recompile to test different values.

I changed the following:

RecodeBeamSearch::kMinCertainty, changed from -20.0f to -50.0f:

// Clipping value for certainty inside Tesseract. Reflects the minimum value
// of certainty that will be returned by ExtractBestPathAsUnicharIds.
// Supposedly on a uniform scale that can be compared across languages and
// engines.
const float RecodeBeamSearch::kMinCertainty = -50.0f;

In https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/linerec.cpp#L36-40, kNonDictionaryPenalty changed from 5.0f to 0.0f:

// Arbitarary penalty for non-dictionary words.
// TODO(rays) How to learn this?
const float kNonDictionaryPenalty = 0.0f;

kWorstDictCertainty changed from -25.0f to -99.0f:

// Worst acceptable certainty for a dictionary word.
const float kWorstDictCertainty = -99.0f;
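
A rough sketch of one way these could become runtime-settable parameters instead of compile-time constants, assuming the global double_VAR macro from ccutil/params.h is usable in linerec.cpp (the parameter names below are hypothetical, not existing Tesseract variables):

    // Hypothetical parameters (sketch only); defaults mirror the current constants.
    double_VAR(lstm_min_certainty, -20.0, "Worst certainty kept for any word");
    double_VAR(lstm_non_dict_penalty, 5.0, "Penalty for non-dictionary words");
    double_VAR(lstm_worst_dict_certainty, -25.0,
               "Worst certainty kept for a dictionary word");

The word-dropping checks in linerec.cpp would then read these parameters instead of the constants, so different values could be tried from a config file.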

@Shreeshrii
Collaborator Author

@theraysmith

Will changes in training also support scanned box/tiff pairs?

In addition to training for unknown fonts by using scanned box/tiff pairs, they would also be useful for 'printing conventions' which may not be in concert with the current Unicode conventions.

See attached image, where the anusvar is printed before the reph, whereas the valid rendering would display the anusvar later. Training only on synthetic images and valid cases will not cover such cases. Currently these words get dropped during recognition.

[Attached image: srisubodhini00vall_0013]

@Shreeshrii
Collaborator Author

@theraysmith

Do the changes so far address the missing text / dropped words issue? Should I test these or wait for new models?

@amitdo
Collaborator

amitdo commented Jul 30, 2017

I don't think new models can help with this issue.

@amitdo
Collaborator

amitdo commented Aug 15, 2017

@theraysmith,

This feature drops perfect words!
#1080 (comment)

I was also hit by this issue.

Two examples from one page of an old (1929) Hebrew newspaper:

GT:
מגוונת-דעות.
OCR:
מגוונת.דעות.

GT:
מצב-רוחותיהם
OCR:
מצבברוחותיהם

Best choice certainty=-2.90379, space=-0.193423, scaled=-20.3265, final=-20.3265
 : .תועד.תנווגמ : R=11.8952, C=-2.90379, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM
str	.	ת	ו	ע	ד	.	ת	נ	ו	ו	ג	מ
state:	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 
C	-0.295	-0.192	-0.192	-0.192	-0.191	-2.904	-0.197	-0.192	-0.197	-0.230	-0.193	-0.192
Deleting word with certainty -20.3265

Best choice certainty=-3.59027, space=-0.21691, scaled=-25.1319, final=-25.1319
 : םהיתוחורבבצמ : R=14.0075, C=-3.59027, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM
str	ם	ה	י	ת	ו	ח	ו	ר	ב	ב	צ	מ
state:	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 
C	-0.193	-0.193	-0.191	-0.194	-0.196	-0.192	-0.192	-0.201	-3.590	-0.217	-0.207	-0.202
Deleting word with certainty -25.1319
 : םהיתוחורבבצמ : 

In both examples there are two words separated by a hyphen. The hyphen looks unclear, so Tesseract replaces it with another character. Other than this character, all the characters in the two words are recognized well.

@amitdo
Collaborator

amitdo commented Aug 15, 2017

I didn't test it, but this is probably the way to disable this feature:

https://github.com/tesseract-ocr/tesseract/blob/3ec11bd37a56/ccmain/linerec.cpp#L293

This block of code

      // Discard words that are impossibly bad, but allow a bit more for
      // dictionary words, and keep bad words in non-space-delimited langs.
      if (word_certainty >= RecodeBeamSearch::kMinCertainty ||
          any_nonspace_delimited ||
          (word_certainty >= kWorstDictCertainty &&
           Dict::valid_word_permuter(word->best_choice->permuter(), true))) {
        word->tess_accepted = stopper_dict->AcceptableResult(word);
      } else {
        if (getDict().stopper_debug_level >= 1) {
          tprintf("Deleting word with certainty %g\n", word_certainty);
          word->best_choice->print();
        }
        // It is a dud.
        word->SetupFake(lstm_recognizer_->GetUnicharset());
      }

Should be replaced with:

      word->tess_accepted = stopper_dict->AcceptableResult(word);

@amitdo
Collaborator

amitdo commented Aug 15, 2017

Tested on a few pages. Seems to be working well.

@Shreeshrii
Collaborator Author

@amitdo Have you created a PR for this?

@amitdo
Collaborator

amitdo commented Jan 9, 2018

No, I haven't.
I will send a PR later this week.

@Shreeshrii
Collaborator Author

@zc813 Please share a sample image for testing.

@Shreeshrii
Collaborator Author

Missing lines are probably due to errors in page layout analysis. I suggest you open a new issue and include these samples there. Thanks.

@Shreeshrii
Collaborator Author

If PSM 11 is giving you better results, use that.

Or, if using the API, try to OCR one line at a time.
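
If you go the API route, here is a minimal sketch of line-at-a-time OCR (language, file path and the two-pass structure are assumptions, not a recommended recipe): run layout analysis once to collect text-line boxes, then recognize each box as a single line (PSM 7).

    // Sketch only: pass 1 collects text-line boxes via layout analysis,
    // pass 2 recognizes each box with PSM_SINGLE_LINE.
    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>
    #include <cstdio>
    #include <vector>

    struct LineBox { int left, top, right, bottom; };

    int main() {
      tesseract::TessBaseAPI api;
      if (api.Init(nullptr, "kan")) return 1;      // "kan" is a placeholder language
      Pix* image = pixRead("page.tif");            // placeholder input path
      if (image == nullptr) return 1;
      api.SetImage(image);

      // Pass 1: layout analysis only; collect text-line bounding boxes.
      std::vector<LineBox> lines;
      api.SetPageSegMode(tesseract::PSM_AUTO);
      tesseract::PageIterator* it = api.AnalyseLayout();
      if (it != nullptr) {
        do {
          LineBox b;
          if (it->BoundingBox(tesseract::RIL_TEXTLINE,
                              &b.left, &b.top, &b.right, &b.bottom)) {
            lines.push_back(b);
          }
        } while (it->Next(tesseract::RIL_TEXTLINE));
        delete it;
      }

      // Pass 2: OCR one line at a time.
      api.SetPageSegMode(tesseract::PSM_SINGLE_LINE);
      for (const LineBox& b : lines) {
        api.SetRectangle(b.left, b.top, b.right - b.left, b.bottom - b.top);
        char* text = api.GetUTF8Text();
        printf("%s", text);
        delete[] text;
      }

      api.End();
      pixDestroy(&image);
      return 0;
    }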
