Latin.traineddata(best) - Words missing in OCR #1080

TheSeiko · 2017-08-14T12:27:47Z

Environment

Tesseract Version: tesseract-ocr-setup-4.0.0-alpha.20170804.exe - Latin.traineddata from best
Platform: Win10 64bit

Current Behavior:

Tesseract skips words when doing OCR.

OCR-Result: USA

Expected Behavior:

OCR-Result: USA/NORDKOREA

Suggested Fix:

Do not skip words.

Today I saw Tesseract skip words several times, sometimes in the middle of a paragraph.

For example: tesseract test/1502433849760_1.png test/1502433849760 -l Latin

1502433849760.txt

amitdo · 2017-08-14T12:43:32Z

Known problem for all langs/scripts
#681

amitdo · 2017-08-14T12:53:38Z

In your case the problem is probably in the layout analysis stage.

TheSeiko · 2017-08-14T12:56:26Z

I've another example:

tesseract test/1502442621178.png test/1502442621178 -l Latin

Here a word in the middle of the sentence is skipped.

1502442621178.txt

amitdo · 2017-08-14T12:57:09Z

Try cutting the non-text areas with gimp and retest.

TheSeiko · 2017-08-14T13:00:01Z

Only the red areas are processed by Tesseract; each one is written to a separate PNG file. (I've uploaded the separate files as well, along with the resulting output files.)

TheSeiko · 2017-08-14T13:20:37Z

The interesting thing is, two similar frames were processed without problems:

amitdo · 2017-08-14T13:37:15Z

> Here a word in the middle of the sentence is skipped.

Which word?

Might be related to #681 in this case.

TheSeiko · 2017-08-14T13:56:25Z

Pöllan - the ö exists in unicharset

Thank you for the link to #681

amitdo · 2017-08-14T14:31:52Z

Try to debug with stopper_debug_level=2

https://github.com/tesseract-ocr/tesseract/blob/3ec11bd37a56/ccmain/linerec.cpp#L293
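One way to capture that debug output, sketched in Python. It assumes the `tesseract` binary is on the PATH and that the debug lines arrive on stderr (which should be the case unless `debug_file` is set). The helper names `build_command` and `run_with_debug` are hypothetical. `-c VAR=VALUE` is the command-line form for setting a config variable such as `stopper_debug_level`.

```python
import subprocess


def build_command(image, out_base, lang="Latin", debug_level=2):
    """Build a tesseract command line with stopper debugging enabled."""
    return [
        "tesseract", image, out_base,
        "-l", lang,
        "-c", f"stopper_debug_level={debug_level}",
    ]


def run_with_debug(image, out_base, lang="Latin"):
    """Run tesseract and return whatever it printed to stderr (assumed debug channel)."""
    result = subprocess.run(
        build_command(image, out_base, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # keep going if the console encoding mangles umlauts
        check=False,
    )
    return result.stderr
```

Calling `run_with_debug("test/1502442621178.png", "test/1502442621178")` would then return the debug text for the second example image, provided the file exists locally.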

TheSeiko · 2017-08-14T14:54:31Z

Best choice certainty=-2.96366, space=-0.206939, scaled=-20.7456, final=-20.7456
: Pöllan : R=11.9817, C=-2.96366, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str P ö l l a n
state: 1 1 1 1 1 1
C -0.237 -2.964 -0.232 -0.209 -0.192 -0.199
Deleting word with certainty -20.7456
: Pöllan : R=11.9817, C=-20.7456, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str P ö l l a n
state: 1 1 1 1 1 1
C -0.237 -2.964 -0.232 -0.209 -0.192 -0.199

test.txt

TheSeiko · 2017-08-14T15:00:31Z

Best choice certainty=-0.106016, space=-2.97242, scaled=-20.8069, final=-20.8069
: USA : R=2.24336, C=-0.106016, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM
str U S A
state: 1 1 1
C -0.090 -0.086 -0.106
Best choice certainty=-0.229416, space=-2.97242, scaled=-20.8069, final=-20.8069
: /NORDKOREA : R=16.2872, C=-0.229416, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM NORM NORM NORM NORM
str / N O R D K O R E A
state: 1 1 1 1 1 1 1 1 1 1
C -0.223 -0.202 -0.207 -0.201 -0.192 -0.208 -0.197 -0.193 -0.229 -0.228
Deleting word with certainty -20.8069
: /NORDKOREA : R=16.2872, C=-20.8069, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM NORM NORM NORM NORM
str / N O R D K O R E A
state: 1 1 1 1 1 1 1 1 1 1
C -0.223 -0.202 -0.207 -0.201 -0.192 -0.208 -0.197 -0.193 -0.229 -0.228

output_1502433849760_1.txt

amitdo · 2017-08-14T16:04:32Z

So it does recognize them, but still decides to drop them...
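For a larger batch, the dropped words can be pulled out of the debug log automatically. The following is a minimal sketch that assumes the log keeps the layout shown above, with a "Deleting word with certainty" line immediately followed by the ": WORD : R=..." line; the function name `deleted_words` is hypothetical.

```python
import re

# Matches lines such as ": Pöllan : R=11.9817, C=-20.7456, F=1, Perm=2, ..."
WORD_LINE = re.compile(r"^:\s*(.+?)\s*:\s*R=")
# Matches lines such as "Deleting word with certainty -20.7456"
DELETE_LINE = re.compile(r"^Deleting word with certainty\s+(-?[\d.]+)")


def deleted_words(log_text):
    """Return (word, certainty) pairs for every word the log reports as deleted."""
    deleted = []
    pending = None  # certainty from a "Deleting word" line awaiting its word line
    for raw in log_text.splitlines():
        line = raw.strip()
        hit = DELETE_LINE.match(line)
        if hit:
            pending = float(hit.group(1))
            continue
        if pending is not None:
            word = WORD_LINE.match(line)
            if word:
                deleted.append((word.group(1), pending))
            pending = None
    return deleted
```

Words that only appear after a "Best choice certainty" line, such as 'USA', are not reported, because no deletion line precedes them.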

amitdo · 2017-08-14T18:02:10Z

'Pöllan' is dropped because it's not in the dictionary and the 'ö' has low certainty.
'/NORDKOREA' is dropped because it's not in the dictionary and has low space certainty.

amitdo · 2017-08-14T18:20:12Z

'USA' shares the same low space certainty with '/NORDKOREA' but escapes from punishment because it's in the dictionary.
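The logged numbers are consistent with a simple rule: take the lower of the word certainty and the space certainty, then multiply by 7. The factor is inferred from the three log entries above, not quoted from the source code, and the function name `scaled_certainty` is hypothetical. The cutoff values Tesseract applies to dictionary and non-dictionary words do not appear in the log, so they are left out of this sketch.

```python
def scaled_certainty(word_certainty, space_certainty, scale=7.0):
    """Lower of the two certainties, multiplied by the scale factor inferred from the log."""
    return min(word_certainty, space_certainty) * scale


# (word, best-choice certainty, space certainty, "scaled" value printed in the log)
logged = [
    ("Pöllan", -2.96366, -0.206939, -20.7456),
    ("USA", -0.106016, -2.97242, -20.8069),
    ("/NORDKOREA", -0.229416, -2.97242, -20.8069),
]

for word, cert, space, expected in logged:
    value = scaled_certainty(cert, space)
    # True when the computed value agrees with the log to its printed precision.
    print(word, round(value, 4), abs(value - expected) < 1e-3)
```

For 'Pöllan' the limiting value is the certainty of the 'ö' (-2.964); for 'USA' and '/NORDKOREA' it is the shared space certainty (-2.972), which matches the explanation above.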

Shreeshrii · 2017-08-15T04:25:06Z

@TheSeiko Please try with the changes suggested in #681 (comment) to see if you get improved recognition of these words without impacting others.

amitdo · 2017-08-15T11:46:38Z

After applying my patch:
'/NORDKOREA' ('USA /NORDKOREA')
and
'Pöllan'

are recognized in the final text output.

All other words are recognized the same as before.

TheSeiko · 2017-08-17T06:57:21Z

Thank you both, your help is much appreciated! I'm on holiday until the end of next week. After that I'll try to compile a Windows version with the changes you suggested and test it.

amitdo · 2017-09-13T11:03:20Z

> I'll try to compile a windows version with the changes you suggested and test it.

Did you try my suggestion?

TheSeiko · 2017-09-13T11:30:08Z

I'm still on it.
It is taking me some time because I'm used to Java and last used C/C++ at university, which was quite some time ago. Other projects also took up my time.
I've already installed Visual Studio 2017 and the Git client.
I'll keep you updated.

TheSeiko · 2017-09-13T15:28:48Z

I've been able to compile it now and am starting a test run against 50k frames.

TheSeiko · 2017-09-13T15:55:38Z

Looks good.

TheSeiko · 2017-09-13T16:16:11Z

I also tried the 64-bit build. There I get some errors, but I don't think the fix is the cause:

E:\Tesseract-OCR4.0ab1>tesseract test/1502442621178.png stdout --oem 1 -l Latin
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 726
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made

x86 output:
E:\Tesseract-OCR4.0ab1>tesseract test/1502442621178.png stdout --oem 1 -l Latin
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 726
PATERNION

21-Jähriger geriet mit seinem
Auto in Pöllan zu weit in die
Fahrbahnmitte. Er rammte
das Auto eines 62-Jährigen.

Der Beifahrer des 21-Jährigen
wurde schwer verletzt.

I'll try to find the error and keep you updated.

amitdo · 2017-09-13T16:37:48Z

That one is related to image processing. It looks like a bug in the (Windows?) 64-bit environment.

Please open a new issue for that.

Shreeshrii · 2018-02-21T09:18:09Z

@zdenop Please close this issue.

The words in https://user-images.githubusercontent.com/30631253/29272525-b9762536-8100-11e7-8d82-0961dd49663b.png, mentioned above in #1080 (comment), are no longer dropped now that the patch from @amitdo has been merged.

PATERNION

21-Jähriger geriet mit seinem
Auto in Pöllan zu weit in die
Fahrbahnmitte. Er rammte
das Auto eines 62-Jährigen.

Der Beifahrer des 21-Jährigen
wurde schwer verletzt.

amitdo · 2021-03-19T13:28:09Z

Fixed in #1264.

amitdo mentioned this issue Aug 15, 2017

LSTM: Words dropped during recognition #681

Closed

zdenop closed this as completed Sep 20, 2018

amitdo mentioned this issue Sep 29, 2018

Define the / sign as a word delimiter. #1221

Open
