Latin.traineddata(best) - Words missing in OCR #1080

Closed
TheSeiko opened this issue Aug 14, 2017 · 25 comments

@TheSeiko

TheSeiko commented Aug 14, 2017

Environment

  • Tesseract Version: tesseract-ocr-setup-4.0.0-alpha.20170804.exe - Latin.traineddata from best
  • Platform: Win10 64bit

Current Behavior:

Tesseract skips words when doing OCR.

OCR-Result: USA

Expected Behavior:

OCR-Result: USA/NORDKOREA

Suggested Fix:

Do not skip words.


Today I saw several times that Tesseract skips words, sometimes in the middle of a paragraph.

For example: tesseract test/1502433849760_1.png test/1502433849760 -l Latin

1502433849760_imageprocessedwithmarks

1502433849760_1
1502433849760.txt

1502433849760_screen

@amitdo
Collaborator

amitdo commented Aug 14, 2017

Known problem for all langs/scripts
#681

@amitdo
Collaborator

amitdo commented Aug 14, 2017

In your case the problem is probably in the layout analysis stage.

@TheSeiko
Author

TheSeiko commented Aug 14, 2017

I've another example:

tesseract test/1502442621178.png test/1502442621178 -l Latin

Here a word in the middle of the sentence is skipped.

1502442621178_screen

1502442621178

1502442621178.txt

@amitdo
Collaborator

amitdo commented Aug 14, 2017

Try cutting the non-text areas with gimp and retest.

@TheSeiko
Author

TheSeiko commented Aug 14, 2017

Only the red areas are processed by Tesseract; they are written to separate PNG files. (I've also uploaded the separate files and the resulting output files.)

@TheSeiko
Author

The interesting thing is, two similar frames were processed without problems:

1502434118849

1502431789149

@amitdo
Collaborator

amitdo commented Aug 14, 2017

Here a word in the middle of the sentence is skipped.

Which word?

Might be related to #681 in this case.

@TheSeiko
Author

Pöllan - the ö exists in unicharset

Thank you for the link to #681

@amitdo
Collaborator

amitdo commented Aug 14, 2017

@TheSeiko
Author

TheSeiko commented Aug 14, 2017

Best choice certainty=-2.96366, space=-0.206939, scaled=-20.7456, final=-20.7456
: Pöllan : R=11.9817, C=-2.96366, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str P ö l l a n
state: 1 1 1 1 1 1
C -0.237 -2.964 -0.232 -0.209 -0.192 -0.199
Deleting word with certainty -20.7456
: Pöllan : R=11.9817, C=-20.7456, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str P ö l l a n
state: 1 1 1 1 1 1
C -0.237 -2.964 -0.232 -0.209 -0.192 -0.199

test.txt

@TheSeiko
Author

Best choice certainty=-0.106016, space=-2.97242, scaled=-20.8069, final=-20.8069
: USA : R=2.24336, C=-0.106016, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM
str U S A
state: 1 1 1
C -0.090 -0.086 -0.106
Best choice certainty=-0.229416, space=-2.97242, scaled=-20.8069, final=-20.8069
: /NORDKOREA : R=16.2872, C=-0.229416, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM NORM NORM NORM NORM
str / N O R D K O R E A
state: 1 1 1 1 1 1 1 1 1 1
C -0.223 -0.202 -0.207 -0.201 -0.192 -0.208 -0.197 -0.193 -0.229 -0.228
Deleting word with certainty -20.8069
: /NORDKOREA : R=16.2872, C=-20.8069, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM NORM NORM NORM NORM
str / N O R D K O R E A
state: 1 1 1 1 1 1 1 1 1 1
C -0.223 -0.202 -0.207 -0.201 -0.192 -0.208 -0.197 -0.193 -0.229 -0.228

output_1502433849760_1.txt
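
For longer debug logs, the dropped words can be pulled out automatically. The following is a minimal sketch that assumes the log keeps the layout shown above (a "Deleting word with certainty" line followed by the ": word : R=..." line); the function name `deleted_words` and the regular expressions are illustrative, not part of Tesseract.

```python
import re

# ": /NORDKOREA : R=16.2872, C=-20.8069, F=1, Perm=2, ..." -> word and Perm value
WORD_RE = re.compile(r"^: (?P<word>.+?) : R=.*?Perm=(?P<perm>\d+)")
# "Deleting word with certainty -20.8069"
DELETE_RE = re.compile(r"^Deleting word with certainty (?P<certainty>-?[\d.]+)")


def deleted_words(log_text):
    """Return (word, certainty) pairs for each 'Deleting word' entry."""
    results = []
    pending = None  # certainty from the most recent 'Deleting word' line
    for line in log_text.splitlines():
        line = line.strip()
        match = DELETE_RE.match(line)
        if match:
            pending = float(match.group("certainty"))
            continue
        match = WORD_RE.match(line)
        if match and pending is not None:
            results.append((match.group("word"), pending))
            pending = None
    return results


# Excerpt copied from the log above.
SAMPLE = """\
Best choice certainty=-0.106016, space=-2.97242, scaled=-20.8069, final=-20.8069
: USA : R=2.24336, C=-0.106016, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
Best choice certainty=-0.229416, space=-2.97242, scaled=-20.8069, final=-20.8069
: /NORDKOREA : R=16.2872, C=-0.229416, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
Deleting word with certainty -20.8069
: /NORDKOREA : R=16.2872, C=-20.8069, F=1, Perm=2, xht=[0,3.40282e+038], ambig=0
"""

if __name__ == "__main__":
    for word, certainty in deleted_words(SAMPLE):
        print(word, certainty)
```

Run against a saved log file, this gives a quick list of every word that was recognized and then discarded.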

@amitdo
Collaborator

amitdo commented Aug 14, 2017

So it does recognize them, but still decides to drop them...

@amitdo
Collaborator

amitdo commented Aug 14, 2017

'Pöllan' is dropped because it's not in the dictionary and the 'ö' has low certainty.
'/NORDKOREA' is dropped because it's not in the dictionary and has low space certainty.

@amitdo
Collaborator

amitdo commented Aug 14, 2017

'USA' shares the same low space certainty with '/NORDKOREA' but escapes from punishment because it's in the dictionary.
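
The numbers in the debug output above are consistent with the "scaled" value being the smaller of the character certainty and the space certainty, multiplied by 7. The sketch below only reproduces that arithmetic from the logged values; the factor 7 is inferred from the log rather than quoted from the Tesseract source, and the threshold at which a non-dictionary word is dropped is not shown in this thread, so it is not modelled here.

```python
# Factor inferred from the logged numbers (-2.96366 * 7 = -20.74562).
# This is an assumption, not a value quoted from the Tesseract source.
ASSUMED_SCALE = 7.0


def scaled_certainty(char_certainty, space_certainty, scale=ASSUMED_SCALE):
    """Scale the weaker of the two certainties, as the log values suggest."""
    return scale * min(char_certainty, space_certainty)


def limiting_factor(char_certainty, space_certainty):
    """Name which of the two certainties is the lower one."""
    return "character" if char_certainty < space_certainty else "space"


# (certainty, space) pairs copied from the "Best choice" lines in the log.
WORDS = {
    "Pöllan": (-2.96366, -0.206939),
    "USA": (-0.106016, -2.97242),
    "/NORDKOREA": (-0.229416, -2.97242),
}

if __name__ == "__main__":
    for word, (certainty, space) in WORDS.items():
        print(word,
              round(scaled_certainty(certainty, space), 4),
              limiting_factor(certainty, space))
```

With these inputs, 'Pöllan' is limited by the character certainty (the 'ö'), while 'USA' and '/NORDKOREA' are both limited by the same space certainty, which matches the explanation above.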

@Shreeshrii
Collaborator

@TheSeiko Please try with the changes suggested in #681 (comment) to see if you get improved recognition of these words without impacting others.

@amitdo
Collaborator

amitdo commented Aug 15, 2017

After applying my patch:
'/NORDKOREA' ('USA /NORDKOREA')
and
'Pöllan'

are recognized in the final text output.

All other words are recognized the same as before.
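
One way to check a claim like this across many frames is a word-level diff of the text output before and after the patch. This is a rough sketch using Python's standard `difflib`; the function name `word_changes` is hypothetical, and it assumes both outputs are available as plain text.

```python
import difflib


def word_changes(before_text, after_text):
    """Return (added, removed) word lists between two OCR outputs."""
    before = before_text.split()
    after = after_text.split()
    added, removed = [], []
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(before[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after[j1:j2])
    return added, removed


if __name__ == "__main__":
    # Text taken from the examples in this thread.
    added, removed = word_changes("USA", "USA /NORDKOREA")
    print("added:", added, "removed:", removed)
```

An empty `removed` list together with only the previously missing words in `added` would indicate that the patch did not change any other word.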

@TheSeiko
Author

Thank you both, your help is much appreciated! I'm on holiday until the end of next week; then I'll try to compile a windows version with the changes you suggested and test it.

@amitdo
Collaborator

amitdo commented Sep 13, 2017

I'll try to compile a windows version with the changes you suggested and test it.

Did you try my suggestion?

@TheSeiko
Author

I'm still on it.
It's taking me some time because I'm used to Java and only used C/C++/... at university, which was quite some time ago. Some other projects also took up my time.
I've already installed Visual Studio 2017 and the Git client.
I'll keep you updated.

@TheSeiko
Author

I've been able to compile it now and am starting a test run against 50k frames.

@TheSeiko
Author

looks good

@TheSeiko
Author

I also tried the 64-bit build. There I get some errors, but I don't think the fix is the cause:

E:\Tesseract-OCR4.0ab1>tesseract test/1502442621178.png stdout --oem 1 -l Latin
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 726
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made
Error in pixCreateHeader: height must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixClipRectangle: pixd not made


x86 output:
E:\Tesseract-OCR4.0ab1>tesseract test/1502442621178.png stdout --oem 1 -l Latin
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 726
PATERNION

21-Jähriger geriet mit seinem
Auto in Pöllan zu weit in die
Fahrbahnmitte. Er rammte
das Auto eines 62-Jährigen.

Der Beifahrer des 21-Jährigen
wurde schwer verletzt.


I'll try to find the error and keep you updated.

@amitdo
Collaborator

amitdo commented Sep 13, 2017

That one is related to image processing. It seems to be a bug in the (Windows?) 64-bit environment.

Please open a new issue for that.

@Shreeshrii
Collaborator

@zdenop Please close this issue.

The words in https://user-images.githubusercontent.com/30631253/29272525-b9762536-8100-11e7-8d82-0961dd49663b.png, mentioned above in #1080 (comment), are no longer dropped after the patch from @amitdo was merged.

PATERNION

21-Jähriger geriet mit seinem
Auto in Pöllan zu weit in die
Fahrbahnmitte. Er rammte
das Auto eines 62-Jährigen.

Der Beifahrer des 21-Jährigen
wurde schwer verletzt.

@amitdo
Collaborator

amitdo commented Mar 19, 2021

Fixed in #1264.
