LSTM: Words dropped during Devanagari recognition #664

Closed
Shreeshrii opened this issue Jan 18, 2017 · 5 comments

Comments

@Shreeshrii
Collaborator

Shreeshrii commented Jan 18, 2017

Words are dropped during Devanagari recognition when the --oem 1 (LSTM) option is used.

This appears to be related to line segmentation or box creation, because the same words are also missing from the box file produced by running tesseract with the 'makebox' config file.

Please see the attached files:

  • image being OCRed,
  • image showing the box file skipping the words,
  • ground-truth file,
  • OCRed text, and
  • OCR evaluation report.

arabic-deva1

missing-words

arabic-deva1.txt

arabic-deva1-san.txt

arabic-deva1-san_report.html.txt

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 19, 2017

Another sample, where the whole first line is skipped in addition to the missing words:

forbes1849devscript.txt
forbes1849devscript-tif1-hin.txt

  • image
  • ground truth file
  • OCRed text with -l hin

Edit: the tif file was converted to png for uploading.

forbes1849devscript

@Shreeshrii
Collaborator Author

Is this related to #633 (comment)?

@Shreeshrii
Collaborator Author

It seems some words are being recognized as 'blanks'. See the following debug output, produced while processing the image shown in #664 (comment):

Processing word with lang hin at:Bounding box=(236,2830)->(1276,2924)
Trying word using lang hin, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result :       : R=50, C=-1, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM	NORM	NORM
str	 	 	 	 	 
state:	1 	1 	1 	1 	1 
C	-1.000	-1.000	-1.000	-1.000	-1.000

and

Processing word with lang hin at:Bounding box=(234,2248)->(1969,2326)
Trying word using lang hin, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : मम : R=0.947715, C=-1.37049, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM
str	म	म
state:	1 	1 
C	-0.086	-0.089
Best choice: accepted=1, adaptable=0, done=1 : Lang result :              : R=120, C=-1, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM	NORM
str	 	 	 	 	 	 	 	 	 	 	 	 
state:	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 	1 
C	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000
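To find every affected region in a longer debug dump, the log can be scanned for results whose recognized text is only whitespace. The following is a minimal sketch, assuming the log keeps the "Bounding box=" and "Lang result :" line shapes shown above; the function name and regular expressions are illustrative and not part of tesseract.

```python
import re

# Patterns are assumptions based only on the debug lines quoted in this issue.
BBOX_RE = re.compile(r"Bounding box=\((\d+),(\d+)\)->\((\d+),(\d+)\)")
RESULT_RE = re.compile(r"Lang result :(.*?): R=")


def find_blank_results(debug_text):
    """Return the bounding boxes whose 'Lang result' text is only whitespace."""
    blanks = []
    bbox = None
    for line in debug_text.splitlines():
        box_match = BBOX_RE.search(line)
        if box_match:
            # Remember the most recent bounding box; results that follow belong to it.
            bbox = tuple(int(g) for g in box_match.groups())
            continue
        result_match = RESULT_RE.search(line)
        if result_match and result_match.group(1).strip() == "":
            blanks.append(bbox)
    return blanks


SAMPLE = """Processing word with lang hin at:Bounding box=(236,2830)->(1276,2924)
Trying word using lang hin, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result :       : R=50, C=-1, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
Processing word with lang hin at:Bounding box=(234,2248)->(1969,2326)
Trying word using lang hin, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : मम : R=0.947715, C=-1.37049, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
Best choice: accepted=1, adaptable=0, done=1 : Lang result :              : R=120, C=-1, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
"""

if __name__ == "__main__":
    for box in find_blank_results(SAMPLE):
        print("blank result in box", box)
```

On the two excerpts above this reports both bounding boxes, since each contains a result made up entirely of blanks.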

@Shreeshrii
Collaborator Author

Closing this and linking to issue #681.

@Shreeshrii
Collaborator Author

Using the image linked above: https://cloud.githubusercontent.com/assets/5095331/22055988/c65e0f96-dd83-11e6-9f06-bea70dd85be6.png

Best choice certainty=-3.09489, space=-0.195364, scaled=-21.6642, final=-21.6642
 : शूण्वन्तु : R=13.0464, C=-3.09489, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM    NORM    NORM
str     शू      ण्      व       न्      तु
state:  1       1       1       1       1
C       -3.095  -0.260  -0.219  -0.298  -0.195
Deleting word with certainty -21.6642
 : शूण्वन्तु : R=13.0464, C=-21.6642, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM    NORM    NORM
str     शू      ण्      व       न्      तु
state:  1       1       1       1       1
C       -3.095  -0.260  -0.219  -0.298  -0.195
Best choice certainty=-1.30628, space=-0.195364, scaled=-9.14397, final=-9.14397
 : क्रषय: : R=5.64425, C=-1.30628, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM    NORM    NORM    NORM
str     क्      र       ष       य       :
state:  1       1       1       1       1
C       -1.306  -0.262  -0.229  -0.195  -0.204

The correct words are

शृण्वन्तु 
श ृ ण् व न् तु 

and

ऋषयः 
ऋ ष य ः
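The "Best choice certainty=..." lines above can be checked numerically. The sketch below pulls out the certainty and scaled values and prints their ratio; in both quoted lines the scaled value comes out at about 7 times the raw certainty. This is only an observation from these two log lines, not a statement about how tesseract computes the value, and the function name and regular expression are illustrative.

```python
import re

# Pattern is an assumption based only on the two log lines quoted in this issue.
BEST_RE = re.compile(
    r"Best choice certainty=(-?[\d.]+), space=(-?[\d.]+), "
    r"scaled=(-?[\d.]+), final=(-?[\d.]+)"
)


def scale_ratios(debug_text):
    """Return scaled/certainty for each 'Best choice certainty' line, rounded to 3 places."""
    ratios = []
    for match in BEST_RE.finditer(debug_text):
        certainty, _space, scaled, _final = (float(g) for g in match.groups())
        ratios.append(round(scaled / certainty, 3))
    return ratios


LOG = """Best choice certainty=-3.09489, space=-0.195364, scaled=-21.6642, final=-21.6642
Deleting word with certainty -21.6642
Best choice certainty=-1.30628, space=-0.195364, scaled=-9.14397, final=-9.14397
"""

if __name__ == "__main__":
    print(scale_ratios(LOG))
```

In the log, the word that gets deleted is the one with the lower scaled certainty (-21.6642), which is also the one whose first character was misrecognized with a per-character certainty of -3.095.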
