LSTM: Training - explicit viraama not recognized correctly #604

Open
Shreeshrii opened this issue Dec 23, 2016 · 2 comments
Comments

@Shreeshrii
Collaborator

Shreeshrii commented Dec 23, 2016

In Devanagari script, a virama (also transliterated viraama; U+094D) suppresses the inherent vowel of a consonant. When the virama is followed by another consonant, the two consonants form a conjunct. Depending on the font, the conjunct is rendered either as a single ligature glyph or with the explicit virama sign. Sometimes the font has a glyph for the conjunct but the user still wants the explicit virama. ZWNJ (U+200C) and ZWJ (U+200D) are used in various Indic scripts to control this: in Devanagari, ZWNJ after the virama requests the explicit virama form, and ZWJ requests the half-form of the preceding consonant.
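To make the three encodings concrete, here is a minimal sketch that builds them for one consonant pair and lists the code points. The choice of KA and SSA and the helper name `describe` are illustrative assumptions, not taken from the attached training text.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Example consonants chosen only for illustration.
KA = "\u0915"
SSA = "\u0937"

sequences = {
    "conjunct (font may use a ligature)": KA + VIRAMA + SSA,
    "explicit virama requested": KA + VIRAMA + ZWNJ + SSA,
    "half-form requested": KA + VIRAMA + ZWJ + SSA,
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, seq in sequences.items():
        print(label)
        for line in describe(seq):
            print("   ", line)
```

How each sequence actually renders still depends on the font and the shaping engine; the sketch only shows what is stored in the text.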

Tesseract outputs the virama sign when it occurs at the end of a word (followed by a space), but not when it is followed by another consonant.
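One way to quantify this is to count the two virama positions separately in the ground truth and in the OCR output and compare the numbers. The following is a rough sketch under simplifying assumptions: the consonant range covers only U+0915 to U+0939 and the precomposed nukta consonants U+0958 to U+095F, and "word-final" means followed by whitespace or end of string (punctuation such as the danda is not handled). The function name `count_viramas` is hypothetical.

```python
import re

VIRAMA = "\u094D"
# Simplified consonant class: core consonants plus precomposed nukta forms.
CONSONANT = "[\u0915-\u0939\u0958-\u095F]"

# Virama, optionally followed by ZWNJ/ZWJ, then a consonant.
MID_WORD = re.compile(VIRAMA + "[\u200C\u200D]?" + "(?=" + CONSONANT + ")")
# Virama followed by whitespace or end of string.
WORD_FINAL = re.compile(VIRAMA + r"(?=\s|$)")


def count_viramas(text):
    """Count viramas before a consonant and viramas at the end of a word."""
    return {
        "before_consonant": len(MID_WORD.findall(text)),
        "word_final": len(WORD_FINAL.findall(text)),
    }
```

If the `before_consonant` count in the OCR output is consistently lower than in the ground truth while `word_final` matches, that would be consistent with the behaviour described above.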

Attached text file and associated box/tiff pairs in different fonts can be used for testing/training this feature.

I tried fine-tuning the LSTM model for this, but training reports a number of errors such as:
Encoding of string failed!
Can't encode transcription:
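One possible cause, which is an assumption and not confirmed in this issue, is that the training text contains code points (for example ZWNJ) that the model's character set cannot represent. A minimal sketch for checking this follows; `known_units` is a hypothetical list of the character-set entries obtained by whatever means is available, and the check is only a necessary condition, since entries may be multi-code-point sequences.

```python
import unicodedata


def missing_code_points(text, known_units):
    """List code points in text that appear in none of the known units.

    known_units is an assumed input: an iterable of strings, each one
    entry of the character set the model can encode.
    """
    known = {ch for unit in known_units for ch in unit}
    missing = sorted({ch for ch in text if not ch.isspace() and ch not in known})
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in missing]


if __name__ == "__main__":
    # Toy character set for illustration only.
    units = ["\u0915", "\u0937", "\u094D"]
    print(missing_code_points("\u0915\u094D\u200C\u0937", units))
```

An empty result does not guarantee that every line can be encoded, but a non-empty one points at code points worth inspecting first.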

san.training_text_viraam.txt

san.viraama.box-tiff.zip

san shree-dv0726-ot exp0

@Shreeshrii
Collaborator Author

Some sample images of real-life examples: Hindi and Sanskrit text with an explicit virama followed by a consonant.

bg-hin-san010
bg-hin-san012

bg-hin-san003
bg-hin-san006

@Shreeshrii
Collaborator Author

Please see Section 12.1 of http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf for a description of the virama.
