Updated langdata #83
I don't know if Ray plans to update these files. Anyway, it seems that you can now extract the unicharset and the dawg files used by the new LSTM engine from the traineddata. |
Yes, using Then you can remove unwanted components, fix word lists and reverse the whole process to create your own new traineddata file. |
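The round trip described here (unpack the traineddata, edit the components, repack) can be sketched as a list of commands. This is only a sketch under assumptions, not necessarily the commands the commenter had in mind: it assumes the stock Tesseract training tools combine_tessdata, dawg2wordlist and wordlist2dawg are installed, and the helper name, the file names and the work/Arabic. prefix are hypothetical.

```python
def round_trip_commands(traineddata, prefix):
    """Build the command lines for an unpack / edit / repack round trip.

    traineddata: path to an existing .traineddata file (hypothetical name).
    prefix: output prefix ending in a dot, e.g. "work/Arabic."; the tools
            append component names such as "lstm-unicharset" to it.
    """
    unicharset = prefix + "lstm-unicharset"
    word_dawg = prefix + "lstm-word-dawg"
    wordlist = prefix + "wordlist.txt"  # plain-text word list to edit by hand
    return [
        # 1. Unpack every component of the traineddata next to the prefix.
        ["combine_tessdata", "-u", traineddata, prefix],
        # 2. Turn the LSTM word dawg back into an editable word list.
        ["dawg2wordlist", unicharset, word_dawg, wordlist],
        # 3. After editing, rebuild the dawg from the word list.
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # 4. Recombine the components into <prefix>traineddata.
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    for cmd in round_trip_commands("Arabic.traineddata", "work/Arabic."):
        print(" ".join(cmd))
```

Each entry can be passed to subprocess.run(cmd, check=True) once the tools are on the PATH. For right-to-left languages wordlist2dawg may need its reverse-policy option, which this sketch does not cover.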
Actually, I'm trying to fine-tune, continuing from the new Arabic.traineddata, but the newly generated traineddata file can't match the accuracy of Arabic.traineddata. Also, the current unicharset in the langdata repo has around 2048 entries, but the one generated from the new traineddata has only around 300. Is that expected? |
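One way to look into a size gap like this is to compare the two unicharset files directly. Below is a minimal sketch, assuming the usual unicharset text layout (the first line holds the entry count, and each following line starts with the symbol, followed by its properties). The function names are made up for illustration.

```python
def read_unicharset(text):
    """Return (declared_count, symbols) parsed from unicharset file contents."""
    lines = text.splitlines()
    declared = int(lines[0].strip())  # first line: number of entries
    # The symbol is the first space-separated field of every entry line.
    symbols = [ln.split(" ", 1)[0] for ln in lines[1:] if ln.strip()]
    return declared, symbols


def missing_symbols(old_text, new_text):
    """Symbols present in the old unicharset but absent from the new one."""
    _, old = read_unicharset(old_text)
    _, new = read_unicharset(new_text)
    return sorted(set(old) - set(new))
```

Feeding it the contents of the langdata unicharset and of the one unpacked from the traineddata would list which symbols were dropped.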
Point taken. It needs updating. I was going to push until I discovered a bug with the RTL word lists. |
@theraysmith Is it ready for update now? |
@jbreiden Do you have the files to update this repo for 4.0.0? Alternatively, should we try to reverse-engineer the files from tessdata_fast? They will not be complete (config, wordlist, numbers, punc, unicharset). |
No, I don't. But I have been looking into this, and continue to do so. |
Hmm. Sorry. I thought I had done this in September. |
Thanks! Will the training process, tesstrain.sh and related scripts also need changes? |
Also, what about the possibility of training from scanned images? |
It is possible and seems to work pretty well, as I heard from @wrznr. |
@stweil Do you know how the box files for the scanned images were created? AFAIK, box files generated by tesseract makebox do not match the format of the files produced by text2image. |
|
On Wed, Mar 21, 2018 at 1:28 AM Shreeshrii ***@***.***> wrote:
> @theraysmith <https://github.com/theraysmith>
> 1. Since training depends on the fonts used, I suggest loading a file with fontlist used for training in every language and script in their own subdirectories. This file can then be referred to by tesstrain.sh/language_specific.sh.

Yes, I have a list of fonts used for each training, and can add that to the langdata.

> 2. Is it possible to use multiple languages to continue from for creating a 'script' type of traineddata by finetuning?

Unfortunately not. I did have an idea for a better multi-language implementation that would cleanly use models from multiple languages at once, but that depends on getting rid of the old code, and moving the multi-language functionality into the beam search. Until the old code is gone, that would be very messy.
…
--
Ray.
|
@Shreeshrii You're right, creating box files has been done by using an extra script. It is rather straightforward:
|
@wrznr's method is similar but easier than my proposal. |
Please share the script, if possible. I would like to test it for Indic/complex scripts. It will also be useful to many others who have been asking for this feature. You could create a PR to put it in https://github.com/tesseract-ocr/tesseract/tree/master/contrib Thanks! |
It won't work well for complex scripts like the Indic scripts. |
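For readers wondering what a box-generation script for scanned line images might look like, here is a minimal sketch. It is not the script referred to in this thread. It assumes a line-level box convention in which every character of the transcription gets the bounding box of the whole line image and a tab entry marks the end of the line; the function name and the exact coordinates of the tab entry are assumptions.

```python
import unicodedata


def line_boxes(text, width, height, page=0):
    """Build box-file content for one line image and its transcription.

    text: ground-truth transcription of the line.
    width, height: pixel size of the line image.
    """
    clusters = []
    for ch in text:
        # Attach combining marks to the preceding base character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    # Every cluster carries the box of the whole line.
    rows = [f"{c} 0 0 {width} {height} {page}" for c in clusters]
    # Assumed end-of-line marker: a tab entry just past the line box.
    rows.append(f"\t {width} {height} {width + 1} {height + 1} {page}")
    return "\n".join(rows) + "\n"
```

The grouping above relies on the canonical combining class only. Many Indic dependent vowel signs (for example U+093E) have combining class 0, so they would be split from their base consonant, which is one concrete reason a naive script like this falls short for complex scripts.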
@theraysmith
@jbreiden
Any update regarding this???
On Tue 20 Mar, 2018, 8:52 AM theraysmith, ***@***.***> wrote:
Hmm. Sorry. I thought I had done this in September.
The Google repo is up-to-date apart from the redundant files that need to
be deleted.
I'll work with Jeff to get this done.
|
@wrznr Thank you for the makefile for doing LSTM training from scratch. I will give it a try. Do you also have a variant for doing fine tuning or adding a layer? |
https://github.com/tesseract-ocr/langdata_lstm This issue can be closed. |
We need the updated langdata with the updated unicharset, especially for the Arabic language, to be able to maintain the same accuracy in the new .traineddata.
Thanks