feat: add "auto" language for TesseractOcr #759
Signed-off-by: Pavel Denisov <[email protected]>
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
We can merge this PR and implement the optimized version in a follow-up PR.
Sorry for the delay! I was going to check it in the next few days, but I can make a follow-up PR too. The problem with CI is that the script OCR models are not installed: https://github.com/DS4SD/docling/actions/runs/12806245234/job/35994648426?pr=759#step:8:155
…rs lazily Signed-off-by: Pavel Denisov <[email protected]>
Ubuntu package I'm going to add the check if |
Looks good to me
* Add "auto" language for TesseractOcr
  Signed-off-by: Pavel Denisov <[email protected]>
* Add tesseract-ocr-script-latn installation for the "auto" language
  Signed-off-by: Pavel Denisov <[email protected]>
* Modify "auto" language in TesseractOcr to initialize the script readers lazily
  Signed-off-by: Pavel Denisov <[email protected]>
* Finalize script readers
  Signed-off-by: Pavel Denisov <[email protected]>
* Fix script models prefix for Linux
  Signed-off-by: Pavel Denisov <[email protected]>
---------
Signed-off-by: Pavel Denisov <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Add a language-agnostic OCR option to the TesseractOcr module. It is invoked when the language option is set to `['auto']`. For more context, see the discussion in #640. Please let me know what you think.
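To illustrate the lazy initialization of script readers mentioned in the commits, here is a minimal sketch of the general pattern: a reader for a given script is created only the first time that script is detected, then reused, and all readers are released at the end. This is not the code from this PR. `LazyScriptReaders`, `make_reader`, and `release` are hypothetical names, the "reader" is a plain dictionary standing in for a real OCR handle, and script detection itself is assumed to happen elsewhere.

```python
from typing import Callable, Dict, Generic, List, TypeVar

R = TypeVar("R")


class LazyScriptReaders(Generic[R]):
    """Hypothetical cache of per-script OCR readers, created on first use."""

    def __init__(self, factory: Callable[[str], R], finalizer: Callable[[R], None]) -> None:
        self._factory = factory        # builds a reader for a script name
        self._finalizer = finalizer    # releases a reader's resources
        self._readers: Dict[str, R] = {}

    def get(self, script: str) -> R:
        # Create the reader only when this script is first requested.
        if script not in self._readers:
            self._readers[script] = self._factory(script)
        return self._readers[script]

    def close(self) -> None:
        # Release every reader that was actually created, then forget them.
        for reader in self._readers.values():
            self._finalizer(reader)
        self._readers.clear()


# Stand-ins for a real OCR backend (assumptions for illustration only).
created: List[str] = []


def make_reader(script: str) -> dict:
    created.append(script)
    return {"script": script, "open": True}


def release(reader: dict) -> None:
    reader["open"] = False


readers = LazyScriptReaders(make_reader, release)
for detected in ["Latin", "Cyrillic", "Latin"]:
    readers.get(detected)

print(created)  # ['Latin', 'Cyrillic']
readers.close()
```

The second "Latin" request reuses the cached reader, so only the scripts that actually occur in the input incur the setup cost.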