fix: PermissionError when using tesseract_ocr_cli_model #430
Make sure that `tesseract_ocr_cli_model.py` does not open the PNG image file twice (`tempfile.NamedTemporaryFile` + `high_res_image.save`), and ensure that `_run_tesseract` is executed only once the file is no longer held open by Python. Otherwise, this results in a "PermissionError: [Errno 13] Permission denied" error on Windows. Signed-off-by: Gaspard Petit <[email protected]>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Thanks @gaspardpetit for this fix. Here are a few notes:
* fixes for referencing drawing blip in wordx
  Signed-off-by: Maksym Lysak <[email protected]>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
  Signed-off-by: Maksym Lysak <[email protected]>
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
  Signed-off-by: Maksym Lysak <[email protected]>
* Updated lxml dependency version
  Signed-off-by: Maksym Lysak <[email protected]>
---------
Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>

* Update tests for docling-core 2.5.0
  Signed-off-by: Christoph Auer <[email protected]>
* Add export with referenced images to export_figures example
  Signed-off-by: Christoph Auer <[email protected]>
* Fix OCR tests
  Signed-off-by: Christoph Auer <[email protected]>
* Revert "Fix OCR tests" This reverts commit 12b5759.
  Signed-off-by: Christoph Auer <[email protected]>
* Update lockfile for docling-core 2.5.1
  Signed-off-by: Christoph Auer <[email protected]>
---------
Signed-off-by: Christoph Auer <[email protected]>

* fix image index in word backend
  Signed-off-by: Manuel030 <[email protected]>
* fix: Fixes for wordx (DS4SD#432)
* fixes for referencing drawing blip in wordx
  Signed-off-by: Maksym Lysak <[email protected]>
* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.
  Signed-off-by: Maksym Lysak <[email protected]>
* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml
  Signed-off-by: Maksym Lysak <[email protected]>
* Updated lxml dependency version
  Signed-off-by: Maksym Lysak <[email protected]>
---------
Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>
Signed-off-by: Manuel030 <[email protected]>
* sign dco
  Signed-off-by: Manuel030 <[email protected]>
* correct rebase error
  Signed-off-by: Manuel030 <[email protected]>
---------
Signed-off-by: Manuel030 <[email protected]>
Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maxim Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>

* adding rapidocr engine for ocr in docling
  Signed-off-by: swayam-singhal <[email protected]>
* fixing styling format
  Signed-off-by: Swaymaw <[email protected]>
* updating pyproject.toml and poetry.lock to fix ci bugs
  Signed-off-by: Swaymaw <[email protected]>
* help poetry pinning for python3.9
  Signed-off-by: Michele Dolfi <[email protected]>
* simplifying rapidocr options so that device can be changed using a single option for all models
  Signed-off-by: Swaymaw <[email protected]>
* fix styling issues and small bug in rapidOcrOptions
  Signed-off-by: Swaymaw <[email protected]>
* use default device until we enable global management
  Signed-off-by: Michele Dolfi <[email protected]>
---------
Signed-off-by: swayam-singhal <[email protected]>
Signed-off-by: Swaymaw <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Co-authored-by: swayam-singhal <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
…s://github.com/gaspardpetit/docling into gaspardpetit-fix-permission-error-tesseractcli
Thank you for reviewing this pull request. I have fixed the import order according to the pre-commit linter. I disagree with using With Let me know if you would still prefer to use
I think we both want to achieve the same, but we interpret the Let me elaborate my understanding of it.
My understanding of the As posted initially, this is just an understanding of the documentation, and I don't have a way to test it on Windows. If you want to give it a try and see if it matches the expectations above, we could simplify the fix and achieve what we both want.
Note: the CI is currently reporting
Hi! Thank you for reviewing this. You are right, I might have merged incorrectly - I will probably end up submitting again from a clean branch. Just to confirm that I understand correctly - using
Good catch. Sorry, I didn't notice it before. Then it is definitely a no-go. Then I fully support your solution with
Brilliant, I'll resubmit as a clean branch in a couple of hours then, thanks!
Replaced by clean merge request #496
`tesseract_ocr_cli_model.py` does not open the PNG image file twice (before, it was opened once with `tempfile.NamedTemporaryFile` and again by passing the file name rather than the file object itself to `high_res_image.save`); `_run_tesseract` is executed after the file is no longer open by Python. This otherwise results in a "PermissionError: [Errno 13] Permission denied" error on Windows.
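The pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: it writes plain bytes where the PR saves a PIL image via `high_res_image.save`, and the names `with_closed_temp_file` and `read_size` are hypothetical stand-ins (`read_size` plays the role of `_run_tesseract`, i.e. a consumer that reopens the file by name).

```python
import os
import tempfile
from typing import Callable


def with_closed_temp_file(
    data: bytes, consumer: Callable[[str], int], suffix: str = ".png"
) -> int:
    """Write data to a temp file, close it, then hand its path to consumer.

    Hypothetical helper for illustration only.
    """
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, mode="w+b", delete=False) as tmp:
        # Write through the already-open handle instead of reopening by name.
        tmp.write(data)
        path = tmp.name

    # The handle is closed at this point, so another process (or a second
    # open() call) can access the file on Windows without a PermissionError.
    try:
        return consumer(path)
    finally:
        # Manual cleanup is needed because delete=False was used.
        if os.path.exists(path):
            os.remove(path)


def read_size(path: str) -> int:
    """Stand-in for an external tool: reopens the file by name."""
    with open(path, "rb") as f:
        return len(f.read())
```

Calling `with_closed_temp_file(b"abc", read_size)` writes three bytes, closes the handle, lets the consumer reopen the file by name, and removes the file afterwards. The trade-off of `delete=False` is that cleanup becomes the caller's responsibility, which is why the removal sits in a `finally` block.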