Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: PermissionError when using tesseract_ocr_cli_model #430

Conversation

gaspardpetit
Copy link
Contributor

  • Make sure that the tesseract_ocr_cli_model.py does not open the png image file twice (before it was opened once with tempfile.NamedTemporaryFile and again by passing the file name rather than the file object itself to high_res_image.save;
  • Ensure that _run_tesseract is executed after the file is no longer open by python. This otherwise results in a "PermissionError: [Errno 13] Permission denied" error on Windows.

Make sure that the `tesseract_ocr_cli_model.py` does not open the png image file twice (`tempfile.NamedTemporaryFile` + `high_res_image.save`), and ensure that `_run_tesseract` is executed once the file is no longer open by python. This other results in a "PermissionError: [Errno 13] Permission denied" error on Windows.

Signed-off-by: Gaspard Petit <[email protected]>
Copy link

mergify bot commented Nov 25, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:

@dolfim-ibm dolfim-ibm self-requested a review November 25, 2024 14:53
@dolfim-ibm
Copy link
Contributor

Thanks @gaspardpetit for this fix. Here a few notes

  1. Please review the test failures. I think you are missing the pre-commit styling, which could be fixed with pre-commit run --all-files
  2. I was reading a bit more about the issue in the python docs
    • it would be better to use delete_on_close=False and avoid the extra try/finally. Can you please give it a try?

maxmnemonic and others added 14 commits November 26, 2024 14:44
* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <[email protected]>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <[email protected]>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <[email protected]>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <[email protected]>

---------

Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>
* Update tests for docling-core 2.5.0

Signed-off-by: Christoph Auer <[email protected]>

* Add export with referenced images to export_figures example

Signed-off-by: Christoph Auer <[email protected]>

* Fix OCR tests

Signed-off-by: Christoph Auer <[email protected]>

* Revert "Fix OCR tests"

This reverts commit 12b5759.

Signed-off-by: Christoph Auer <[email protected]>

* Update lockfile for docling-core 2.5.1

Signed-off-by: Christoph Auer <[email protected]>

---------

Signed-off-by: Christoph Auer <[email protected]>
* fix image index in word backend

Signed-off-by: Manuel030 <[email protected]>

* fix: Fixes for wordx (DS4SD#432)

* fixes for referencing drawing blip in wordx

Signed-off-by: Maksym Lysak <[email protected]>

* Added safety try-except when trying to load pillow image from a docx blob. Added explicit dependency on lxml.

Signed-off-by: Maksym Lysak <[email protected]>

* Added test for word file with embedded emf images, re-generated full tests for docx, eased up dependency on lxml

Signed-off-by: Maksym Lysak <[email protected]>

* Updated lxml dependency version

Signed-off-by: Maksym Lysak <[email protected]>

---------

Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>
Signed-off-by: Manuel030 <[email protected]>

* sign dco

Signed-off-by: Manuel030 <[email protected]>

* correct rebase error

Signed-off-by: Manuel030 <[email protected]>

---------

Signed-off-by: Manuel030 <[email protected]>
Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maxim Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>
* adding rapidocr engine for ocr in docling

Signed-off-by: swayam-singhal <[email protected]>

* fixing styling format

Signed-off-by: Swaymaw <[email protected]>

* updating pyproject.toml and poetry.lock to fix ci bugs

Signed-off-by: Swaymaw <[email protected]>

* help poetry pinning for python3.9

Signed-off-by: Michele Dolfi <[email protected]>

* simplifying rapidocr options so that device can be changed using a single option for all models

Signed-off-by: Swaymaw <[email protected]>

* fix styling issues and small bug in rapidOcrOptions

Signed-off-by: Swaymaw <[email protected]>

* use default device until we enable global management

Signed-off-by: Michele Dolfi <[email protected]>

---------

Signed-off-by: swayam-singhal <[email protected]>
Signed-off-by: Swaymaw <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Co-authored-by: swayam-singhal <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Make sure that the `tesseract_ocr_cli_model.py` does not open the png image file twice (`tempfile.NamedTemporaryFile` + `high_res_image.save`), and ensure that `_run_tesseract` is executed once the file is no longer open by python. This other results in a "PermissionError: [Errno 13] Permission denied" error on Windows.

Signed-off-by: Gaspard Petit <[email protected]>
@gaspardpetit
Copy link
Contributor Author

Thank you for reviewing this pull request. I have fixed the import order according the the pre-commit linter.

I disagree with using delete_on_close=False (with delete=True and without the try/finally blocks):

With delete_on_close=False, temporary files will survive until the script terminates. On a script being used to convert a large number of pdf files, each containing potentially hundreds of pages, this would resulting thousand of temporary image files being deleted only on termination. Within containers with limited storage, this could be a huge disadvantage. It may also make the termination of the script appear to hang and possibly timeout in some scenarios, leading to the process being killed and the files leaking.

Let me know if you would still prefer to use delete_on_close=False and I will amend my pull request accordingly.

@dolfim-ibm
Copy link
Contributor

I think we both want to achieve the same, but we interpret the tempfile library documentation in different ways 😉

Let me elaborate my understanding of it.

  • When the file is used within a with context. The argument delete=True is doing exactly what you propose with the finally:. At the end of the context, the file should be removed.
  • Some operating system (I think mostly Windows) have special handlers which would delete the file, when the .close() method is called on it, even if still within the with statement. This is what I think is causing the error you reported.

My understanding of the tempfile options is that using delete_on_close=False and delete=True (within a with: context) would achieve what we both want: 1) keep the file alive until the OCR is done, 2) delete it as soon it is no longer needed.

As posted initially, this is just an understanding of the documentation, and I don't have a way to test it on Windows. If you want to give it a try and see it is matching the expectations above, we could simplify the fix and achieve what we both want.

@dolfim-ibm
Copy link
Contributor

Note: the CI is currently reporting

  1. one of your commit is missing the DCO sign-off. Can you please fix it with git rebase HEAD~15 --signoff?
  2. the diff currently looks weird. I'm wondering if something went wrong in merging with the latest main branch.

@gaspardpetit
Copy link
Contributor Author

Hi! Thank you for reviewing this.

You are right, I might have merged incorrectly - I will probably end up submitting again from a clean branch.

Just to confirm that I understand correctly - using delete_on_close will also bump the python requirements to 3.12 and I think the docling currently supports 3.9. Do we want that?

@dolfim-ibm
Copy link
Contributor

using delete_on_close will also bump the python requirements to 3.12

Good catch. Sorry, I didn't notice it before. Then it is definitely a no-go.

Then I fully support your solution with delete=False.

@gaspardpetit
Copy link
Contributor Author

Brilliant, I'll resubmit as a clean branch in a couple of hours then, thank !

@gaspardpetit
Copy link
Contributor Author

Replaced by clean merge request #496

@gaspardpetit gaspardpetit deleted the gaspardpetit-fix-permission-error-tesseractcli branch December 3, 2024 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants