Feature Request: Do not modify files that had no OCR done #130

Closed
doppelgrau opened this issue Jun 25, 2022 · 6 comments · Fixed by #160

@doppelgrau

Hello,

I just noticed that files for which no additional OCR was done are still shown as changed, with a new version, in Nextcloud.
In my view it would be nicer if the file were left unmodified in that case (to avoid cluttering the file history and perhaps to preserve things like digital signatures).

I don't know whether that can be done easily, or whether there are other reasons to keep this behavior.

Thanks for your work so far.

@R0Wi
Contributor

R0Wi commented Jun 26, 2022

Hi @doppelgrau, I think the problem here is that the app just processes any PDF file matching your workflow settings and does not analyze the output. I'm also not sure whether there is a reliable way to detect if ocrmypdf added OCR information or not. @bahnwaerter, do you know if that is possible?

@doppelgrau
Author

doppelgrau commented Jun 26, 2022

I took a look at the documentation. If you run ocrmypdf without any of the options (--skip-text, --redo-ocr, --force-ocr), return code 6 indicates that the file already contains text:

:~$ ocrmypdf --skip-text input.pdf output.pdf
Scan: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 308.62page/s]
   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 3
WARNING -    1: [tesseract] lots of diacritics - possibly poor OCR                                                                                                                            
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.0/2.0 [00:01<00:00,  1.25page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
:~$ echo $?
0
:~$ ocrmypdf --skip-text output.pdf outpu2.pdf
Scan: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 193.35page/s]
   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: skipping all processing on this page                                                                                                                                          
   INFO -    2: skipping all processing on this page                                                                                                                                          
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.0/2.0 [00:00<00:00, 134.30page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: -0.0%
   INFO - Image optimization did not improve the file - discarded
   INFO - Output file is a PDF/A-2B (as expected)
:~$ echo $?
0
:~$ ocrmypdf output.pdf outpu2.pdf
Scan: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 132.82page/s]
   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 3
OCR:   0%|                                                                                                                                                        | 0.0/2.0 [00:00<?, ?page/s]
  ERROR - PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR)
:~$ echo $?
6
:~$ 

So there is only an easy way if this is combined with #129, which would allow running with none of these options.
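The return-code idea can be sketched as follows. This is only an illustration, not code from the app: the function names and the "store" / "skip" / "error" labels are made up, and it assumes ocrmypdf is on the PATH and is called without --skip-text, --redo-ocr or --force-ocr, as in the last run of the transcript.

```python
import subprocess

# Exit code 6 is what the transcript above shows for PriorOcrFoundError.
ALREADY_HAS_TEXT = 6


def classify_exit_code(code: int) -> str:
    """Map an ocrmypdf exit code to an action (labels are hypothetical)."""
    if code == 0:
        return "store"  # OCR ran; save the output as a new file version
    if code == ALREADY_HAS_TEXT:
        return "skip"   # input already had text; leave the file untouched
    return "error"      # anything else is treated as a failure


def run_ocr(input_pdf: str, output_pdf: str) -> str:
    # No --skip-text / --redo-ocr / --force-ocr, so existing text aborts the run.
    result = subprocess.run(["ocrmypdf", input_pdf, output_pdf])
    return classify_exit_code(result.returncode)
```

Keeping the classification separate from the subprocess call means the decision logic can be checked without having ocrmypdf installed.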

@R0Wi
Contributor

R0Wi commented Jun 27, 2022

Interesting, thanks for your efforts. Explicitly checking for exit code 6 and then skipping the creation of a new document is not a big deal, and introducing a "none" option in #129 is perfectly doable. But to be honest, I'm not really happy with that solution, since the behaviour would then change depending on the workflow settings you choose. It would be much cleaner to check the output file for, let's say, some "OCR layer" and use that as the condition.

I'll ping @bahnwaerter and we'll report our results here. Thanks for your patience.

@doppelgrau
Author

I agree, checking whether something has changed would be nicer. But I found no nice way to do that with ocrmypdf, so the "return code 6 solution" might be a compromise.

With --skip-text, parsing the output might be possible (if a "skipping all processing on this page" line is printed for every page ...), but that seems ugly and risks breaking when ocrmypdf or tesseract is updated.
A bit more resource-intensive, but more robust, might be extracting the text before and after with something like pdftotext and checking whether it changed.
(For the "after" text, the --sidecar option could perhaps be used, if its output follows the same logic as pdftotext.)
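A minimal sketch of the before/after comparison, assuming the pdftotext command-line tool is installed. The helper names are hypothetical, and whitespace is collapsed before comparing on the assumption that pure layout differences should not count as a change.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    # "-" as the output file asks pdftotext to write the text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def normalize(text: str) -> str:
    # Collapse all whitespace runs (including page breaks) to single spaces.
    return " ".join(text.split())


def text_changed(before: str, after: str) -> bool:
    """True if the extracted text differs beyond whitespace."""
    return normalize(before) != normalize(after)
```

A caller would run `extract_text` on the input and on the OCR output and only store a new version when `text_changed` returns True.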

@R0Wi
Contributor

R0Wi commented Jun 27, 2022

Thanks @doppelgrau, the hint about the --sidecar option looks very promising. I did some quick tests, and it seems that ocrmypdf only writes content to the sidecar file if it actually performed OCR. Of course the file always contains text if we use --redo-ocr, but that seems correct to me.

So a solution could be to always use the --sidecar option with a temporary Nextcloud text file, and to check after OCR processing whether that file has content (file size > 0).
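The sidecar check could look roughly like this. It is a sketch under the assumption stated above (an empty sidecar file means no OCR was performed), which comes from quick tests and may not hold for every ocrmypdf version or option combination; the function names are hypothetical.

```python
import os
import subprocess
import tempfile


def sidecar_has_text(path: str) -> bool:
    # The "file size > 0" check proposed above.
    return os.path.getsize(path) > 0


def ocr_with_sidecar(input_pdf: str, output_pdf: str, mode: str = "--skip-text") -> bool:
    """Run ocrmypdf and report whether the sidecar file received any content."""
    fd, sidecar = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        subprocess.run(
            ["ocrmypdf", mode, "--sidecar", sidecar, input_pdf, output_pdf],
            check=True,
        )
        return sidecar_has_text(sidecar)
    finally:
        os.remove(sidecar)  # the sidecar is only needed for the check
```

If `ocr_with_sidecar` returns False, the caller would discard the output PDF instead of storing it as a new version.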

Just to pin this down for me:

@bahnwaerter
Collaborator

[...] i'm not sure if there's a reliable way to detect if ocrmypdf added some OCR information or not.
@bahnwaerter do you know if that is possible?

This is a very difficult task if it has to be solved outside of ocrmypdf as a simple pre-check. A major reason is that ocrmypdf supports multiple PDF renderers. Each renderer can embed the OCR information differently in the output PDF file, which makes detection inside this workflow app difficult.

One possibility, mentioned by @doppelgrau in a comment above, is to check ocrmypdf's return code. This can't be implemented as a pre-check, since a full OCR run has to be triggered to obtain a valid return code.

Another possibility, also mentioned by @doppelgrau above, is to parse and check the output produced with the --skip-text option. This isn't a great solution, since the parsing depends on the output format of the OCR backend used by ocrmypdf. In addition, we use ocrmypdf as an interface to OCR backends like tesseract, so we should not bypass that interface and rely on implementation details directly.

A third solution, proposed by @R0Wi in the previous comment, uses the --sidecar option to detect any OCR text in the specified PDF file. This detection can't be implemented as a pre-check either, since a full OCR run has to be performed while the OCR text is written to a temporary file.

I suggest implementing this new feature with the third solution, proposed by @R0Wi. It does not break the existing behavior of this workflow app as long as it is configurable through an optional workflow option. So I agree with @R0Wi's bullet points for implementing this requested feature.
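To illustrate the point about an optional workflow option: a sketch in which the new check is opt-in, so the existing behavior is unchanged by default. The setting name `keep_original_if_no_ocr` and both type and function names are hypothetical and not taken from the app.

```python
from dataclasses import dataclass


@dataclass
class WorkflowSettings:
    # Hypothetical option; off by default so existing workflows behave as before.
    keep_original_if_no_ocr: bool = False


def should_create_version(settings: WorkflowSettings, sidecar_text: str) -> bool:
    """Decide whether the OCR output should be stored as a new file version."""
    if not settings.keep_original_if_no_ocr:
        return True               # legacy behavior: always store the output
    return len(sidecar_text) > 0  # opt-in: store only if OCR produced text
```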

R0Wi added a commit that referenced this issue Sep 19, 2022
Only create new file version if OCR result was not empty

Signed-off-by: Robin Windey <[email protected]>
R0Wi added a commit that referenced this issue Oct 24, 2022
Only create new file version if OCR result was not empty

Signed-off-by: Robin Windey <[email protected]>
@R0Wi R0Wi linked a pull request Oct 24, 2022 that will close this issue
R0Wi added a commit that referenced this issue Nov 1, 2022
Only create new file version if OCR result was not empty

Signed-off-by: Robin Windey <[email protected]>

@R0Wi R0Wi closed this as completed in #160 Nov 1, 2022