Feature Request: Do not modify files that had no OCR done #130

doppelgrau · 2022-06-25T20:35:03Z

Hello,

I just noticed that files for which no additional OCR was performed are still shown as changed, with a new version, in Nextcloud.
It would be nicer, in my view, if the file were left unmodified in that case (to avoid cluttering the file history and perhaps to preserve other things such as digital signatures).

I don't know whether this can be done easily, or whether there are other reasons to keep the current behavior.

Thanks for your work so far.

R0Wi · 2022-06-26T12:09:44Z

Hi @doppelgrau, I think the problem here is that the app just processes any PDF file matching your workflow settings and does not analyze the output. Also, I'm not sure whether there is a reliable way to detect if ocrmypdf added OCR information or not. @bahnwaerter, do you know if that is possible?

doppelgrau · 2022-06-26T15:28:11Z

I took a look at the documentation. If you run it without any of the options (--skip-text, --redo-ocr, --force-ocr), return code 6 indicates that the file already contains text:

:~$ ocrmypdf --skip-text input.pdf output.pdf
Scan: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 308.62page/s]
   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 3
WARNING -    1: [tesseract] lots of diacritics - possibly poor OCR                                                                                                                            
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.0/2.0 [00:01<00:00,  1.25page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
:~$ echo $?
0
:~$ ocrmypdf --skip-text output.pdf outpu2.pdf
Scan: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 193.35page/s]
   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: skipping all processing on this page                                                                                                                                          
   INFO -    2: skipping all processing on this page                                                                                                                                          
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.0/2.0 [00:00<00:00, 134.30page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: -0.0%
   INFO - Image optimization did not improve the file - discarded
   INFO - Output file is a PDF/A-2B (as expected)
:~$ echo $?
0
:~$ ocrmypdf output.pdf outpu2.pdf
Scan: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 132.82page/s]
   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 3
OCR:   0%|                                                                                                                                                        | 0.0/2.0 [00:00<?, ?page/s]
  ERROR - PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR)
:~$ echo $?
6
:~$

So this is only easy if it is combined with #129, which would have to allow none of these options to be used.
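A minimal sketch of the "return code 6" check, written in Python purely for illustration (the app itself is not written in Python). The function name `run_plain_ocr` and the injectable `run` parameter are hypothetical; the only facts taken from the transcript above are that a plain `ocrmypdf in out` call exits with 0 after OCR and with 6 when the page already has text.

```python
import subprocess

# Exit code shown in the transcript above when ocrmypdf aborts with
# "PriorOcrFoundError: page already has text!".
EXIT_ALREADY_HAS_TEXT = 6


def run_plain_ocr(input_pdf: str, output_pdf: str, run=subprocess.run) -> bool:
    """Run ocrmypdf without --skip-text, --redo-ocr or --force-ocr.

    Returns True if OCR ran (exit code 0), False if the input already
    contained text (exit code 6), and raises on any other exit code.
    `run` is injectable so the logic can be exercised without ocrmypdf.
    """
    result = run(["ocrmypdf", input_pdf, output_pdf])
    if result.returncode == 0:
        return True
    if result.returncode == EXIT_ALREADY_HAS_TEXT:
        return False
    raise RuntimeError(f"ocrmypdf failed with exit code {result.returncode}")
```

A caller would create a new file version only when the function returns True. Note that this works only when none of the three options is passed, which is why it depends on #129.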

R0Wi · 2022-06-27T06:44:43Z

Interesting, thanks for your efforts. Explicitly checking for exit code 6 and then skipping the creation of a new document is not a big deal. Introducing a "none" option in #129 is also perfectly doable. But to be honest, I'm not really happy with that solution, since it would make the behaviour depend on the workflow settings you choose. It would be much cleaner to check the output file for, let's say, some "OCR layer" and use that as the condition.

Ping @bahnwaerter for contact; we'll report our results here. Thanks for your patience.

doppelgrau · 2022-06-27T11:24:57Z

I agree, checking whether something has changed would be nicer. But I found no nice way to do that with ocrmypdf, so the "return code 6 solution" might be a compromise.

With --skip-text, parsing the output might be possible (if a "skipping all processing on this page" line is printed for each page ...), but that seems ugly and also risks breaking when ocrmypdf/tesseract is updated.
A bit more resource intensive, but more robust, might be extracting the text before and after with something like pdftotext and skipping the new version if the text didn't change.
(For the "after" text, maybe the --sidecar option could be used, if its output follows the same logic as pdftotext.)
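The before/after comparison could be sketched roughly as follows, in Python for illustration only. It assumes the `pdftotext` command-line tool is installed and that `-` as the output argument sends the text to stdout; the helper names (`extract_text`, `normalize`, `text_changed`) are hypothetical. Whitespace is normalized before comparing, on the assumption that layout differences alone should not count as a change.

```python
import subprocess


def extract_text(pdf_path: str, run=subprocess.run) -> str:
    """Return the text layer of a PDF via the pdftotext CLI ("-" means stdout)."""
    result = run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def text_changed(before: str, after: str) -> bool:
    """True if the extracted text differs after whitespace normalization."""
    return normalize(before) != normalize(after)
```

A caller would run `extract_text` on the input and on the OCR output, and keep the original file when `text_changed` returns False.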

R0Wi · 2022-06-27T15:01:56Z

Thanks @doppelgrau, the hint about the --sidecar option looks very promising. I did some quick tests, and it seems that ocrmypdf only writes content to the sidecar file if it actually performed OCR. Of course the file always contains text if we use --redo-ocr, but that seems correct to me.

So a solution could be to always use the --sidecar option with a temporary Nextcloud text file, and to check after OCR processing whether that file has content (file size > 0).

Just to pin this down for myself:

- Check how temporary files can be created in the Nextcloud environment (see https://github.com/nextcloud/server/blob/0fe7064fc4fbf15272031f852ffbdb6ac08bc6ef/lib/public/ITempManager.php)
- Add the appropriate arguments at https://github.com/R0Wi/workflow_ocr/blob/80101b94c0c6dfb4efc64bf51fb3333ca196c324/lib/OcrProcessors/OcrMyPdfBasedProcessor.php#L112
- Add a sidecar file size check at https://github.com/R0Wi/workflow_ocr/blob/80101b94c0c6dfb4efc64bf51fb3333ca196c324/lib/OcrProcessors/OcrMyPdfBasedProcessor.php#L89 and throw OcrNotPossibleException if the file has zero bytes
- Possibly make this option ("Only produce a new file version if OCR text was added") configurable per workflow?
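The steps above could look roughly like this. It is a Python sketch for illustration, not the app's actual code: `ocr_added_text` and the injectable `run` parameter are hypothetical, and `tempfile` stands in for Nextcloud's temp manager. It relies on the observation from the quick tests in this thread that the sidecar file stays empty when ocrmypdf did not perform any OCR, which may not hold for every ocrmypdf version.

```python
import os
import subprocess
import tempfile


def ocr_added_text(input_pdf, output_pdf, extra_args=("--skip-text",),
                   run=subprocess.run) -> bool:
    """Run ocrmypdf with --sidecar and report whether any OCR text was written.

    Assumption (from the thread): the sidecar file has zero bytes when no
    OCR was performed. `run` is injectable for testing without ocrmypdf.
    """
    # Create an empty temporary text file to receive the recognized text.
    fd, sidecar = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        run(
            ["ocrmypdf", *extra_args, "--sidecar", sidecar, input_pdf, output_pdf],
            check=True,
        )
        # A non-empty sidecar means OCR produced text for at least one page.
        return os.path.getsize(sidecar) > 0
    finally:
        os.remove(sidecar)
```

When the function returns False, the caller would skip creating a new file version, which corresponds to throwing OcrNotPossibleException in the list above.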

bahnwaerter · 2022-07-02T17:44:59Z

> [...] i'm not sure if there's a reliable way to detect if ocrmypdf added some OCR information or not.
> @bahnwaerter do you know if that is possible?

This is a very difficult task if this should be solved outside of ocrmypdf as a simple pre-check. A major reason for this difficulty is that ocrmypdf supports multiple PDF renderers. Each PDF renderer can embed the OCR information differently in the output PDF file which makes a detection as part of this workflow app difficult.

One possibility to get the information is mentioned by @doppelgrau in a comment above and addresses the checking of ocrmypdf's return code. This solution can't be implemented as a pre-check since a full OCR process has to be triggered to obtain a valid return code.

Another idea mentioned by @doppelgrau in a comment above is to get the information by parsing and checking the output of the --skip-text option. This isn't a great solution, since the parsing and checking depend on the output format of the OCR backend used by ocrmypdf. In addition, we use ocrmypdf as an interface to OCR backends like tesseract, so we should not bypass that interface and access implementation details directly.

A third solution, proposed by @R0Wi in the previous comment, makes use of the --sidecar option to detect any available OCR text in the specified PDF file. This detection can't be implemented as a pre-check either, since a full OCR process has to be performed while writing the already embedded OCR text into a temporary file.

I suggest implementing this new feature with the third proposed solution from @R0Wi. It does not break the existing behavior of this workflow app as long as it is configurable through an optional workflow option. So I agree with @R0Wi's bullet points for implementing the requested feature.

R0Wi mentioned this issue Jun 27, 2022

Make OCR skip options configurable #129

Closed

3 tasks

R0Wi mentioned this issue Aug 24, 2022

Dispatch OCP Event with OcrProcessorResult #144

Closed

R0Wi added a commit that referenced this issue Sep 19, 2022

Implement #130

82255ed

Only create new file version if OCR result was not empty Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Oct 24, 2022

Implement #130

aa2430a

Only create new file version if OCR result was not empty Signed-off-by: Robin Windey <[email protected]>

R0Wi linked a pull request Oct 24, 2022 that will close this issue

Implement #130 #160

Merged

R0Wi added a commit that referenced this issue Oct 24, 2022

Implement #130

b33bcbe

Only create new file version if OCR result was not empty Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Nov 1, 2022

Implement #130 (#160)

8cfba2c

Only create new file version if OCR result was not empty Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]>

R0Wi closed this as completed in #160 Nov 1, 2022

R0Wi added a commit that referenced this issue Nov 1, 2022

Implement #130 (#160)

1270eb4

Only create new file version if OCR result was not empty Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]>

R0Wi added a commit that referenced this issue Nov 1, 2022

Implement #130 (#160)

076845e

Only create new file version if OCR result was not empty Signed-off-by: Robin Windey <[email protected]> Signed-off-by: Robin Windey <[email protected]>
