Add image conversion #111

R0Wi · 2022-04-09T15:34:05Z

Extend #108

Closing #107

codecov · 2022-04-26T16:53:09Z

Codecov Report

Merging #111 (4513bcd) into master (2469940) will not change coverage.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##              master      #111   +/-   ##
===========================================
  Coverage     100.00%   100.00%           
- Complexity       125       131    +6     
===========================================
  Files             23        25    +2     
  Lines            457       479   +22     
  Branches           4         4           
===========================================
+ Hits             457       479   +22

Impacted Files	Coverage Δ
lib/Operation.php	`100.00% <ø> (ø)`
lib/AppInfo/Application.php	`100.00% <100.00%> (ø)`
lib/BackgroundJobs/ProcessFileJob.php	`100.00% <100.00%> (ø)`
lib/OcrProcessors/ImageOcrProcessor.php	`100.00% <100.00%> (ø)`
lib/OcrProcessors/OcrMyPdfBasedProcessor.php	`100.00% <100.00%> (ø)`
lib/OcrProcessors/OcrProcessorFactory.php	`100.00% <100.00%> (ø)`
lib/OcrProcessors/OcrProcessorResult.php	`100.00% <100.00%> (ø)`
lib/Service/OcrService.php	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2469940...4513bcd.

R0Wi · 2022-04-27T11:58:14Z

@lbdroid would be nice if you could give us some feedback. You can install this version by downloading the artifact from https://github.com/R0Wi/workflow_ocr/suites/6281002808/artifacts/224633584 . Thanks !

lbdroid · 2022-04-27T13:44:07Z

I'll be happy to provide feedback on it. I can look at it middle of next week -- I work in accounting, and the personal tax return filing deadline here in Canada is on Monday.

R0Wi · 2022-04-27T13:46:15Z

Sounds good to me, thanks in advance 😺

bahnwaerter

I added comments at two locations in the code regarding the design and safety of the conversion step. Please resolve the issues raised in those comments and questions before approving and merging this feature. Everything else looks fine to me.

bahnwaerter · 2022-05-06T21:47:00Z

lib/OcrProcessors/PdfOcrProcessor.php

+	public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult {
+		if ($file->getMimeType() !== 'application/pdf') {
+			// Convert file to pdf. Here we assume that we're dealing with an image input
+			$pdfContent = $this->converter->convertToPdf($file->getContent());


This line is potentially unsafe. If a programmer later adds an unsupported, non-image MIME type to the mapping table in the OcrProcessorFactory constructor, the create method will return a fresh, valid OcrProcessor object for it instead of throwing an exception. For example, if text/plain were added, OCR processing of a plain text file would reach this image conversion statement with a valid OcrProcessor for text/plain, but with input that is not a supported image type. Execution on this path would then lead to unpredictable behavior, and the overall OCR processing could crash.

Therefore, the conversion step should also be guarded for arbitrary MIME types. This could be done by extending the constructor mapping table of the OcrProcessorFactory with the corresponding input-to-PDF converter where a conversion is necessary, and with none otherwise. The table could then serve as a lookup table: arbitrary input is preprocessed with the appropriate input-to-PDF converter if one is registered, and the conversion step is skipped entirely otherwise.
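A minimal sketch of the lookup-table idea described above, not the code that was eventually merged. The function names `buildConverterTable` and `toPdf` are hypothetical, and converters are modelled as plain callables that take file content and return PDF content, which is an assumption made to keep the example self-contained.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical lookup table: MIME type => converter callable, or null when
 * the input is already a PDF and no conversion step is needed.
 *
 * @return array<string, (callable(string): string)|null>
 */
function buildConverterTable(callable $imageToPdf): array {
    return [
        'application/pdf' => null,
        'image/jpeg' => $imageToPdf,
        'image/png' => $imageToPdf,
    ];
}

/**
 * Returns PDF content for a registered MIME type and fails loudly for
 * anything that is not in the table (e.g. text/plain).
 */
function toPdf(array $table, string $mimeType, string $content): string {
    // array_key_exists() instead of isset(): null is a valid table value.
    if (!array_key_exists($mimeType, $table)) {
        throw new InvalidArgumentException("Unsupported MIME type: $mimeType");
    }
    $converter = $table[$mimeType];
    return $converter === null ? $content : $converter($content);
}
```

With a table like this, adding a new MIME type without deciding on its converter is not possible, so an unsupported input cannot silently reach the image conversion path.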

bahnwaerter · 2022-05-06T22:10:52Z

lib/OcrProcessors/PdfOcrProcessor.php

+	public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult {
+		if ($file->getMimeType() !== 'application/pdf') {
+			// Convert file to pdf. Here we assume that we're dealing with an image input
+			$pdfContent = $this->converter->convertToPdf($file->getContent());


Why do we need a conversion from an image file to a PDF file here?

If we convert an image to a PDF and pass the PDF to OCRmyPDF, the original input image is converted twice: OCRmyPDF converts the input PDF again (using Ghostscript) before feeding Tesseract with the doubly converted image. With lossy input formats (e.g. JPEG), this two-step conversion reduces image quality, which could degrade OCR quality. So why not drop the image conversion and pass the input images directly to Tesseract instead?

bahnwaerter · 2022-05-06T22:16:47Z

lib/Wrapper/IImageToPdfConverter.php

+
+namespace OCA\WorkflowOcr\Wrapper;
+
+interface IImageToPdfConverter {


Here, the conversion interface could be even more abstract, since not only image files but any input files, such as simple text files, can be converted to PDF files.

Suggested change

-interface IImageToPdfConverter {
+interface IInputToPdfConverter {
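To illustrate the suggested rename, here is a hedged sketch of what the more general interface might look like. The method name `convertToPdf` appears in the diff above; its string-in, string-out signature, the `StubTextToPdfConverter` class, and the `preprocess` function are assumptions for illustration only, and the namespace is omitted so the snippet runs on its own.

```php
<?php
declare(strict_types=1);

// Hypothetical generalisation of IImageToPdfConverter, as proposed in the review.
interface IInputToPdfConverter {
    /** Takes arbitrary file content and returns PDF file content. */
    public function convertToPdf(string $content): string;
}

/** Placeholder implementation for illustration; it performs no real conversion. */
final class StubTextToPdfConverter implements IInputToPdfConverter {
    public function convertToPdf(string $content): string {
        return "%PDF-stub\n" . $content;
    }
}

/** Callers depend only on the interface, not on the kind of input. */
function preprocess(IInputToPdfConverter $converter, string $content): string {
    return $converter->convertToPdf($content);
}
```

The point of the broader name is that an image converter and a text converter can both be passed to the same caller without the caller knowing which one it received.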

bahnwaerter

Thanks @lbdroid and @R0Wi for your patches. After the change of the original feature patches, we have a good-looking and future-proof interface to implement various OCR processors, e.g. the image-based OCRmyPDF processor. Note that I haven't tested the new implementation (changed patches) on a fresh Nextcloud yet. But feel free to retest and release the new feature!

R0Wi · 2022-05-11T06:00:06Z

@bahnwaerter thanks for the review. I will clean up the docs and add some additional tests over the next few days, and I'll let you know when it's ready to merge 🚀

sonarqubecloud · 2022-05-21T11:31:33Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

R0Wi · 2022-05-21T22:15:34Z

The backport to stable23 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-stable23 stable23
# Navigate to the new working tree
cd .worktrees/backport-stable23
# Create a new branch
git switch --create backport-111-to-stable23
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick --mainline 1 5efcfebeffde2136fa1f7768ba31240a535b9ec8
# Push it to GitHub
git push --set-upstream origin backport-111-to-stable23
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-stable23

Then, create a pull request where the base branch is stable23 and the compare/head branch is backport-111-to-stable23.

* Add image conversion capability for JPEG and PNG images. This conversion uses ImageMagick's "convert" command to create a new PDF file. It does not OCR the file during the conversion, but requires a separate flow for the newly created PDF file.
* Implement optional image conversion before PDF processing
* Add optional png/jpg conversion via Imagick
* Closing #107 tmp test
* Fix Psalm errors
* Use api packages from vendor
* Update Psalm baseline
* Apply Psalm autofix
* Move OCR processors to different classes
* Code- & docs cleanup
* Fix code smells & psalm errors

Co-authored-by: lbdroid <[email protected]>

R0Wi force-pushed the feature/add-image-conversion branch from 12537a5 to 02e81f2 Compare April 26, 2022 16:50

R0Wi force-pushed the feature/add-image-conversion branch 5 times, most recently from 3112ec9 to bc6d172 Compare April 26, 2022 17:32

R0Wi linked an issue Apr 26, 2022 that may be closed by this pull request

Only works on PDF files #107

Closed

R0Wi force-pushed the feature/add-image-conversion branch 5 times, most recently from 4667a53 to f02e66b Compare April 27, 2022 11:51

R0Wi requested a review from bahnwaerter April 27, 2022 11:54

R0Wi marked this pull request as ready for review April 27, 2022 11:54

R0Wi mentioned this pull request Apr 27, 2022

Add image conversion capability for JPEG and PNG images. #108

Closed

R0Wi added backport stable22 backport stable23 labels Apr 27, 2022

bahnwaerter requested changes May 6, 2022

R0Wi added backport stable24 and removed backport stable22 labels May 7, 2022

R0Wi force-pushed the feature/add-image-conversion branch 3 times, most recently from 676d888 to 1fbc5bd Compare May 7, 2022 21:43

bahnwaerter approved these changes May 10, 2022

R0Wi added this to the v1.23.3 milestone May 11, 2022

lbdroid and others added 3 commits May 21, 2022 13:13

Add image conversion capability for JPEG and PNG images.

9f0ec44

This conversion uses ImageMagick's "convert" command to create a new PDF file. It does not OCR the file during the conversion, but requires a separate flow for the newly created PDF file.

Implement optional image conversion before PDF processing

200f461

* Add optional png/jpg conversion via Imagick * Closing #107 tmp test

Fix Psalm errors

25e3c3a

* Use api packages from vendor * Update Psalm baseline * Apply Psalm autofix

R0Wi force-pushed the feature/add-image-conversion branch from 51ff1ad to 3f7c90c Compare May 21, 2022 11:17

Move OCR processors to different classes

280b14e

R0Wi force-pushed the feature/add-image-conversion branch from 3f7c90c to d31aed0 Compare May 21, 2022 11:22

R0Wi added 2 commits May 21, 2022 13:30

Code- & docs cleanup

6b03710

Fix code smells & psalm errors

4513bcd

R0Wi force-pushed the feature/add-image-conversion branch from 0475a31 to 4513bcd Compare May 21, 2022 11:31

R0Wi merged commit 5efcfeb into master May 21, 2022

R0Wi deleted the feature/add-image-conversion branch May 21, 2022 22:15

R0Wi mentioned this pull request May 21, 2022

[Backport stable24] Add image conversion #123

Merged

R0Wi mentioned this pull request May 22, 2022

[Backport stable23] Add image conversion #124

Merged
