Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add image conversion #111

Merged
merged 6 commits into from
May 21, 2022
Merged

Add image conversion #111

merged 6 commits into from
May 21, 2022

Conversation

R0Wi
Copy link
Contributor

@R0Wi R0Wi commented Apr 9, 2022

Extend #108

Closing #107

@R0Wi R0Wi force-pushed the feature/add-image-conversion branch from 12537a5 to 02e81f2 Compare April 26, 2022 16:50
@codecov
Copy link

codecov bot commented Apr 26, 2022

Codecov Report

Merging #111 (4513bcd) into master (2469940) will not change coverage.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##              master      #111   +/-   ##
===========================================
  Coverage     100.00%   100.00%           
- Complexity       125       131    +6     
===========================================
  Files             23        25    +2     
  Lines            457       479   +22     
  Branches           4         4           
===========================================
+ Hits             457       479   +22     
Impacted Files Coverage Δ
lib/Operation.php 100.00% <ø> (ø)
lib/AppInfo/Application.php 100.00% <100.00%> (ø)
lib/BackgroundJobs/ProcessFileJob.php 100.00% <100.00%> (ø)
lib/OcrProcessors/ImageOcrProcessor.php 100.00% <100.00%> (ø)
lib/OcrProcessors/OcrMyPdfBasedProcessor.php 100.00% <100.00%> (ø)
lib/OcrProcessors/OcrProcessorFactory.php 100.00% <100.00%> (ø)
lib/OcrProcessors/OcrProcessorResult.php 100.00% <100.00%> (ø)
lib/Service/OcrService.php 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2469940...4513bcd. Read the comment docs.

@R0Wi R0Wi force-pushed the feature/add-image-conversion branch 5 times, most recently from 3112ec9 to bc6d172 Compare April 26, 2022 17:32
@R0Wi R0Wi linked an issue Apr 26, 2022 that may be closed by this pull request
@R0Wi R0Wi force-pushed the feature/add-image-conversion branch 5 times, most recently from 4667a53 to f02e66b Compare April 27, 2022 11:51
@R0Wi R0Wi requested a review from bahnwaerter April 27, 2022 11:54
@R0Wi R0Wi marked this pull request as ready for review April 27, 2022 11:54
@R0Wi
Copy link
Contributor Author

R0Wi commented Apr 27, 2022

@lbdroid would be nice if you could give us some feedback. You can install this version by downloading the artifact from https://github.com/R0Wi/workflow_ocr/suites/6281002808/artifacts/224633584 . Thanks !

@lbdroid
Copy link
Contributor

lbdroid commented Apr 27, 2022

@lbdroid would be nice if you could give us some feedback. You can install this version by downloading the artifact from https://github.com/R0Wi/workflow_ocr/suites/6281002808/artifacts/224633584 . Thanks !

I'll be happy to provide feedback on it. I can look at it middle of next week -- I work in accounting, and the personal tax return filing deadline here in Canada is on Monday.

@R0Wi
Copy link
Contributor Author

R0Wi commented Apr 27, 2022

@lbdroid would be nice if you could give us some feedback. You can install this version by downloading the artifact from https://github.com/R0Wi/workflow_ocr/suites/6281002808/artifacts/224633584 . Thanks !

I'll be happy to provide feedback on it. I can look at it middle of next week -- I work in accounting, and the personal tax return filing deadline here in Canada is on Monday.

Sounds good to me, thanks in advance 😺

Copy link
Collaborator

@bahnwaerter bahnwaerter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two locations in the code where I added some comments regarding the conversion step design and its safety. Please resolve the problems that are addressed by the comments and questions before approving and merging this feature. Everything else looks fine to me.

public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult {
if ($file->getMimeType() !== 'application/pdf') {
// Convert file to pdf. Here we assume that we're dealing with an image input
$pdfContent = $this->converter->convertToPdf($file->getContent());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line of code can be dangerous and can be eventually unsafe if a programmer adds more unsupported image MIME types to the constructor mapping table of the OcrProcessorFactory where the create method would return a fresh and valid OcrProcessor object and would not throw any exception. If the MIME type text/plain would be added, an OCR processing of a simple text file with the new MIME type text/plain would reach the image conversion code statement at this location with a valid OcrProcessor for the text/plain MIME type but without carrying any input with a supported image MIME type. Then, the execution on this path would lead to an unpredictable behavior, so the overall OCR processing could crash.

Therefore, we should check the conversion step for an arbitrary MIME type as well. This could be done by extending the constructor mapping table of the OcrProcessorFactory with the corresponding input-to-PDF converter if a conversion is necessary, otherwise specify none. Such a mapping table could then be used as a lookup table to preprocess the arbitrary input with the appropriate input-to-PDF converter if necessary, otherwise the conversion step can be skipped entirely.

public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult {
if ($file->getMimeType() !== 'application/pdf') {
// Convert file to pdf. Here we assume that we're dealing with an image input
$pdfContent = $this->converter->convertToPdf($file->getContent());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a conversion from an image file to a PDF file here?

If we convert an image to a PDF and pass the PDF to OCRmyPDF, the original input image will be converted twice. OCRmyPDF converts the input PDF again (using ghostscript) before feeding tesseract with the double converted image. Due to the 2-step conversion of input images, the image quality suffers with lossy input image formats (e.g. JPEG) which could result in a degraded OCR quality. So, why do we not remove any image conversion and pass the input images directly to tesseract instead?


namespace OCA\WorkflowOcr\Wrapper;

interface IImageToPdfConverter {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the conversion interface can be even more abstract, since not only image files can be converted to PDF files, but also any input files, such as simple text files.

Suggested change
interface IImageToPdfConverter {
interface IInputToPdfConverter {

@R0Wi R0Wi force-pushed the feature/add-image-conversion branch 3 times, most recently from 676d888 to 1fbc5bd Compare May 7, 2022 21:43
Copy link
Collaborator

@bahnwaerter bahnwaerter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lbdroid and @R0Wi for your patches. After the change of the original feature patches, we have a good-looking and future-proof interface to implement various OCR processors, e.g. the image-based OCRmyPDF processor. Note that I haven't tested the new implementation (changed patches) on a fresh Nextcloud yet. But feel free to retest and release the new feature!

@R0Wi
Copy link
Contributor Author

R0Wi commented May 11, 2022

@bahnwaerter thx for review. I will cleanup the docs and add some additional test these days. I will inform you when ready to merge 🚀

@R0Wi R0Wi added this to the v1.23.3 milestone May 11, 2022
lbdroid and others added 3 commits May 21, 2022 13:13
This conversion uses ImageMagick's "convert" command to create a
new PDF file. It does not OCR the file during the conversion, but
requires a separate flow for the newly created PDF file.
* Add optional png/jpg conversion via Imagick
* Closing #107

tmp test
* Use api packages from vendor
* Update Psalm baseline
* Apply Psalm autofix
@R0Wi R0Wi force-pushed the feature/add-image-conversion branch from 51ff1ad to 3f7c90c Compare May 21, 2022 11:17
@R0Wi R0Wi force-pushed the feature/add-image-conversion branch from 3f7c90c to d31aed0 Compare May 21, 2022 11:22
@R0Wi R0Wi force-pushed the feature/add-image-conversion branch from 0475a31 to 4513bcd Compare May 21, 2022 11:31
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@R0Wi R0Wi merged commit 5efcfeb into master May 21, 2022
@R0Wi R0Wi deleted the feature/add-image-conversion branch May 21, 2022 22:15
@R0Wi
Copy link
Contributor Author

R0Wi commented May 21, 2022

The backport to stable23 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-stable23 stable23
# Navigate to the new working tree
cd .worktrees/backport-stable23
# Create a new branch
git switch --create backport-111-to-stable23
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick ---mainline 1 5efcfebeffde2136fa1f7768ba31240a535b9ec8
# Push it to GitHub
git push --set-upstream origin backport-111-to-stable23
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-stable23

Then, create a pull request where the base branch is stable23 and the compare/head branch is backport-111-to-stable23.

R0Wi added a commit that referenced this pull request May 21, 2022
* Add image conversion capability for JPEG and PNG images.

This conversion uses ImageMagick's "convert" command to create a
new PDF file. It does not OCR the file during the conversion, but
requires a separate flow for the newly created PDF file.

* Implement optional image conversion before PDF processing

* Add optional png/jpg conversion via Imagick
* Closing #107

tmp test

* Fix Psalm errors

* Use api packages from vendor
* Update Psalm baseline
* Apply Psalm autofix

* Move OCR processors to different classes

* Code- & docs cleanup

* Fix code smells & psalm errors

Co-authored-by: lbdroid <[email protected]>
R0Wi added a commit that referenced this pull request May 22, 2022
* Add image conversion capability for JPEG and PNG images.

This conversion uses ImageMagick's "convert" command to create a
new PDF file. It does not OCR the file during the conversion, but
requires a separate flow for the newly created PDF file.

* Implement optional image conversion before PDF processing

* Add optional png/jpg conversion via Imagick
* Closing #107

tmp test

* Fix Psalm errors

* Use api packages from vendor
* Update Psalm baseline
* Apply Psalm autofix

* Move OCR processors to different classes

* Code- & docs cleanup

* Fix code smells & psalm errors

Co-authored-by: lbdroid <[email protected]>
R0Wi added a commit that referenced this pull request May 22, 2022
* Add image conversion capability for JPEG and PNG images.

This conversion uses ImageMagick's "convert" command to create a
new PDF file. It does not OCR the file during the conversion, but
requires a separate flow for the newly created PDF file.

* Implement optional image conversion before PDF processing

* Add optional png/jpg conversion via Imagick
* Closing #107

tmp test

* Fix Psalm errors

* Use api packages from vendor
* Update Psalm baseline
* Apply Psalm autofix

* Move OCR processors to different classes

* Code- & docs cleanup

* Fix code smells & psalm errors

Co-authored-by: lbdroid <[email protected]>

Co-authored-by: lbdroid <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Only works on PDF files
3 participants