-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add image conversion #111
Add image conversion #111
Conversation
12537a5
to
02e81f2
Compare
Codecov Report
@@ Coverage Diff @@
## master #111 +/- ##
===========================================
Coverage 100.00% 100.00%
- Complexity 125 131 +6
===========================================
Files 23 25 +2
Lines 457 479 +22
Branches 4 4
===========================================
+ Hits 457 479 +22
Continue to review full report at Codecov.
|
3112ec9
to
bc6d172
Compare
4667a53
to
f02e66b
Compare
@lbdroid would be nice if you could give us some feedback. You can install this version by downloading the artifact from https://github.com/R0Wi/workflow_ocr/suites/6281002808/artifacts/224633584 . Thanks ! |
I'll be happy to provide feedback on it. I can look at it middle of next week -- I work in accounting, and the personal tax return filing deadline here in Canada is on Monday. |
Sounds good to me, thanks in advance 😺 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two locations in the code where I added some comments regarding the conversion step design and its safety. Please resolve the problems that are addressed by the comments and questions before approving and merging this feature. Everything else looks fine to me.
public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult { | ||
if ($file->getMimeType() !== 'application/pdf') { | ||
// Convert file to pdf. Here we assume that we're dealing with an image input | ||
$pdfContent = $this->converter->convertToPdf($file->getContent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of code can be dangerous and can be eventually unsafe if a programmer adds more unsupported image MIME types to the constructor mapping table of the OcrProcessorFactory
where the create
method would return a fresh and valid OcrProcessor
object and would not throw any exception. If the MIME type text/plain
would be added, an OCR processing of a simple text file with the new MIME type text/plain
would reach the image conversion code statement at this location with a valid OcrProcessor
for the text/plain
MIME type but without carrying any input with a supported image MIME type. Then, the execution on this path would lead to an unpredictable behavior, so the overall OCR processing could crash.
Therefore, we should check the conversion step for an arbitrary MIME type as well. This could be done by extending the constructor mapping table of the OcrProcessorFactory
with the corresponding input-to-PDF converter if a conversion is necessary, otherwise specify none. Such a mapping table could then be used as a lookup table to preprocess the arbitrary input with the appropriate input-to-PDF converter if necessary, otherwise the conversion step can be skipped entirely.
public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult { | ||
if ($file->getMimeType() !== 'application/pdf') { | ||
// Convert file to pdf. Here we assume that we're dealing with an image input | ||
$pdfContent = $this->converter->convertToPdf($file->getContent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a conversion from an image file to a PDF file here?
If we convert an image to a PDF and pass the PDF to OCRmyPDF, the original input image will be converted twice. OCRmyPDF converts the input PDF again (using ghostscript) before feeding tesseract with the double converted image. Due to the 2-step conversion of input images, the image quality suffers with lossy input image formats (e.g. JPEG) which could result in a degraded OCR quality. So, why do we not remove any image conversion and pass the input images directly to tesseract instead?
lib/Wrapper/IImageToPdfConverter.php
Outdated
|
||
namespace OCA\WorkflowOcr\Wrapper; | ||
|
||
interface IImageToPdfConverter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the conversion interface can be even more abstract, since not only image files can be converted to PDF files, but also any input files, such as simple text files.
interface IImageToPdfConverter { | |
interface IInputToPdfConverter { |
676d888
to
1fbc5bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lbdroid and @R0Wi for your patches. After the change of the original feature patches, we have a good-looking and future-proof interface to implement various OCR processors, e.g. the image-based OCRmyPDF processor. Note that I haven't tested the new implementation (changed patches) on a fresh Nextcloud yet. But feel free to retest and release the new feature!
@bahnwaerter thx for review. I will cleanup the docs and add some additional test these days. I will inform you when ready to merge 🚀 |
This conversion uses ImageMagick's "convert" command to create a new PDF file. It does not OCR the file during the conversion, but requires a separate flow for the newly created PDF file.
* Add optional png/jpg conversion via Imagick * Closing #107 tmp test
* Use api packages from vendor * Update Psalm baseline * Apply Psalm autofix
51ff1ad
to
3f7c90c
Compare
3f7c90c
to
d31aed0
Compare
0475a31
to
4513bcd
Compare
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-stable23 stable23
# Navigate to the new working tree
cd .worktrees/backport-stable23
# Create a new branch
git switch --create backport-111-to-stable23
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick ---mainline 1 5efcfebeffde2136fa1f7768ba31240a535b9ec8
# Push it to GitHub
git push --set-upstream origin backport-111-to-stable23
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-stable23 Then, create a pull request where the |
* Add image conversion capability for JPEG and PNG images. This conversion uses ImageMagick's "convert" command to create a new PDF file. It does not OCR the file during the conversion, but requires a separate flow for the newly created PDF file. * Implement optional image conversion before PDF processing * Add optional png/jpg conversion via Imagick * Closing #107 tmp test * Fix Psalm errors * Use api packages from vendor * Update Psalm baseline * Apply Psalm autofix * Move OCR processors to different classes * Code- & docs cleanup * Fix code smells & psalm errors Co-authored-by: lbdroid <[email protected]>
* Add image conversion capability for JPEG and PNG images. This conversion uses ImageMagick's "convert" command to create a new PDF file. It does not OCR the file during the conversion, but requires a separate flow for the newly created PDF file. * Implement optional image conversion before PDF processing * Add optional png/jpg conversion via Imagick * Closing #107 tmp test * Fix Psalm errors * Use api packages from vendor * Update Psalm baseline * Apply Psalm autofix * Move OCR processors to different classes * Code- & docs cleanup * Fix code smells & psalm errors Co-authored-by: lbdroid <[email protected]>
* Add image conversion capability for JPEG and PNG images. This conversion uses ImageMagick's "convert" command to create a new PDF file. It does not OCR the file during the conversion, but requires a separate flow for the newly created PDF file. * Implement optional image conversion before PDF processing * Add optional png/jpg conversion via Imagick * Closing #107 tmp test * Fix Psalm errors * Use api packages from vendor * Update Psalm baseline * Apply Psalm autofix * Move OCR processors to different classes * Code- & docs cleanup * Fix code smells & psalm errors Co-authored-by: lbdroid <[email protected]> Co-authored-by: lbdroid <[email protected]>
Extend #108
Closing #107