Implement optional image conversion before PDF processing
* Add optional png/jpg conversion via Imagick
* Closing #107
R0Wi committed Apr 26, 2022
1 parent 0d8a05a commit 3112ec9
Showing 18 changed files with 348 additions and 182 deletions.
21 changes: 19 additions & 2 deletions README.md
@@ -59,7 +59,23 @@ In the backend [`OCRmyPDF`](https://github.com/jbarlow83/OCRmyPDF) is used for p
apt-get install ocrmypdf
```

For conversion of non-PDF image files, the commandline tool 'convert' from ImageMagick is used.
To convert image files (`jpg`/`png`) to PDF, the [`imagick`](https://www.php.net/manual/de/book.imagick.php) PHP extension is used. Make sure you have the [`imagemagick`](https://imagemagick.org/index.php) tool installed and the PHP extension activated.

```bash
apt-get install imagemagick php-imagick
```

To allow `imagick` to work with PDF files, add/change the following line in your `policy.xml`, so that `rights` is changed from `none` to `read | write`:

```xml
<policymap>
<!-- [...] -->
<policy domain="coder" rights="read | write" pattern="PDF" />
</policymap>

```

Depending on your system and version of imagemagick, this file is usually located in `/etc/ImageMagick-6/policy.xml`.
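
After editing `policy.xml`, a quick check from PHP can confirm that `imagick` is allowed to produce PDF output. The following is a minimal sketch and not part of this commit: `canWritePdf` is a hypothetical helper name, and it assumes that a blocked `PDF` coder surfaces as an `ImagickException` when the blob is generated.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the app): tries to render a 1x1 image
 * as PDF and reports whether ImageMagick allowed it.
 */
function canWritePdf(): bool {
    if (!extension_loaded('imagick')) {
        // The PHP extension is not installed or not activated.
        return false;
    }
    try {
        $image = new Imagick();
        $image->newImage(1, 1, new ImagickPixel('white'));
        $image->setImageFormat('pdf');
        $image->getImageBlob(); // Assumption: throws if the policy denies the PDF coder.
        return true;
    } catch (ImagickException $e) {
        return false;
    }
}

echo canWritePdf() ? "PDF output allowed\n" : "PDF output blocked or imagick missing\n";
```

If the check reports a blocked coder, re-check the `rights` attribute of the `PDF` policy entry and restart the PHP process so the changed policy is picked up.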

Also, if you want to use specific language settings, please install the corresponding `tesseract` packages.

@@ -160,7 +176,7 @@ To **test** if your file gets processed properly you can do the following steps:
For processing PDF files, the external command line tool [`OCRmyPDF`](https://github.com/jbarlow83/OCRmyPDF) is used. The tool is invoked with the [`--redo-ocr`](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped) parameter so that it will perform a detailed text analysis. The detailed analysis masks out visible text and sends the image of each page to the OCR processor. After processing, additional text is inserted as OCR, whereas existing text in a mixed file document (images embedded into text pages) is not disrupted.

### Images
For processing images (JPG, PNG), the external command line tool 'convert' (part of ImageMagick) is used. This performs a file-type conversion only. To also OCR the file, create a separate flow for the generated PDF file.
For processing images (currently `jpg` and `png` are supported), the PHP extension [`imagick`](https://www.php.net/manual/de/book.imagick) is used. This performs a file-type conversion to PDF before processing the file. The converted PDF file will then be saved as a new file with the original filename and the extension `.pdf` (for example `myImage.jpg` will be saved to `myImage.jpg.pdf`). The original image file will remain untouched.
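
The conversion step described above could look roughly like the following. This is a minimal sketch and not the app's actual `ImageToPdfConverter`, whose implementation is not shown here. `imageToPdf` is a hypothetical function name, and only standard `Imagick` methods are used.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: converts raw image bytes (e.g. jpg or png) into PDF bytes.
 * Requires the imagick extension and a policy.xml that permits the PDF coder.
 */
function imageToPdf(string $imageContent): string {
    $image = new Imagick();
    try {
        $image->readImageBlob($imageContent); // Load the image from memory.
        $image->setImageFormat('pdf');        // Switch the output format to PDF.
        return $image->getImageBlob();        // Return the PDF as a byte string.
    } finally {
        $image->clear();                      // Release ImageMagick resources.
    }
}

// Usage (hypothetical file names):
// file_put_contents('myImage.jpg.pdf', imageToPdf(file_get_contents('myImage.jpg')));
```

Working on byte strings rather than file paths matches how the OCR step receives its input on stdin, so no temporary files are needed in this sketch.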

## Development
### Dev setup
@@ -346,3 +362,4 @@ That's all. If you now create a new workflow based on your added mimetype, your
| php-shellcommand | >= 1.6 | https://github.com/mikehaertl/php-shellcommand |
| chain | >= 0.9.0 | https://packagist.org/packages/cocur/chain |
| PHPUnit | >= 8.0 | https://phpunit.de/ |
| imagick | | https://www.php.net/manual/de/book.imagick |
3 changes: 3 additions & 0 deletions lib/AppInfo/Application.php
@@ -40,6 +40,8 @@
use OCA\WorkflowOcr\Wrapper\Filesystem;
use OCA\WorkflowOcr\Wrapper\ICommand;
use OCA\WorkflowOcr\Wrapper\IFilesystem;
use OCA\WorkflowOcr\Wrapper\IImageToPdfConverter;
use OCA\WorkflowOcr\Wrapper\ImageToPdfConverter;
use OCA\WorkflowOcr\Wrapper\IViewFactory;
use OCA\WorkflowOcr\Wrapper\ViewFactory;
use OCP\AppFramework\App;
@@ -68,6 +70,7 @@ public function register(IRegistrationContext $context): void {
$context->registerServiceAlias(IViewFactory::class, ViewFactory::class);
$context->registerServiceAlias(IFilesystem::class, Filesystem::class);
$context->registerServiceAlias(IGlobalSettingsService::class, GlobalSettingsService::class);
$context->registerServiceAlias(IImageToPdfConverter::class, ImageToPdfConverter::class);

// BUG #43
$context->registerService(ICommand::class, function () {
24 changes: 11 additions & 13 deletions lib/BackgroundJobs/ProcessFileJob.php
@@ -165,7 +165,7 @@ private function processFile(string $filePath, WorkflowSettings $settings) : voi
}

try {
$ocrFile = $this->ocrFile($node, $settings);
$ocrFile = $this->ocrService->ocrFile($node, $settings);
} catch (OcrNotPossibleException $ocrNpEx) {
$this->logger->error('OCR for file ' . $node->getPath() . ' not possible. Message: ' . $ocrNpEx->getMessage());
return;
@@ -174,10 +174,16 @@ private function processFile(string $filePath, WorkflowSettings $settings) : voi
return;
}

if ($node->getMimeType() == "application/pdf")
$this->createNewFileVersion($filePath, $ocrFile, $node->getId());
else
$this->createNewFileVersion($filePath.".pdf", $ocrFile, $node->getId());
$fileContent = $ocrFile->getFileContent();
$nodeId = $node->getId();
$originalFileExtension = $node->getExtension();
$newFileExtension = $ocrFile->getFileExtension();

if ($originalFileExtension === $newFileExtension) {
$this->createNewFileVersion($filePath, $fileContent, $nodeId);
} else {
$this->createNewFileVersion($filePath.".pdf", $fileContent, $nodeId);
}
}

private function getNode(string $filePath) : ?Node {
@@ -211,14 +217,6 @@ private function initUserEnvironment(string $uid) : void {
$this->filesystem->init($uid, '/' . $uid . '/files');
}

/**
* @param File $file
* @param WorkflowSettings $settings
*/
private function ocrFile(File $file, WorkflowSettings $settings) : string {
return $this->ocrService->ocrFile($file->getMimeType(), $file->getContent(), $settings);
}

private function shutdownUserEnvironment() : void {
$this->userSession->setUser(null);
}
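
The extension comparison in `processFile` decides whether the OCR result becomes a new version of the same file or a new file next to the original. A minimal sketch of that rule follows. `targetPath` is a hypothetical helper, and unlike the commit (which appends a hard-coded `".pdf"`), it appends whatever extension the result reports. The two behave the same as long as the processor returns `"pdf"`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the branch in processFile:
 * same extension -> overwrite in place (new file version),
 * different extension -> write next to the original.
 */
function targetPath(string $filePath, string $originalExtension, string $newExtension): string {
    if ($originalExtension === $newExtension) {
        return $filePath;
    }
    return $filePath . '.' . $newExtension;
}

echo targetPath('/docs/scan.pdf', 'pdf', 'pdf'), "\n";    // /docs/scan.pdf
echo targetPath('/docs/myImage.jpg', 'jpg', 'pdf'), "\n"; // /docs/myImage.jpg.pdf
```

Note that the strict comparison is case-sensitive, so whether a file named `scan.PDF` is treated as "same extension" depends on what `getExtension()` returns for it.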
5 changes: 3 additions & 2 deletions lib/OcrProcessors/IOcrProcessor.php
@@ -26,15 +26,16 @@
use OCA\WorkflowOcr\Exception\OcrNotPossibleException;
use OCA\WorkflowOcr\Model\GlobalSettings;
use OCA\WorkflowOcr\Model\WorkflowSettings;
use OCP\Files\File;

interface IOcrProcessor {
/**
* Processes OCR on the given file
* @param string $fileContent The file to be processed
* @param File $file The file to be processed
* @param WorkflowSettings $settings The settings to be used for this specific workflow
* @param GlobalSettings $globalSettings The global settings configured for all OCR workflows on this system
* @return string The processed file as byte string
* @throws OcrNotPossibleException
*/
public function ocrFile(string $fileContent, WorkflowSettings $settings, GlobalSettings $globalSettings) : string;
public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings) : OcrProcessorResult;
}
7 changes: 7 additions & 0 deletions lib/OcrProcessors/IOcrProcessorFactory.php
@@ -29,4 +29,11 @@ interface IOcrProcessorFactory {
* @return IOcrProcessor|null
*/
public function create(string $mimeType) : IOcrProcessor;

/**
* Returns true, if an OCR processor for the given mimetype
* can be constructed.
* @return bool
*/
public function canCreate(string $mimeType) : bool;
}
95 changes: 0 additions & 95 deletions lib/OcrProcessors/ImageOcrProcessor.php

This file was deleted.

15 changes: 11 additions & 4 deletions lib/OcrProcessors/OcrProcessorFactory.php
@@ -25,15 +25,16 @@

use OCA\WorkflowOcr\Exception\OcrProcessorNotFoundException;
use OCA\WorkflowOcr\Wrapper\ICommand;
use OCA\WorkflowOcr\Wrapper\IImageToPdfConverter;
use OCP\AppFramework\Bootstrap\IRegistrationContext;
use Psr\Container\ContainerInterface;
use Psr\Log\LoggerInterface;

class OcrProcessorFactory implements IOcrProcessorFactory {
private static $mapping = [
'application/pdf' => PdfOcrProcessor::class,
'image/jpeg' => ImageOcrProcessor::class,
'image/png' => ImageOcrProcessor::class,
'image/jpeg' => PdfOcrProcessor::class,
'image/png' => PdfOcrProcessor::class
];

/** @var ContainerInterface */
@@ -52,16 +53,22 @@ public static function registerOcrProcessors(IRegistrationContext $context) : vo
* under the hood.
*/
$context->registerService(PdfOcrProcessor::class, function (ContainerInterface $c) {
return new PdfOcrProcessor($c->get(ICommand::class), $c->get(LoggerInterface::class));
return new PdfOcrProcessor($c->get(ICommand::class), $c->get(IImageToPdfConverter::class), $c->get(LoggerInterface::class));
}, false);
}

/** @inheritdoc */
public function create(string $mimeType) : IOcrProcessor {
if (!array_key_exists($mimeType, self::$mapping)) {
if (!$this->canCreate($mimeType)) {
throw new OcrProcessorNotFoundException($mimeType);
}
$className = self::$mapping[$mimeType];

return $this->container->get($className);
}

/** @inheritdoc */
public function canCreate(string $mimeType) : bool {
return array_key_exists($mimeType, self::$mapping);
}
}
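
The new `canCreate` method lets callers ask whether a mimetype is supported before requesting a processor, instead of relying on the exception from `create`. The following self-contained sketch illustrates the pattern only. `SketchProcessorFactory` is a hypothetical stand-in that returns class names as strings and throws `InvalidArgumentException`, whereas the real factory resolves instances from the container and throws `OcrProcessorNotFoundException`.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for OcrProcessorFactory, reduced to the lookup logic. */
final class SketchProcessorFactory {
    // Images map to the PDF processor because they are converted to PDF first.
    private const MAPPING = [
        'application/pdf' => 'PdfOcrProcessor',
        'image/jpeg' => 'PdfOcrProcessor',
        'image/png' => 'PdfOcrProcessor',
    ];

    public function canCreate(string $mimeType): bool {
        return array_key_exists($mimeType, self::MAPPING);
    }

    public function create(string $mimeType): string {
        if (!$this->canCreate($mimeType)) {
            throw new InvalidArgumentException("No OCR processor for $mimeType");
        }
        return self::MAPPING[$mimeType];
    }
}

$factory = new SketchProcessorFactory();
foreach (['image/png', 'text/plain'] as $mimeType) {
    echo $mimeType, ': ', $factory->canCreate($mimeType) ? 'supported' : 'skipped', "\n";
}
```

Routing `create` through `canCreate` keeps the supported-mimetype check in one place, so the two methods cannot drift apart.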
47 changes: 47 additions & 0 deletions lib/OcrProcessors/OcrProcessorResult.php
@@ -0,0 +1,47 @@
<?php

declare(strict_types=1);

/**
* @copyright Copyright (c) 2020 Robin Windey <[email protected]>
*
* @license GNU AGPL version 3 or any later version
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as
* published by the Free Software Foundation, either version 3 of the
* License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Affero General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/

namespace OCA\WorkflowOcr\OcrProcessors;

/**
* Represents a file which was processed via OCR.
*/
class OcrProcessorResult {
/** @var string */
private $fileContent;
/** @var string */
private $fileExtension;

public function __construct(string $fileContent, string $fileExtension) {
$this->fileContent = $fileContent;
$this->fileExtension = $fileExtension;
}

public function getFileContent(): string {
return $this->fileContent;
}

public function getFileExtension(): string {
return $this->fileExtension;
}
}
21 changes: 17 additions & 4 deletions lib/OcrProcessors/PdfOcrProcessor.php
@@ -28,6 +28,8 @@
use OCA\WorkflowOcr\Model\GlobalSettings;
use OCA\WorkflowOcr\Model\WorkflowSettings;
use OCA\WorkflowOcr\Wrapper\ICommand;
use OCA\WorkflowOcr\Wrapper\IImageToPdfConverter;
use OCP\Files\File;
use Psr\Log\LoggerInterface;

class PdfOcrProcessor implements IOcrProcessor {
@@ -48,20 +50,31 @@ class PdfOcrProcessor implements IOcrProcessor {
/** @var ICommand */
private $command;

/** @var IImageToPdfConverter */
private $converter;

/** @var LoggerInterface */
private $logger;

public function __construct(ICommand $command, LoggerInterface $logger) {
public function __construct(ICommand $command, IImageToPdfConverter $converter, LoggerInterface $logger) {
$this->command = $command;
$this->converter = $converter;
$this->logger = $logger;
}

public function ocrFile(string $fileContent, WorkflowSettings $settings, GlobalSettings $globalSettings): string {
public function ocrFile(File $file, WorkflowSettings $settings, GlobalSettings $globalSettings): OcrProcessorResult {
if ($file->getMimeType() !== 'application/pdf') {
// Convert file to pdf. Here we assume that we're dealing with an image input
$pdfContent = $this->converter->convertToPdf($file->getContent());
} else {
$pdfContent = $file->getContent();
}

$commandStr = 'ocrmypdf -q ' . $this->getCommandlineArgs($settings, $globalSettings) . ' - - | cat';

$this->command
->setCommand($commandStr)
->setStdIn($fileContent);
->setStdIn($pdfContent);

$this->logger->debug('Running command: ' . $commandStr);

@@ -90,7 +103,7 @@ public function ocrFile(string $fileContent, WorkflowSettings $settings, GlobalS

$this->logger->debug("OCR processing was successful");

return $ocrFileContent;
return new OcrProcessorResult($ocrFileContent, "pdf");
}

private function getCommandlineArgs(WorkflowSettings $settings, GlobalSettings $globalSettings): string {
4 changes: 3 additions & 1 deletion lib/Service/IOcrService.php
@@ -27,6 +27,8 @@
namespace OCA\WorkflowOcr\Service;

use OCA\WorkflowOcr\Model\WorkflowSettings;
use OCA\WorkflowOcr\OcrProcessors\OcrProcessorResult;
use OCP\Files\File;

interface IOcrService {
/**
@@ -38,5 +40,5 @@ interface IOcrService {
* @throws \OCA\WorkflowOcr\Exception\OcrNotPossibleException
* @throws \OCA\WorkflowOcr\Exception\OcrProcessorNotFoundException
*/
public function ocrFile(string $mimeType, string $fileContent, WorkflowSettings $settings) : string;
public function ocrFile(File $file, WorkflowSettings $settings) : OcrProcessorResult;
}