New converter for the hOCR format #650

Closed
richardpaulhudson opened this issue Jul 29, 2021 · 1 comment · Fixed by #651

Comments

@richardpaulhudson

When a business process extracts text from many kinds of PDF, those PDFs that contain text only in image form must be analysed with an OCR tool, which typically outputs hOCR. It would be good to have a PDFMiner converter that handles the PDFs that do carry explicit text information and uses that information to generate a basic hOCR representation. This output is designed to be used alongside the image of the PDF in the same way as genuine OCR output, but without the inevitable OCR errors.

I have already developed a solution and am about to submit a pull request.
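
To illustrate what a "basic hOCR representation" of explicit PDF text could look like, here is a minimal sketch using only the Python standard library. It is not the converter from the pull request. The function names `word_to_hocr` and `line_to_hocr` and the `(text, bbox)` input shape are hypothetical, chosen for this example. It assumes word boxes given in PDF points with a bottom-left origin, and flips the y-axis because hOCR bounding boxes are measured from the top-left corner of the page.

```python
from html import escape


def word_to_hocr(text, bbox, page_height):
    """Render one word as an hOCR ocrx_word span (hypothetical helper).

    bbox is (x0, y0, x1, y1) with a bottom-left origin, as in PDF space.
    hOCR expects a top-left origin, so the y values are flipped.
    """
    x0, y0, x1, y1 = bbox
    top = round(page_height - y1)
    bottom = round(page_height - y0)
    return "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>" % (
        round(x0), top, round(x1), bottom, escape(text),
    )


def line_to_hocr(words, page_height):
    """Wrap a list of (text, bbox) words in an ocr_line span.

    The line box is the union of the word boxes.
    """
    x0 = min(b[0] for _, b in words)
    y0 = min(b[1] for _, b in words)
    x1 = max(b[2] for _, b in words)
    y1 = max(b[3] for _, b in words)
    inner = " ".join(word_to_hocr(t, b, page_height) for t, b in words)
    return "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>" % (
        round(x0), round(page_height - y1),
        round(x1), round(page_height - y0), inner,
    )


if __name__ == "__main__":
    # Two words on a US Letter page (792 points high); values are made up.
    words = [("Hello", (72, 700, 100, 712)), ("world", (104, 700, 135, 712))]
    print(line_to_hocr(words, page_height=792))
```

Because the text comes from the PDF itself rather than from character recognition, the word content is exact; only the box coordinates need converting to match the page image.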

@pietermarsman
Member

Duplicate of #265

@pietermarsman pietermarsman marked this as a duplicate of #265 Aug 14, 2022
pietermarsman added a commit that referenced this issue Aug 14, 2022
* Add HOCRConverter

* Add line to README.md

* Test cicd

* Test cicd 2

* Changes based on review comments

* Remove whitespace changes to CHANGELOG.md

* Remove duplicated html output

* Add link to hocr wiki

* Add tests for extracting hocr and html

Co-authored-by: Pieter Marsman <[email protected]>