New converter for the hOCR format #650

richardpaulhudson · 2021-07-29T20:06:50Z

When a business process extracts text from many kinds of PDF, those PDFs that contain text only in image form have to be analysed with an OCR tool, which typically outputs hOCR. It would be good to have a PDFMiner converter that takes the explicit text information from the PDFs that do have it and generates a basic hOCR representation from it. That output could then be used together with the image of the PDF in the same way as genuine OCR output, but without the inevitable OCR errors.

I have already developed a solution and am about to submit a pull request.
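To illustrate the idea, here is a minimal sketch using only the standard library. It is not the code from the pull request: the function name `words_to_hocr` and its input shape (a list of `(text, (x0, y0, x1, y1))` tuples in PDF coordinates) are assumptions made for this example. It shows the one step that is easy to get wrong, namely that PDF coordinates have their origin at the bottom-left of the page while hOCR bounding boxes are measured from the top-left, so the y axis has to be flipped.

```python
from html import escape


def words_to_hocr(words, page_width, page_height):
    """Render (text, (x0, y0, x1, y1)) tuples as a minimal hOCR page element.

    Hypothetical helper for illustration only. Input boxes use PDF
    coordinates (origin bottom-left); hOCR bboxes use image coordinates
    (origin top-left), so the y values are flipped against the page height.
    """
    spans = []
    for text, (x0, y0, x1, y1) in words:
        top = round(page_height - y1)      # upper edge in image coordinates
        bottom = round(page_height - y0)   # lower edge in image coordinates
        spans.append(
            f"<span class='ocrx_word' "
            f"title='bbox {round(x0)} {top} {round(x1)} {bottom}'>"
            f"{escape(text)}</span>"
        )
    page_bbox = f"bbox 0 0 {round(page_width)} {round(page_height)}"
    return f"<div class='ocr_page' title='{page_bbox}'>{' '.join(spans)}</div>"


if __name__ == "__main__":
    # One word on a US Letter page (612 x 792 points).
    print(words_to_hocr([("Hello", (10, 700, 50, 712))], 612, 792))
```

A real converter would also emit the surrounding HTML document plus line and paragraph elements, and would take the word boxes from PDFMiner's layout analysis rather than from a hand-written list.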

pietermarsman · 2022-08-14T09:52:37Z

Duplicate of #265

* Add HOCRConverter
* Add line to README.md
* Test cicd
* Test cicd 2
* Changes based on review comments
* Remove whitespace changes to CHANGELOG.md
* Remove duplicated html output
* Add link to hocr wiki
* Add tests for extracting hocr and html

Co-authored-by: Pieter Marsman <[email protected]>

pietermarsman mentioned this issue Jan 25, 2022

Add HOCRConverter (fixes #650) #651

Merged

pietermarsman added the labels type: new feature, status: accepted, and component: interpreter (Related to PDFInterpreter) Aug 14, 2022

pietermarsman marked this as a duplicate of #265 Aug 14, 2022

pietermarsman closed this as completed in #651 Aug 14, 2022
