When a business process extracts text from many kinds of PDF, those PDFs whose text exists only in image form must be analysed with an OCR tool, which will typically output hOCR. It would be good to have a PDFMiner converter that extracts the explicit text information from the PDFs that do have it and uses it to generate a basic hOCR representation. That representation is designed to be used in conjunction with the image of the PDF in the same way as genuine OCR output would be, but without the inevitable OCR errors.
I have already developed a solution and am about to submit a pull request.
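To illustrate the idea, here is a minimal sketch of turning word boxes taken from a PDF text layer into basic hOCR markup. It is not the converter from the pull request: the function names `pdf_bbox_to_hocr` and `words_to_hocr` are hypothetical, the input is assumed to be already-extracted `(text, bbox)` pairs, and one PDF point is assumed to map to one image pixel. The main point it shows is that PDF coordinates have their origin at the bottom-left of the page, while hOCR `bbox` values are measured from the top-left, so the y values have to be flipped.

```python
from html import escape


def pdf_bbox_to_hocr(bbox, page_height):
    """Convert a PDF-space box (origin bottom-left) to an hOCR box (origin top-left).

    Assumes a 1:1 scale between PDF points and image pixels.
    """
    x0, y0, x1, y1 = bbox
    # The top edge in hOCR comes from the *upper* PDF y value (y1), and vice versa.
    return (round(x0), round(page_height - y1), round(x1), round(page_height - y0))


def words_to_hocr(words, page_width, page_height):
    """Build a basic hOCR page fragment.

    `words` is an iterable of (text, (x0, y0, x1, y1)) in PDF coordinates.
    """
    lines = [
        f"<div class='ocr_page' title='bbox 0 0 {round(page_width)} {round(page_height)}'>"
    ]
    for text, bbox in words:
        x0, top, x1, bottom = pdf_bbox_to_hocr(bbox, page_height)
        lines.append(
            f"<span class='ocrx_word' title='bbox {x0} {top} {x1} {bottom}'>"
            f"{escape(text)}</span>"
        )
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical word boxes on a US Letter page (612 x 792 points).
    sample = [("Hello", (72, 700, 120, 712)), ("world", (124, 700, 170, 712))]
    print(words_to_hocr(sample, 612, 792))
```

Because the text comes straight from the PDF rather than from image recognition, the words in the output are exact, and a consumer that already handles OCR-produced hOCR could treat both kinds of PDF the same way.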
* Add HOCRConverter
* Add line to README.md
* Test cicd
* Test cicd 2
* Changes based on review comments
* Remove whitespace changes to CHANGELOG.md
* Remove duplicated html output
* Add link to hocr wiki
* Add tests for extracting hocr and html
Co-authored-by: Pieter Marsman <[email protected]>