Add HOCRConverter (fixes #650) #651
Would be amazing if this could be merged and included! |
Looks good to me. I only wonder if this is something that should be added to pdfminer.six as core functionality. Alternatively, this could be something that everyone implements to their own liking. The composable API is perfectly suitable for adding functionality like this. I'll post this question on the Gitter. |
After some deliberation I'm positive on adding hOCR as an output format. It has two advantages: direct comparison of the output to OCR tools, and usage of other tools (e.g. visualization) built for hOCR. I'll do a more detailed review now. |
Thanks for the super nice PR!
Can you add tests showing that this works? Ideally you would use simple1.pdf for this.
This PR is already very good, but I like to use each change as an opportunity to improve pdfminer.six a bit. So I added some comments on how to improve this PR.
@richardpaulhudson I used this PR a bit for testing if the new CI pipeline is functioning properly. Now it is :) |
@richardpaulhudson any plans on working on this in the future? |
Hi @pietermarsman, thank you for the review and sorry for not responding sooner — I've changed employers in the meantime and there seem to be issues with where my GitHub notification mails are ending up. I hope to be able to pick up working on this in the next couple of months. |
FYI, I've changed this MR to merge into master. The develop branch will be removed, because soon we will work with version tags to indicate the releases and the distinction between develop and master becomes obsolete. |
bump ;) |
Sorry it's taken me so long to get back to this :-)
I can certainly see the need for some sort of regression test, but am unsure how to approach it. What I actually did myself was:
neither of which lends itself easily to a regression test. The options are:
|
I prefer option 1 (just checking that the code does not raise an error) or option 2 (checking for specific output). If you go for option 2, we do indeed need some output that we know is reasonably stable. Having a test with output (option 2) is also a start on some documentation, as other developers can easily see what the expected output of the tool is. |
@richardpaulhudson Thanks for all your work! |
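For illustration only, here is a minimal sketch of what those two test styles could look like. `convert_to_hocr` and `SAMPLE_WORDS` are hypothetical stand-ins invented for this sketch, not the converter from this PR; a real test would presumably run the converter on simple1.pdf instead.

```python
import io
import unittest


def convert_to_hocr(words, outfp):
    """Hypothetical stand-in for the converter under test (not pdfminer.six code)."""
    for text, (x0, y0, x1, y1) in words:
        outfp.write(
            "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>\n"
            % (x0, y0, x1, y1, text)
        )


# Made-up input; the bounding boxes are arbitrary illustration values.
SAMPLE_WORDS = [("Hello", (100, 100, 130, 112)), ("World", (135, 100, 170, 112))]


class TestHocrOutput(unittest.TestCase):
    def test_runs_without_error(self):
        # Option 1: smoke test, only checks that no exception is raised.
        convert_to_hocr(SAMPLE_WORDS, io.StringIO())

    def test_expected_output(self):
        # Option 2: pin down a fragment of output that is expected to stay stable.
        out = io.StringIO()
        convert_to_hocr(SAMPLE_WORDS, out)
        self.assertIn("title='bbox 100 100 130 112'>Hello</span>", out.getvalue())
```

Asserting on a small fragment rather than the whole document keeps option 2 from breaking on every cosmetic change to the output.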
Pull request
Fix #650
Fix #265
Where text is being extracted from a variety of types of PDF within a business process, those PDFs where the text is only present in image form will need to be analysed using an OCR tool, which will typically output hOCR. This converter extracts the explicit text information from those PDFs that do have it and uses it to generate a basic hOCR representation. That representation is designed to be used in conjunction with the image of the PDF in the same way as genuine OCR output would be, but without the inevitable OCR errors.
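As a rough illustration of the idea, and not the code in this PR, the sketch below turns one word and its PDF bounding box into an hOCR word element. PDF coordinates have their origin at the bottom-left of the page, while hOCR `bbox` values are measured from the top-left, so the y values are flipped. The helper names `pdf_bbox_to_hocr` and `hocr_word` are hypothetical, and the sketch assumes one PDF point maps to one image pixel.

```python
from html import escape


def pdf_bbox_to_hocr(bbox, page_height):
    """Flip a PDF bbox (origin bottom-left) into hOCR order (origin top-left).

    Assumption for this sketch: 1 PDF point == 1 image pixel, so no scaling.
    """
    x0, y0, x1, y1 = bbox
    return (round(x0), round(page_height - y1), round(x1), round(page_height - y0))


def hocr_word(text, bbox, page_height):
    """Render one word as an hOCR 'ocrx_word' span with a bbox title property."""
    x0, y0, x1, y1 = pdf_bbox_to_hocr(bbox, page_height)
    return "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>" % (
        x0, y0, x1, y1, escape(text),
    )


# A word 80 to 92 units below the top edge of a 792-unit-high page.
print(hocr_word("Hello", (100, 700, 150, 712), page_height=792))
```

If the page image is rendered at a different resolution, the coordinates would also need to be scaled by the ratio of image pixels to PDF points.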
How Has This Been Tested?
tox also runs with Python 3.8 and 3.9.
Checklist
works
version
is not necessary
verified that this is not necessary
CHANGELOG.md