Feature Request - PDF Parser #48
PDFs are a nightmare and there is no really good Python lib for PDFs that I've found. I've used Tika to parse PDFs. When it comes to papers, I think GROBID is the way to go. I think the best method is to create a custom one depending on what you want to parse: does it have a lot of tables, etc.? I saw another QA repo by a French or Italian group, I think, but I can't seem to find it. They had PDF parsing in the pipeline. Maybe their implementation would be a good start. If only I could find it...
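For reference, GROBID is usually queried over HTTP rather than imported as a library. A minimal sketch, assuming a GROBID server running locally on its default port 8070; `paper.pdf` is a placeholder path:

```python
import requests

# Minimal sketch, not production code: ask a locally running GROBID server
# to convert a scholarly PDF into structured TEI XML. Assumes GROBID is
# listening on its default port 8070; "paper.pdf" is a placeholder path.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def pdf_to_tei(pdf_path: str) -> str:
    """Send the PDF to GROBID and return the TEI XML it produces."""
    with open(pdf_path, "rb") as f:
        response = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(pdf_to_tei("paper.pdf")[:500])
```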
Okay, thanks for the feedback, it is appreciated. That QA repo with an existing pipeline would absolutely be a good start. I would appreciate it if you could dig deeper; maybe you will find it. Otherwise I will explore other options.
Hey @mmortazavi, |
I will give Apache Tika a go. I will stay in touch.
Hey @mmortazavi, |
@tholor Unfortunately I have not. Somehow I have not had the time yet. It seems the wrapper for Tika is quite straightforward to use; I may give it a shot this week. I will get back to you once I have tested it. Otherwise I am not able to test Haystack, since I only have PDFs.
@mmortazavi Great, thanks for the update. We will also evaluate different converter options in the next weeks and integrate one in Haystack. |
@tholor that would be great; I will be looking forward to it. I have also managed to spend some time testing the Tika wrapper. It was rather straightforward, and it seems Tika is a powerful OCR engine and is quite capable of parsing the PDFs (based on a manual comparison, nothing like a benchmark though); thanks for the introduction. So far so good. However, the problem I can see now is that I need to convert the PDFs to text and pass it to the `write_documents_to_db` function.
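For reference, this is roughly the kind of usage I tested with the tika-python wrapper (a minimal sketch; the file path is a placeholder):

```python
from tika import parser  # pip install tika; the wrapper runs a local Tika server (needs Java)

# Minimal sketch of the wrapper usage: extract plain text and metadata
# from a PDF. "sample.pdf" is a placeholder path.
parsed = parser.from_file("sample.pdf")
text = parsed.get("content") or ""
metadata = parsed.get("metadata", {})

print(metadata.get("Content-Type"))
print(text[:500])
```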
@mmortazavi to give a quick update on this topic: We see two major functionalities related to PDFs (or other file formats):
1. Conversion into plain, clean text
2. Alignment of search results to original file
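To make the distinction concrete, here is a hypothetical illustration of the two ideas (not the actual Haystack implementation): convert pages to clean text while recording per-page offsets, so that a hit in the concatenated text can be aligned back to its original page.

```python
from typing import List, Tuple

# Hypothetical illustration, not the actual Haystack implementation:
# 1. convert pages to clean text,
# 2. remember where each page starts in the concatenated text, so a
#    search result's character position can be mapped back to a page.

def build_fulltext(pages: List[str]) -> Tuple[str, List[int]]:
    """Concatenate cleaned page texts and record each page's start offset."""
    offsets, parts, pos = [], [], 0
    for page in pages:
        cleaned = " ".join(page.split())  # trivial whitespace normalisation
        offsets.append(pos)
        parts.append(cleaned)
        pos += len(cleaned) + 1  # +1 for the joining space
    return " ".join(parts), offsets

def page_of(char_pos: int, offsets: List[int]) -> int:
    """Map a character position in the full text to a 1-based page number."""
    page = 0
    for i, start in enumerate(offsets):
        if char_pos >= start:
            page = i
    return page + 1
```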
Any update regarding this? I hope somebody is working on this.
Hi @Utomo88, a PDF parsing pipeline is implemented in #109. It includes the extraction of text from single/multi-column pages and basic cleaning steps to remove numeric tables, headers, and footers. Regarding PDF 2.0: are you aware of any public-domain v2.0 PDFs that we could test with? The current implementation uses pdftotext, but its documentation makes no mention of PDF versions. (Closing this issue as it's resolved by #109, but feel free to reply here.)
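For anyone curious what such a pipeline looks like, here is a rough sketch along those lines (not the exact code from #109; it assumes pdftotext is installed): extract text page by page, then drop lines that repeat across most pages, which are typically running headers and footers.

```python
import subprocess
from collections import Counter
from typing import List

# Rough sketch of this kind of pipeline (not the exact code from #109):
# extract text page by page with pdftotext, then drop lines that repeat
# on most pages, which are usually running headers and footers.

def pdf_to_pages(pdf_path: str) -> List[str]:
    """Extract text with pdftotext; pages arrive separated by form feeds."""
    out = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, check=True, text=True,
    ).stdout
    return out.split("\f")

def drop_repeated_lines(pages: List[str], min_ratio: float = 0.5) -> List[str]:
    """Remove lines that appear on at least `min_ratio` of the pages."""
    page_counts = Counter(
        line
        for page in pages
        for line in {l.strip() for l in page.splitlines()}
        if line
    )
    threshold = max(2, int(len(pages) * min_ratio))
    return [
        "\n".join(l for l in page.splitlines() if page_counts[l.strip()] < threshold)
        for page in pages
    ]
```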
Hi @Utomo88, these are samples with very few phrases of text, which would not represent a real-world scenario well. It would be good to test with larger documents containing text, tables, and figures. Even better if we can find something permissively licensed, so we could add it to Haystack's automated test pipeline.
I am wondering if it is possible to use PDF files instead of text files when writing to the db. As far as I checked, there is no built-in capability in `write_documents_to_db` to handle it. Is it on the roadmap, or is it out of scope? Are there any suggestions for adding this feature to the pipeline (a robust Python library)?
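Until native PDF support is available, one possible workaround (a minimal sketch; directory names are placeholders, and it assumes the Tika wrapper discussed above) is to convert each PDF to a plain-text file first and then point the existing text ingestion, such as `write_documents_to_db`, at the resulting .txt files:

```python
from pathlib import Path
from tika import parser

# Workaround sketch (directory names are placeholders): convert each PDF
# to a plain-text file first, then feed the resulting .txt files to the
# usual text ingestion (e.g. write_documents_to_db).
pdf_dir = Path("data/pdfs")
txt_dir = Path("data/txt")
txt_dir.mkdir(parents=True, exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    content = parser.from_file(str(pdf)).get("content") or ""
    (txt_dir / f"{pdf.stem}.txt").write_text(content.strip(), encoding="utf-8")
```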