Feature Request - PDF Parser #48

Closed
mmortazavi opened this issue Mar 25, 2020 · 13 comments
Labels
type:feature New feature or request

Comments

@mmortazavi

I am wondering whether it is possible to use PDF files instead of text files when writing to the DB. As far as I can tell, write_documents_to_db has no built-in capability to handle them. Is this on the roadmap, or is it out of scope? Are there any suggestions for adding this feature to the pipeline (e.g. a robust Python library)?

@ViktorAlm

PDFs are a nightmare, and I haven't found a really good Python library for them. I've used Tika to parse PDFs. When it comes to papers, I think GROBID is the way to go. I think the best method is to write a custom parser depending on what you want to parse, e.g. whether the documents contain a lot of tables. I saw another QA repo, by a French or Italian group I think, that had PDF parsing in its pipeline, but I can't seem to find it. Their implementation might be a good start. If only I could find it...

@mmortazavi
Author

Okay, thanks for the feedback, it is appreciated. That QA repo with an existing pipeline would be a very good start. I would appreciate it if you could dig a bit deeper; maybe you will find it. Otherwise I will explore other options.

@tholor
Member

tholor commented Mar 25, 2020

Hey @mmortazavi,
yes, PDF conversion is definitely on our roadmap. As @ViktorAlm already mentioned, it's very tricky to find a good, generic solution here. There are a couple of Python libraries, but for industrial applications Apache Tika is probably the most common tool, and it also supports many other file types.
Maybe you could give this Python wrapper for Tika a try on your PDFs? It could then also be an interesting option to integrate into Haystack.
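
For anyone following along, trying the wrapper might look roughly like the sketch below. It assumes the `tika` package from PyPI, which needs a Java runtime and starts a local Tika server on first use. The helper names `extract_text` and `pdf_to_text` are made up for illustration and are not part of Haystack.

```python
def extract_text(parsed):
    """Return the extracted text from a Tika result dict, or "" if none."""
    # tika-python returns a dict; "content" can be None when no text was extracted.
    content = parsed.get("content") or ""
    return content.strip()


def pdf_to_text(path):
    """Convert one PDF to plain text via the tika-python wrapper."""
    # Imported lazily so extract_text() can be used without tika installed.
    from tika import parser  # pip install tika; requires Java

    return extract_text(parser.from_file(path))
```

Calling `pdf_to_text("my_file.pdf")` (a placeholder path) should return one plain string per document, which could then be written to a text file and indexed like any other text input.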

@mmortazavi
Author

I will give Apache Tika a go and stay in touch.

@tholor
Member

tholor commented May 4, 2020

Hey @mmortazavi,
Did you try Apache Tika? How did it go?

@mmortazavi
Author

@tholor Unfortunately I have not; I simply have not had the time yet. The wrapper for Tika seems quite straightforward to use, so I may give it a shot this week and will get back to you once I have tested it. Until then I am not able to test Haystack, since I only have PDFs.

@tholor added the type:feature (New feature or request) label May 6, 2020
@tholor
Member

tholor commented May 6, 2020

@mmortazavi Great, thanks for the update. We will also evaluate different converter options in the coming weeks and integrate one into Haystack.

@mmortazavi
Author

@tholor that would be great; I will be looking forward to it.

I have also managed to spend some time testing the Tika wrapper. It was rather straightforward, and it seems Tika is a powerful OCR engine and is quite capable of parsing the PDFs (based on a manual comparison, nothing like a benchmark though); thanks for the introduction. So far so good.

However, the problem I can see now is that I need to convert the PDFs to text and pass that to write_documents_to_db for indexing and querying, right? One issue here is that the PDFs contain tables, some pages have two columns and others a single column, and there are images with captions. Tika returns plain text in a rather messy format! The goal here, beyond Q&A, is to be able to retrieve and highlight the search results as they appear in the PDFs (with the filename as well as the page number). I am not sure these extra functionalities are what Haystack is meant for, but I would like to have them all for my use case. Such extra features are shown in this blog post using Azure Cognitive Search, in case it is not clear. Happy to hear your thoughts.
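
To make the filename and page number part concrete, here is a minimal sketch of what I have in mind. It assumes the converter marks page breaks with a form feed character (`pdftotext` does this; plain Tika text output may not), and the dict keys `name`, `page` and `text` are hypothetical, not Haystack's document schema.

```python
def pages_to_records(filename, text, page_sep="\f"):
    """Split converted text into one record per non-empty page."""
    records = []
    for page_no, page in enumerate(text.split(page_sep), start=1):
        page = page.strip()
        if page:  # skip blank pages and the empty chunk after a trailing separator
            records.append({"name": filename, "page": page_no, "text": page})
    return records
```

Indexing one record per page instead of one per file would let a search hit be reported together with the file and page it came from.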

@tholor
Member

tholor commented May 13, 2020

@mmortazavi to give a quick update on this topic:

We see two major functionalities related to PDFs (or other file formats):

1. Conversion into plain, clean text

  • Tools like Tika could do the job for plain conversion
  • We will try to support the most general cleaning steps in Haystack (e.g. removing headers and footers, handling PDFs with multiple columns of text)
  • Special cleaning steps related to custom PDFs will be the responsibility of the user
  • We expect to integrate this into Haystack in the coming weeks
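
As an illustration of the kind of general cleaning step meant here, header and footer removal could be sketched as below. This is only a heuristic under stated assumptions, not the planned Haystack implementation: the function name and the `min_share` threshold are made up, and it treats the first or last non-blank line of a page as a header or footer when it repeats (ignoring digits such as page numbers) on most pages.

```python
import re
from collections import Counter


def _key(line):
    # Normalise digits so "Page 1" and "Page 2" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that repeat on most pages; blank lines are dropped too."""
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]
    first = Counter(_key(lines[0]) for lines in split_pages if lines)
    last = Counter(_key(lines[-1]) for lines in split_pages if lines)
    # Require at least two occurrences so single-page documents are left alone.
    threshold = max(2, min_share * len(pages))
    headers = {k for k, n in first.items() if n >= threshold}
    footers = {k for k, n in last.items() if n >= threshold}
    cleaned = []
    for lines in split_pages:
        if lines and _key(lines[0]) in headers:
            lines = lines[1:]
        if lines and _key(lines[-1]) in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```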

2. Alignment of search results to original file

  • We definitely see the value of this feature as it would enable a) highlighting the answer in the original format and b) labeling in the original format
  • We are currently exploring the option of converting PDFs into HTML mimicking the original format (e.g. via https://github.com/pdf2htmlEX/pdf2htmlEX)
  • If the integration seems feasible, it would be great to include it in haystack. We don't have a timeline yet, but will keep you posted.
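
Independent of the pdf2htmlEX route, a much simpler form of alignment is mapping an answer's character offset back to a page number. The sketch below assumes the pages were joined with a single newline before indexing; both function names are hypothetical.

```python
from bisect import bisect_right


def build_page_index(pages):
    """Return the joined text plus the start offset of each page within it."""
    starts, pos = [], 0
    for page in pages:
        starts.append(pos)
        pos += len(page) + 1  # +1 for the "\n" used to join pages
    return "\n".join(pages), starts


def page_of_offset(starts, offset):
    """Return the 1-based page number that contains the character offset."""
    return bisect_right(starts, offset)
```

This only recovers the page, not the position on the page, so it would cover showing a page number next to an answer but not highlighting in the original layout.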

@Utomo88

Utomo88 commented Jun 14, 2020

Any update regarding this? I hope somebody is working on it.
Also, does https://github.com/pdf2htmlEX/pdf2htmlEX already support PDF 2.0?

@tanaysoni
Contributor

Hi @Utomo88, a PDF parsing pipeline is implemented in #109. It includes extraction of text from single- and multi-column pages, plus basic cleaning steps to remove numeric tables, headers, and footers.
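
To give a feel for what a numeric-table cleaning step can look like, here is a rough sketch. It is an illustrative heuristic rather than the code in #109: the function name and the `max_digit_share` threshold are assumptions, and it simply drops lines in which more than half of the non-space characters are digits.

```python
def drop_numeric_lines(text, max_digit_share=0.5):
    """Remove lines that consist mostly of digits (likely rows of a numeric table)."""
    kept = []
    for line in text.splitlines():
        chars = [c for c in line if not c.isspace()]
        digits = sum(c.isdigit() for c in chars)
        if chars and digits / len(chars) > max_digit_share:
            continue  # mostly numbers: treat as a table row and skip it
        kept.append(line)
    return "\n".join(kept)
```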

Regarding PDF 2.0, are you aware of any public domain v2.0 PDFs that we could test with? The current implementation uses pdftotext, but its documentation makes no mention of PDF versions.
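
If you want to try a v2.0 file against pdftotext yourself, a minimal sketch is below. It assumes the `pdftotext` binary (from poppler or xpdf) is on your PATH; the wrapper functions are hypothetical and not necessarily how #109 invokes the tool.

```python
import subprocess


def pdftotext_cmd(pdf_path, layout=True):
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [str(pdf_path), "-"]
    return cmd


def run_pdftotext(pdf_path, layout=True):
    """Run pdftotext and return a list of page texts (pages end with a form feed)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, layout), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace").split("\f")
```

With `check=True`, a file that pdftotext cannot read raises `subprocess.CalledProcessError`, which would be a quick signal that a given PDF version is not handled.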

(Closing this issue as it's resolved by #109 but feel free to reply here.)

@Utomo88

Utomo88 commented Jun 15, 2020

@tanaysoni
Contributor

Hi @Utomo88, these are samples with very few phrases of text, so they would not represent a real-world scenario well. It'd be good to test with larger documents containing text, tables, and figures. Even better if we find something permissively licensed, so we could add it to Haystack's automated test pipeline.


5 participants