Feature Request - PDF Parser #48

mmortazavi · 2020-03-25T08:39:42Z

I am wondering whether it is possible to use PDF files instead of text files when writing to the db. As far as I can tell, write_documents_to_db has no built-in capability to handle them. Is this on the roadmap, or is it out of scope? Are there any suggestions for adding this feature to the pipeline (e.g. a robust Python library)?

ViktorAlm · 2020-03-25T11:56:46Z

PDFs are a nightmare, and I haven't found a really good Python lib for them. I've used Tika to parse PDFs. When it comes to papers, I think GROBID is the way to go. I think the best method is to write a custom parser depending on what you want to parse, e.g. whether it has a lot of tables. I saw another QA repo, by a French or Italian group I think, but I can't seem to find it. They had PDF parsing in the pipeline. Maybe their implementation would be a good start. If only I could find it...

mmortazavi · 2020-03-25T14:36:31Z

Okay, thanks for the feedback, it is appreciated. That QA repo with an existing pipeline would absolutely be a good start. I would appreciate it if you could dig deeper; maybe you will find it. Otherwise I will explore other options.

tholor · 2020-03-25T15:54:31Z

Hey @mmortazavi,
yes, PDF conversion is definitely on our roadmap. As @ViktorAlm already mentioned, it's very tricky to find a good, generic solution here. There are a couple of Python libraries, but for industrial applications Apache Tika is probably the most common tool, and it also supports many other file types.
Maybe you could give this Python wrapper for Tika a try on your PDFs? It could then also be an interesting option to integrate into haystack.
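
For anyone who wants to try this, a minimal sketch of the Tika route might look like the following. It assumes the `tika` package from PyPI, which talks to a local Tika server and needs Java. `pdf_to_text` and its `parse` parameter are hypothetical names used only for illustration; they are not part of haystack.

```python
from typing import Callable, Optional


def pdf_to_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text of a PDF, or "" if nothing could be extracted."""
    if parse is None:
        # Imported lazily so the helper can be exercised without Tika installed.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # tika-python returns a dict; "content" can be None for empty or image-only files.
    content = parsed.get("content") or ""
    return content.strip()
```

The returned string could then be saved as a .txt file and indexed through the existing text-file path. Injecting `parse` is only there to make the helper easy to test with a stub.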

mmortazavi · 2020-03-26T09:39:41Z

I will give Apache Tika a go and stay in touch.

tholor · 2020-05-04T10:33:18Z

Hey @mmortazavi,
Did you try Apache Tika? How did it go?

mmortazavi · 2020-05-05T17:14:27Z

@tholor Unfortunately I have not; I have not had the time yet. The wrapper for Tika seems quite straightforward to use, so I may give it a shot this week. I will get back to you once I have tested it. Until then I am not able to test haystack, since I only have PDFs.

tholor · 2020-05-06T11:57:23Z

@mmortazavi Great, thanks for the update. We will also evaluate different converter options in the coming weeks and integrate one into Haystack.

mmortazavi · 2020-05-06T15:33:52Z

@tholor that would be great; I will be looking forward to it.

I have also managed to spend some time testing the Tika wrapper. It was rather straightforward, and Tika seems to be a powerful OCR engine that is quite capable of parsing the PDFs (based on a manual comparison, nothing like a benchmark); thanks for the introduction. So far so good.

However, the problem I see now is that I need to convert the PDFs to text and pass that to write_documents_to_db for indexing and querying, right? One issue is that the PDFs contain tables, pages that are sometimes two-column and sometimes single-column, and images with captions. Tika returns plain text in a rather messy format! The goal here, beyond Q&A, is to be able to retrieve and highlight the search results as they appear in the PDFs (with the filename as well as the page number). I am not sure these extra functionalities are what haystack is meant for, but I would like to have them all for my use case. In case it is not clear, such extra features are shown in this blogpost using Azure Cognitive Search. Happy to hear your thoughts.
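
To make the filename and page-number part concrete, here is a rough sketch of one possible approach. It assumes the converter can return the text page by page and that the QA step reports a character offset for each answer. `join_pages` and `locate` are hypothetical helpers, not haystack functions.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def join_pages(pages: List[str]) -> Tuple[str, List[int]]:
    """Concatenate page texts and record the character offset where each page starts."""
    starts, pos = [], 0
    for page in pages:
        starts.append(pos)
        pos += len(page) + 1  # +1 for the newline added between pages
    return "\n".join(pages), starts


def locate(offset: int, starts: List[int], filename: str) -> Dict[str, object]:
    """Map a character offset in the joined text back to a 1-based page number."""
    page = bisect_right(starts, offset)
    return {"filename": filename, "page": page}
```

Highlighting inside the original layout would still need position data from the PDF itself; this only recovers which page an answer came from.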

tholor · 2020-05-13T12:47:24Z

@mmortazavi to give a quick update on this topic:

We see two major functionalities related to PDFs (or other file formats):

1. Conversion into plain, clean text

Tools like Tika could do the job for plain conversion
We will try to support the most general cleaning steps in haystack (e.g. removal of header, footer, pdfs with multiple columns of text)
Special cleaning steps related to custom PDFs will be the responsibility of the user
We expect to integrate this into haystack in the next weeks

2. Alignment of search results to original file

We definitely see the value of this feature as it would enable a) highlighting the answer in the original format and b) labeling in the original format
We are currently exploring the option of converting PDFs into HTML mimicking the original format (e.g. via https://github.com/pdf2htmlEX/pdf2htmlEX)
If the integration seems feasible, it would be great to include it in haystack. We don't have a timeline yet, but will keep you posted.
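
As an illustration of the general cleaning steps under point 1, header and footer removal could be sketched roughly as below. This is only one possible heuristic, not the planned haystack implementation: it assumes per-page text is available and treats a first or last line as a header or footer when it repeats across most pages. `strip_repeated_edges` and `min_share` are hypothetical names.

```python
import re
from collections import Counter
from typing import List


def _key(line: str) -> str:
    # Mask digits so "Page 3" and "Page 4" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop first/last lines that repeat on at least min_share of the pages."""
    split = [p.strip().splitlines() for p in pages]
    first = Counter(_key(lines[0]) for lines in split if lines)
    last = Counter(_key(lines[-1]) for lines in split if lines)
    # Require at least two occurrences so single-page documents are left alone.
    threshold = max(2, min_share * len(pages))

    cleaned = []
    for lines in split:
        if lines and first[_key(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and last[_key(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

Multi-line headers or custom layouts would need more than this, which is where the user-specific cleaning steps mentioned above come in.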

Utomo88 · 2020-06-14T09:27:33Z

Any update regarding this? I hope somebody is working on it.
And does https://github.com/pdf2htmlEX/pdf2htmlEX already support PDF 2.0?

tanaysoni · 2020-06-15T08:37:54Z

Hi @Utomo88, a PDF parsing pipeline is implemented in #109. It includes the extraction of text from single/multi column pages, basic cleaning steps to remove numeric tables, headers, and footers.

Regarding PDF 2.0, are you aware of any public domain v2.0 PDFs that we could test with? The current implementation uses pdftotext, but its documentation makes no mention of PDF versions.

(Closing this issue as it's resolved by #109 but feel free to reply here.)
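
For readers who want to try pdftotext directly, a minimal sketch follows. It assumes the `pdftotext` command-line tool is installed and on the PATH, and it is not necessarily how #109 calls it. `pdftotext_pages` and `split_pages` are hypothetical names.

```python
import subprocess
from typing import List


def split_pages(raw: str) -> List[str]:
    """pdftotext ends each page with a form feed; split on it and drop the empty tail."""
    pages = raw.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def pdftotext_pages(path: str, layout: bool = True) -> List[str]:
    """Run the pdftotext CLI and return one string per page."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [path, "-"]  # "-" sends the text to stdout
    result = subprocess.run(cmd, capture_output=True, check=True)
    return split_pages(result.stdout.decode("utf-8", errors="replace"))
```

Whether `-layout` helps or hurts on multi-column pages depends on the document, so it is worth comparing both settings on a few sample PDFs.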

Utomo88 · 2020-06-15T08:44:50Z

Check this
https://github.com/pdf-association/pdf20examples
From datalogic
https://www.pdfa.org/pdf-2-0-examples-now-available/

tanaysoni · 2020-06-15T11:32:18Z

Hi @Utomo88, these are samples with very few phrases of text, which would not represent a real-world scenario well. It'd be good to test with larger documents containing text, tables, and figures. Even better if we find something permissively licensed, so we could add it to Haystack's automated test pipeline.

tholor added the type:feature New feature or request label May 6, 2020

tanaysoni closed this as completed Jun 15, 2020
