Indexing of files is not currently supported #453

Weilin37 · 2020-09-30T18:23:01Z

Describe the bug
I am trying to follow the tutorial notebook for DPR and replace the GOT text files with my own text files. My files are a simple batch of 2k text files, each comprising a title and an abstract (a paragraph or two long) separated by a newline.

Error message
In this section of the code:

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

I get an error message: Indexing of files is not currently supported:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-fd0550140fb8> in <module>
      5 
      6 # Convert files to dicts
----> 7 dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
      8 
      9 # Now, let's write the dicts containing documents to our DB.

~/Library/Python/3.8/lib/python/site-packages/haystack/preprocessor/utils.py in convert_files_to_dicts(dir_path, clean_func, split_paragraphs)
    101             text = document["text"]
    102         else:
--> 103             raise Exception(f"Indexing of {path.suffix} files is not currently supported.")
    104 
    105         if clean_func:

Exception: Indexing of  files is not currently supported.

Expected behavior
I was expecting the code to simply convert the files to dicts in the same way it does for the GoT text.

To Reproduce
https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb

System:

OS: MacOS
GPU/CPU: CPU
Haystack version (commit or version number): Latest pip install as of today

lalitpagaria · 2020-09-30T23:38:51Z

@Weilin37 You need to add a .txt suffix to your file names. Currently only files with a .txt or .pdf suffix are supported.
But I think this should be properly documented.

Ideally, file types would be identified by checking the file header, but that requires the libmagic library to be installed along with the python-magic package.

Weilin37 · 2020-10-01T03:10:00Z

Hi @lalitpagaria,

I checked my directory and indeed the ".txt" suffix is present.

Here is one of the file names: PMC7462872.txt

Which contains the following title and abstract text from public scientific literature:

Exacerbation of chronic inflammatory demyelinating polyneuropathy in concomitance with COVID-19

• A worsening of CIDP may occur in concomitance with COVID-19. • Cytokine hyperactivation triggered by SARS-CoV-2 might be a possible mechanism. • The management of these patients is particularly challenging.

I checked the other files which were auto generated in the same format and they all have ".txt" suffix as well. I just tried only putting in one of the "txt" files to test if one file would work but it gave me the same error.

Could it be possibly due to something that's inside the file?

lalitpagaria · 2020-10-01T08:34:01Z

@Weilin37 I am able to reproduce this issue. It happens when the OS creates a few internal files with extensions that the converter does not support.
@tholor @tanaysoni It would be better to skip unsupported files instead of throwing an exception. I can create a PR if this approach is okay with you.
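
One way to check a document directory for such files is to list every file whose suffix falls outside the supported set. The following is a minimal sketch using only the standard library. The supported suffixes are the two named earlier in this thread, and the function name `find_unsupported_files` is hypothetical. Note that `pathlib` reports an empty suffix for dot-files such as macOS's `.DS_Store`, which would be consistent with the blank in "Indexing of  files is not currently supported"; whether that was the exact file in this case is an assumption.

```python
from pathlib import Path

# Assumption: the suffixes named earlier in this thread as supported.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def find_unsupported_files(dir_path):
    """Return files under dir_path whose suffix is not in SUPPORTED_SUFFIXES."""
    unsupported = []
    # Path.glob also matches hidden (dot-prefixed) files.
    for path in sorted(Path(dir_path).glob("**/*")):
        if path.is_file() and path.suffix.lower() not in SUPPORTED_SUFFIXES:
            unsupported.append(path)
    return unsupported
```

Calling `find_unsupported_files(doc_dir)` and printing `repr(p.suffix)` for each result makes an empty suffix visible as `''`.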

tholor · 2020-10-01T08:47:05Z

@lalitpagaria A PR would be great. Thx!
How about we skip unsupported files and log a warning like this: Skipped file "XYZ" as type ".XYZ" is not supported here. See haystack.file_converter for support of more file types.
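
The skip-and-warn behaviour proposed here could look roughly like the sketch below. This is an illustration only, not the code that was merged in #456: the helper name `iter_supported_files` is hypothetical, and the supported-suffix set is assumed from the earlier comment about .txt and .pdf.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Assumption: only these suffixes are handled, as stated earlier in the thread.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def iter_supported_files(dir_path):
    """Yield supported files under dir_path and log a warning for each skipped file."""
    for path in sorted(Path(dir_path).glob("**/*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            # Skip instead of raising, so one stray file cannot abort the whole run.
            logger.warning(
                'Skipped file "%s" as type "%s" is not supported here. '
                "See haystack.file_converter for support of more file types",
                path,
                path.suffix,
            )
            continue
        yield path
```

A converter loop would then iterate over `iter_supported_files(doc_dir)` rather than over every path in the directory.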

Weilin37 · 2020-10-01T13:11:40Z

@lalitpagaria

Thanks! Just to be thorough: I generated the txt files in Python and then manually moved them to a new folder, so I'm not sure any hidden files are there, but we'll see!

lalitpagaria · 2020-10-01T13:13:48Z

@Weilin37 Can you please try the latest changes now? Let us know whether the merged fix works in your case or whether you hit some other issue.

Weilin37 · 2020-10-01T13:31:59Z

@lalitpagaria

it works!

tholor · 2020-10-01T14:25:59Z

Awesome! Great to see the community helping each other :)
Closing this as it's fixed by #456

Weilin37 · 2020-11-05T14:43:11Z

@lalitpagaria With 0.4.9 this problem appears again.

When I upgrade from master (0.5.0), I get a different issue with the following code:

from haystack.retriever.dense import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  max_seq_len=256,
                                  batch_size=16,
                                  remove_sep_tok_from_untitled_passages=True)
# Important: 
# Now that we have the DPR initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f8eb00f5564a> in <module>
      1 from haystack.retriever.dense import DensePassageRetriever
----> 2 retriever = DensePassageRetriever(document_store=document_store,
      3                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
      4                                   passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
      5                                   use_gpu=True,

TypeError: __init__() got an unexpected keyword argument 'max_seq_len'

lalitpagaria · 2020-11-05T16:11:51Z

@Weilin37 The DPR issue on 0.5.0 is caused by the changes in #527.
Can you please replace max_seq_len with max_seq_len_passage in your script?

Regarding your original issue, I am not sure why you are getting that, as Haystack already has a test to catch it. Can you please try the change suggested above on 0.5.0 and then share the stack trace or error?
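
For scripts that have to run against more than one installed version, the constructor's signature can be inspected before choosing the keyword. The sketch below is a generic illustration built on the standard `inspect` module. The two retriever classes are stand-ins invented for this example, not Haystack classes, and the helper names `accepts_kwarg` and `build` are hypothetical.

```python
import inspect


def accepts_kwarg(cls, name):
    """Return True if cls.__init__ declares a parameter called name or takes **kwargs."""
    params = inspect.signature(cls.__init__).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-in classes for illustration only; they are not Haystack classes.
class OldStyleRetriever:
    def __init__(self, max_seq_len=256):
        self.max_seq_len = max_seq_len


class NewStyleRetriever:
    def __init__(self, max_seq_len_passage=256):
        self.max_seq_len_passage = max_seq_len_passage


def build(cls, passage_len):
    """Pass the passage length under whichever keyword the class declares."""
    if accepts_kwarg(cls, "max_seq_len_passage"):
        key = "max_seq_len_passage"
    else:
        key = "max_seq_len"
    return cls(**{key: passage_len})
```

With the real class, `accepts_kwarg(DensePassageRetriever, "max_seq_len_passage")` would show which of the two names the installed version expects.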

tholor · 2020-11-05T16:17:18Z

@Weilin37 I guess with 0.5.0 you refer to the FARM version? Please always make sure that your Haystack and FARM versions are compatible. For example, with the latest Haystack, we expect FARM 0.5.0 (as specified in requirements.txt).

As @lalitpagaria already mentioned, the signature of DPR has changed in #527. You can also see an updated example reflecting the changes in our Tutorial 6.
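
A quick way to see which versions are actually installed is to query the package metadata. This is a minimal sketch using `importlib.metadata` (Python 3.8+). The distribution names `farm-haystack` and `farm` are assumptions about how the two packages are published on PyPI, and the helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # available from Python 3.8


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: Haystack and FARM are installed under these distribution names.
for name in ("farm-haystack", "farm"):
    print(name, installed_version(name))
```

The printed FARM version can then be compared against the one pinned in Haystack's requirements.txt.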

Weilin37 · 2020-11-05T17:56:09Z

@tholor Thanks, changing from max_seq_len to max_seq_len_passage worked!

Weilin37 added the type:bug Something isn't working label Sep 30, 2020

lalitpagaria mentioned this issue Oct 1, 2020

Skip file conversion if file type is not supported #456

Merged

tholor pushed a commit that referenced this issue Oct 1, 2020

Skip file conversion if file type is not supported (#456)

9b58374

* Skip file converter if file type is not supported. Refer #453
* Fixing issue reported by mypy
* Addressing review comments

tholor closed this as completed Oct 1, 2020
