Indexing of files is not currently supported #453

Closed
Weilin37 opened this issue Sep 30, 2020 · 12 comments
Labels
type:bug Something isn't working

Comments

@Weilin37

Weilin37 commented Sep 30, 2020

Describe the bug
I am trying to follow the DPR tutorial notebook, replacing the GoT text files with my own text files. My data is a simple batch of 2k text files, each consisting of a title and an abstract (a paragraph or two long) separated by a newline.

Error message
In this section of the code:

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

I get the error message "Indexing of files is not currently supported":

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-fd0550140fb8> in <module>
      5 
      6 # Convert files to dicts
----> 7 dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
      8 
      9 # Now, let's write the dicts containing documents to our DB.

~/Library/Python/3.8/lib/python/site-packages/haystack/preprocessor/utils.py in convert_files_to_dicts(dir_path, clean_func, split_paragraphs)
    101             text = document["text"]
    102         else:
--> 103             raise Exception(f"Indexing of {path.suffix} files is not currently supported.")
    104 
    105         if clean_func:

Exception: Indexing of  files is not currently supported.

Expected behavior
I expected the code to convert the files to dicts, just as it does for the GoT text files.

To Reproduce
https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb

System:

  • OS: MacOS
  • GPU/CPU: CPU
  • Haystack version (commit or version number): Latest pip install as of today
@Weilin37 Weilin37 added the type:bug Something isn't working label Sep 30, 2020
@lalitpagaria
Contributor

lalitpagaria commented Sep 30, 2020

@Weilin37 Your file names need a .txt suffix. Currently only files with a .txt or .pdf suffix are supported.
That said, I think this should be properly documented.

Ideally, the file type would be identified by checking the file header, but that requires the libmagic library to be installed along with the python-magic package.
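
To illustrate the header-based idea without the libmagic dependency, here is a minimal standard-library sketch. It is not python-magic and not Haystack code: `sniff_file_type` is a hypothetical helper that only recognises the PDF header (`%PDF-`) and otherwise falls back to "does the start of the file decode as UTF-8 text".

```python
import codecs
from pathlib import Path

# Leading bytes of a PDF file. Plain text has no signature,
# so it is detected by a decoding fallback instead.
PDF_MAGIC = b"%PDF-"


def sniff_file_type(path: Path) -> str:
    """Guess 'pdf', 'txt', or 'unknown' from file content, ignoring the suffix.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as handle:
        head = handle.read(1024)
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if b"\x00" in head:
        # NUL bytes are a common sign of binary data.
        return "unknown"
    try:
        # final=False tolerates a multi-byte character cut off at the 1024-byte boundary.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "unknown"
    return "txt"
```

A real implementation would need far more signatures, which is exactly what libmagic provides.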

@Weilin37
Author

Weilin37 commented Oct 1, 2020

Hi @lalitpagaria,

I checked my directory and indeed the ".txt" suffix is present.

Here is one of the file names: PMC7462872.txt

Which contains the following title and abstract text from public scientific literature:

Exacerbation of chronic inflammatory demyelinating polyneuropathy in concomitance with COVID-19

• A worsening of CIDP may occur in concomitance with COVID-19. • Cytokine hyperactivation triggered by SARS-CoV-2 might be a possible mechanism. • The management of these patients is particularly challenging.

I checked the other files, which were auto-generated in the same format, and they all have the ".txt" suffix as well. I also tried using only one of the ".txt" files to test whether a single file would work, but it gave me the same error.

Could it be due to something inside the file?

@lalitpagaria
Contributor

@Weilin37 I am able to reproduce this issue. It happens when the OS creates a few internal files with extensions that the converter does not support.
@tholor @tanaysoni It would be better to skip unsupported files instead of throwing an exception. I can create a PR if this approach is okay with you.
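
One way to check a directory for such files is sketched below. The helper name `find_unsupported_files` and the recursive scan are assumptions made for illustration, not Haystack code, and the macOS `.DS_Store` file is only a guess at what the OS created here. Note that `pathlib` reports an empty suffix for a dot-file like that, which would match the blank in "Indexing of  files is not currently supported."

```python
from pathlib import Path
from typing import List

# Suffixes named as supported earlier in this thread.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def find_unsupported_files(dir_path: str) -> List[Path]:
    """List every file under dir_path whose suffix is not supported.

    Hypothetical diagnostic helper; it scans recursively, which is an assumption.
    """
    return sorted(
        path
        for path in Path(dir_path).rglob("*")
        if path.is_file() and path.suffix.lower() not in SUPPORTED_SUFFIXES
    )


# A leading dot is not treated as a suffix separator, so hidden files
# such as ".DS_Store" have an empty suffix.
print(repr(Path(".DS_Store").suffix))
```

Running `find_unsupported_files(doc_dir)` before the conversion step shows which files would trigger the exception, including hidden ones that a file browser does not display.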

@tholor
Member

tholor commented Oct 1, 2020

@lalitpagaria A PR would be great. Thx!
How about we skip unsupported files and log a warning such as: 'Skipped file "XYZ" as type ".XYZ" is not supported here. See haystack.file_converter for support of more file types'?
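
The skip-and-warn behaviour proposed here could look roughly like the sketch below. This is not the code merged for this issue: `convert_txt_files_to_dicts` is a hypothetical, simplified stand-in that only handles .txt files, and the `{"text": ..., "meta": {"name": ...}}` layout is an assumption.

```python
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def convert_txt_files_to_dicts(dir_path: str) -> List[Dict[str, Any]]:
    """Read every .txt file under dir_path; skip anything else with a warning.

    Hypothetical, simplified stand-in for illustration only.
    """
    documents = []
    for path in sorted(Path(dir_path).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() != ".txt":
            # Log and move on instead of raising, so one stray file
            # does not abort the whole indexing run.
            logger.warning(
                'Skipped file "%s" as type "%s" is not supported here.',
                path.name,
                path.suffix,
            )
            continue
        text = path.read_text(encoding="utf-8")
        documents.append({"text": text, "meta": {"name": path.name}})
    return documents
```

The design choice is that a single unexpected file degrades to a log line rather than an exception, while the warning still tells the user which file was ignored.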

tholor pushed a commit that referenced this issue Oct 1, 2020
* Skip file converter if file type is not supported. Refer #453

* Fixing issue reported by mypy

* Addressing review comments
@Weilin37
Author

Weilin37 commented Oct 1, 2020

@lalitpagaria

Thanks! Just to be thorough: I generated the txt files in Python and then manually moved them to a new folder, so I'm not entirely sure any hidden files are there, but we'll see!

@lalitpagaria
Contributor

@Weilin37 Can you please try the latest changes now? Let us know whether the merged fix works in your case or whether you run into some other issue.

@Weilin37
Author

Weilin37 commented Oct 1, 2020

@lalitpagaria

it works!

@tholor
Member

tholor commented Oct 1, 2020

Awesome! Great to see the community helping each other :)
Closing this as it's fixed by #456

@tholor tholor closed this as completed Oct 1, 2020
@Weilin37
Author

Weilin37 commented Nov 5, 2020

@lalitpagaria With 0.4.9 this problem appears again.

When I upgrade from master (0.5.0), I get a different issue with the following code:

from haystack.retriever.dense import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  max_seq_len=256,
                                  batch_size=16,
                                  remove_sep_tok_from_untitled_passages=True)
# Important: 
# Now that we have the DPR initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f8eb00f5564a> in <module>
      1 from haystack.retriever.dense import DensePassageRetriever
----> 2 retriever = DensePassageRetriever(document_store=document_store,
      3                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
      4                                   passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
      5                                   use_gpu=True,

TypeError: __init__() got an unexpected keyword argument 'max_seq_len'

@lalitpagaria
Contributor

@Weilin37 The DPR issue on 0.5.0 is caused by the changes in #527.
Can you please replace max_seq_len with max_seq_len_passage in your script?

Regarding your original issue, I am not sure why you are getting it, as Haystack already has a test to catch it. Can you please try the change suggested above on 0.5.0 and then share the stack trace or error?
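
The rename can be sketched without importing Haystack by treating the constructor arguments as a plain dict. `migrate_dpr_kwargs` is a hypothetical helper written for this illustration; the argument values are the ones from the failing call above, and the final constructor call is left as a comment because it needs an existing `document_store`.

```python
from typing import Any, Dict


def migrate_dpr_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with max_seq_len renamed to max_seq_len_passage.

    Hypothetical helper for illustration only.
    """
    migrated = dict(kwargs)
    if "max_seq_len" in migrated:
        value = migrated.pop("max_seq_len")
        # Keep an explicitly provided new-style value if one is already present.
        migrated.setdefault("max_seq_len_passage", value)
    return migrated


old_kwargs = {
    "query_embedding_model": "facebook/dpr-question_encoder-single-nq-base",
    "passage_embedding_model": "facebook/dpr-ctx_encoder-single-nq-base",
    "use_gpu": True,
    "embed_title": True,
    "max_seq_len": 256,
    "batch_size": 16,
    "remove_sep_tok_from_untitled_passages": True,
}
new_kwargs = migrate_dpr_kwargs(old_kwargs)

# With Haystack installed and a document store available, the call would then be:
# retriever = DensePassageRetriever(document_store=document_store, **new_kwargs)
```

For a single script, editing the keyword by hand is simpler; a helper like this is only useful if the same arguments are shared across several notebooks.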

@tholor
Member

tholor commented Nov 5, 2020

@Weilin37 I guess with 0.5.0 you refer to the FARM version? Please always make sure that your Haystack and FARM versions are compatible. For example, with the latest Haystack, we expect FARM 0.5.0 (as specified in requirements.txt).

As @lalitpagaria already mentioned, the signature of DPR has changed in #527. You can also see an updated example reflecting the changes in our Tutorial 6.
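
A quick way to see which versions are actually installed is sketched below, using the standard library's `importlib.metadata` (Python 3.8+). The distribution names `farm-haystack` and `farm` are assumptions about how the two packages are published; adjust them if your environment uses different names.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; see the note above.
for name in ("farm-haystack", "farm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Comparing the printed FARM version against the one pinned in Haystack's requirements.txt shows whether the two are out of step.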

@Weilin37
Author

Weilin37 commented Nov 5, 2020

@tholor Thanks, changing max_seq_len to max_seq_len_passage worked!


3 participants