Faiss document store creates duplicate vector_ids #392

Closed · lalitpagaria opened this issue Sep 17, 2020 · 4 comments
Labels: type:bug Something isn't working

@lalitpagaria (Contributor)

Describe the bug
The FAISS document store generates duplicate vector_ids when the number of documents passed to write_documents or update_embeddings exceeds the configured index_buffer_size.

Error message
No error message

Expected behavior
All written documents should have unique vector_ids

Additional context
Using the enumerate index as vector_id causes this issue: enumerate restarts at 0 for every buffer slice, so documents in different slices receive the same ids.

for vector_id, doc in enumerate(document_objects[i: i + self.index_buffer_size]):

My PR will fix this issue, because it refactors the vector_id generation logic.
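For illustration, a minimal sketch of the kind of fix (not the actual PR): keep a running offset so ids stay unique across buffer slices. existing_count, the number of vectors already in the index, is a hypothetical variable here.

# Sketch only: offset the enumeration so ids no longer restart at 0 per slice.
for i in range(0, len(document_objects), self.index_buffer_size):
    for vector_id, doc in enumerate(document_objects[i: i + self.index_buffer_size],
                                    start=existing_count + i):
        doc.meta["vector_id"] = vector_id  # unique across all slices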

To Reproduce
The following test reproduces the issue:

import numpy as np
from haystack.document_store.faiss import FAISSDocumentStore

documents = [
    {"name": "name_1", "text": "text_1", "embedding": np.random.rand(768).astype(np.float32)},
    {"name": "name_2", "text": "text_2", "embedding": np.random.rand(768).astype(np.float32)},
    {"name": "name_3", "text": "text_3", "embedding": np.random.rand(768).astype(np.float32)},
]

document_store = FAISSDocumentStore(sql_url="sqlite:///haystack_test_faiss.db", index_buffer_size=len(documents) - 1)
document_store.delete_all_documents()
document_store.write_documents(documents)
documents_indexed = document_store.get_all_documents()

# test if the number of documents is correct
assert len(documents_indexed) == len(documents)

# test that no two docs have the same vector_id assigned
vector_ids = set()
for doc in documents_indexed:
    vector_ids.add(doc.meta["vector_id"])
assert len(vector_ids) == len(documents)

System:
All systems

@lalitpagaria lalitpagaria added the type:bug Something isn't working label Sep 17, 2020
@lalitpagaria (Contributor, Author)

It will be fixed by PR #385.

@tholor (Member) commented Sep 18, 2020

Hey @lalitpagaria,

Good catch! I can reproduce this bug. I saw #385, but I think for this bug we can have a simpler solution. I will therefore create a separate PR to not mix it with the bigger changes of #385. We will need a more careful review and discussion there.

@tholor (Member) commented Sep 21, 2020

Fixed by #395

@tholor tholor closed this as completed Sep 21, 2020
@fingoldo (Contributor)

I'm not sure if this is related. I'll try specifying index_buffer_size. My version is very fresh, installed from source.

09/25/2021 03:33:20 - INFO - haystack.document_store.faiss - Updating embeddings for 1938372 docs...
Inferencing Samples: 100%|█████████████████████████████████████████████████████| 250/250 [00:25<00:00, 9.74 Batches/s]
Updating Embedding: 0%| | 0/1938372 [03:37<?, ? docs/s]
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1800, in _execute_context
    cursor, statement, parameters, context
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlite3.IntegrityError: UNIQUE constraint failed: document.vector_id

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "gen_embeddings.py", line 19, in <module>
    FAISS_store.update_embeddings(retriever, update_existing_embeddings=False, batch_size=1_000)
  File "c:\users\administrator\haystack\haystack\document_store\faiss.py", line 290, in update_embeddings
    self.update_vector_ids(vector_id_map, index=index)
  File "c:\users\administrator\haystack\haystack\document_store\sql.py", line 376, in update_vector_ids
    }, synchronize_session=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\orm\query.py", line 3224, in update
    execution_options={"synchronize_session": synchronize_session},
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\orm\session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1611, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 324, in _execute_on_connection
    self, multiparams, params, execution_options
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1488, in _execute_clauseelement
    cache_hit=cache_hit,
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1843, in _execute_context
    e, statement, parameters, cursor, context
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 2024, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 207, in raise_
    raise exception
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1800, in _execute_context
    cursor, statement, parameters, context
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: document.vector_id
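
If anyone hits the same IntegrityError, a quick sanity check (a hypothetical diagnostic, not part of Haystack) is to query the SQL store directly for pre-existing duplicate vector_ids; the document table and vector_id column come straight from the error message:

import sqlite3

# Adjust the path to match your sql_url; this is the file from the repro above.
conn = sqlite3.connect("haystack_test_faiss.db")
dupes = conn.execute(
    "SELECT vector_id, COUNT(*) FROM document "
    "WHERE vector_id IS NOT NULL "
    "GROUP BY vector_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # any rows here mean duplicate vector_ids are already stored
conn.close()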
