Faiss document store creates duplicate vector_ids #392

Closed · lalitpagaria opened this issue Sep 17, 2020 · 4 comments
Labels: type:bug Something isn't working

@lalitpagaria (Contributor)

Describe the bug
The FAISS document store generates duplicate vector_ids when the number of documents passed to write_documents or update_embeddings exceeds the configured index_buffer_size.

Error message
No error message

Expected behavior
All written documents should have unique vector_ids

Additional context
Using the enumerate index as vector_id causes this issue: enumerate restarts at 0 for every buffer slice, so documents in different slices receive the same ids.

for vector_id, doc in enumerate(document_objects[i: i + self.index_buffer_size]):

My PR will fix this issue, because it refactors the vector_id generation logic.
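For illustration, a minimal sketch of the kind of fix (not the actual PR): keep a running offset so ids stay unique across buffer slices. existing_count, the number of vectors already in the index, is a hypothetical variable here.

# Sketch only: offset the enumeration so ids no longer restart at 0 per slice.
for i in range(0, len(document_objects), self.index_buffer_size):
    for vector_id, doc in enumerate(document_objects[i: i + self.index_buffer_size],
                                    start=existing_count + i):
        doc.meta["vector_id"] = vector_id  # unique across all slices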

To Reproduce
The following test reproduces the issue:

import numpy as np
from haystack.document_store.faiss import FAISSDocumentStore

documents = [
    {"name": "name_1", "text": "text_1", "embedding": np.random.rand(768).astype(np.float32)},
    {"name": "name_2", "text": "text_2", "embedding": np.random.rand(768).astype(np.float32)},
    {"name": "name_3", "text": "text_3", "embedding": np.random.rand(768).astype(np.float32)},
]

document_store = FAISSDocumentStore(sql_url="sqlite:///haystack_test_faiss.db", index_buffer_size=len(documents) - 1)
document_store.delete_all_documents()
document_store.write_documents(documents)
documents_indexed = document_store.get_all_documents()

# test if the number of documents is correct
assert len(documents_indexed) == len(documents)

# test that no two docs have the same vector_id assigned
vector_ids = set()
for doc in documents_indexed:
    vector_ids.add(doc.meta["vector_id"])
assert len(vector_ids) == len(documents)

System:
All systems

@lalitpagaria lalitpagaria added the type:bug Something isn't working label Sep 17, 2020
@lalitpagaria (Contributor, Author)

It will be fixed by PR #385.

@tholor (Member) commented Sep 18, 2020

Hey @lalitpagaria,

Good catch! I can reproduce this bug. I saw #385, but I think for this bug we can have a simpler solution. I will therefore create a separate PR to not mix it with the bigger changes of #385. We will need a more careful review and discussion there.

@tholor (Member) commented Sep 21, 2020

Fixed by #395

@tholor tholor closed this as completed Sep 21, 2020
@fingoldo (Contributor)

I'm not sure if this is related. I'll try specifying index_buffer_size. My version is very fresh, installed from source.

09/25/2021 03:33:20 - INFO - haystack.document_store.faiss - Updating embeddings for 1938372 docs...
Inferencing Samples: 100%|█████████████████████████████████████████████████████| 250/250 [00:25<00:00, 9.74 Batches/s]
Updating Embedding: 0%| | 0/1938372 [03:37<?, ? docs/s]
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1800, in _execute_context
    cursor, statement, parameters, context
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlite3.IntegrityError: UNIQUE constraint failed: document.vector_id

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "gen_embeddings.py", line 19, in <module>
    FAISS_store.update_embeddings(retriever, update_existing_embeddings=False, batch_size=1_000)
  File "c:\users\administrator\haystack\haystack\document_store\faiss.py", line 290, in update_embeddings
    self.update_vector_ids(vector_id_map, index=index)
  File "c:\users\administrator\haystack\haystack\document_store\sql.py", line 376, in update_vector_ids
    }, synchronize_session=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\orm\query.py", line 3224, in update
    execution_options={"synchronize_session": synchronize_session},
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\orm\session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1611, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 324, in _execute_on_connection
    self, multiparams, params, execution_options
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1488, in _execute_clauseelement
    cache_hit=cache_hit,
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1843, in _execute_context
    e, statement, parameters, cursor, context
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 2024, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 207, in raise_
    raise exception
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1800, in _execute_context
    cursor, statement, parameters, context
  File "C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: document.vector_id
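
If anyone hits the same IntegrityError, a quick sanity check (a hypothetical diagnostic, not part of Haystack) is to query the SQL store directly for pre-existing duplicate vector_ids; the document table and vector_id column come straight from the error message:

import sqlite3

# Adjust the path to match your sql_url; this is the file from the repro above.
conn = sqlite3.connect("haystack_test_faiss.db")
dupes = conn.execute(
    "SELECT vector_id, COUNT(*) FROM document "
    "WHERE vector_id IS NOT NULL "
    "GROUP BY vector_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # any rows here mean duplicate vector_ids are already stored
conn.close()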
