Remove phi normalization from FAISS & support more index types #467
Conversation
```diff
-    def _create_new_index(self, vector_size: int, index_factory: str = "HNSW4"):
-        index = faiss.index_factory(vector_size + 1, index_factory)
+    def _create_new_index(self, vector_dim: int, index_factory: str = "Flat", metric_type=faiss.METRIC_INNER_PRODUCT, **kwargs):
+        if index_factory == "HNSW" and metric_type == faiss.METRIC_INNER_PRODUCT:
```
Sorry for adding this comment only now, but adding it here for more context: I think it is better to limit FAISS-related internals in this class. Users of this class already have the freedom to pass a self-configured index via the `faiss_index` and `faiss_index_factory_str` constructor parameters. Keeping the internals out will keep this class clean.

Another point: how do we prevent the `index.add()` call from breaking? With the `faiss_index_factory_str="IDMap,Flat"` parameter, the vector `add` call will fail. This was addressed in #385 as follows:
```python
total_indices = self.size()
ids = np.arange(total_indices, total_indices + len(embeddings), 1, dtype=np.int64)
if isinstance(self.faiss_index, IndexIDMap):
    self.faiss_index.add_with_ids(vectors_to_add, ids)
else:
    self.faiss_index.add(vectors_to_add)
```
What is your view on this comment?
Yes, normally I would agree that we should limit this to FAISS internals. However, HNSW Flat with IP does not give the same results when initialized via the index factory. The performance is totally off. See facebookresearch/faiss#1434

Therefore, we need special-case handling here anyway and do a direct init without the factory. I then took the chance and added "good" defaults for the other params. I believe most users will be overwhelmed by deciding on a proper index type for their use case, and most will use either just "Flat" or our preconfigured "HNSW" variant. We will also benchmark those two and try to tune the default HNSW config to fit the most common use cases.
Regarding the IDMap. What would be the use case here? I think the FAISS id handling is rather something internal that is not interesting for most haystack users. Therefore, I don't think we need to support it, but maybe I am missing a good use case here 🤔 ?
> Yes, normally I would agree that we should limit it to FAISS internals. However, HNSW Flat with IP is not giving the same results when initialized via the index factory. The performance is totally off. See facebookresearch/faiss#1434
Yes, I agree with you. We can move this part to another function to make it more explicit. I am also able to reproduce facebookresearch/faiss#1434, even when compiling fresh from the top of the branch. I suspect an issue related to the default value of `metric_type` in the struct. They recently added support for the IP metric type for HNSW, so I suspect they might have missed it in some places.
> Regarding the IDMap. What would be the use case here? I think the FAISS id handling is rather something internal that is not interesting for most haystack users. Therefore, I don't think we need to support it, but maybe I am missing a good use case here 🤔 ?
Actually, `IDMap` has a very good use case: it abstracts the FAISS-internal id behind an external id. Hence one can `update` or `delete` an embedding in the FAISS index store. As far as I understand, embeddings are internally indexed sequentially by `vector_id`, so if one id in between is removed, the `vector_id` of every embedding after that record must be updated.

So far haystack only allows the add method for the FAISS `document_store` (Elasticsearch doesn't have this issue). If users want to update or remove a particular document, they need to delete all documents and then add them again. Another solution is to mark the document as stale, but that will not prevent a FAISS search from returning the stale document's `vector_id`. This is what I covered in this comment.
@tholor Do you have a use case for `IDMap`? BTW, Milvus handles these cases better: https://dzone.com/articles/how-milvus-realizes-the-delete-function

They are now also planning to support multiple storage options: https://docs.google.com/document/d/1iwwLH4Jtm3OXIVb7jFYsfmcbOyX6AWZKaNJAaXC7-cw/edit
```python
document_objects = [Document.from_dict(d) if isinstance(d, dict) else d for d in documents]
embeddings = [doc.embedding for doc in document_objects]
embeddings = np.array(embeddings, dtype="float32")
self.faiss_index.train(embeddings)
```
We need to check `self.faiss_index.is_trained` before calling `train`, otherwise the function will fail.
Ok, we could catch the error and make it more descriptive, but I think we should also let the user know that it's not valid to call train_index() in this case.
```diff
@@ -26,7 +26,11 @@ def elasticsearch_fixture():
     except:
         print("Starting Elasticsearch ...")
         status = subprocess.run(
-            ['docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.1'],
+            ['docker rm haystack_test_elastic'],
```
The same is needed for the tika docker container as well.
As described in #422, we don't need the phi normalization trick anymore, as HNSW supports the inner product metric. This simplifies the code a lot, especially for multiple subsequent writes of documents.
Breaking changes:
- `vector_size` -> `vector_dim`

Old: `FAISSDocumentStore(...vector_size=768)`
New: `FAISSDocumentStore(...vector_dim=768)`
Limitations / Future improvements