Add cosine similarity as similarity metric for FAISS document store #1337
Comments
Hey @mathislucka , We'd probably need to add the normalization to |
@tholor we reverted this in #467. See #422 for more discussion. Referenced comment:
|
This was a different type of normalization. If I recall correctly we used L2+Phi normalization back then to enable inner product in HNSW. This was then later on natively supported by FAISS and the trick was not required anymore. This issue here however is about supporting cosine similarity (just requiring L2 normalization). |
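For background, one well-known way to run inner-product search on an index that only supports L2 distance is to append one extra dimension to every document vector so that all augmented vectors share the same norm. The sketch below assumes this is roughly what the "L2+Phi" trick refers to; it is not taken from the Haystack code, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def augment_documents(docs):
    """Append sqrt(phi - ||d||^2) to each document, where phi is the largest squared norm."""
    phi = max(dot(d, d) for d in docs)
    return [d + [math.sqrt(max(phi - dot(d, d), 0.0))] for d in docs]


def augment_query(query):
    """Queries get a zero in the extra dimension."""
    return query + [0.0]


docs = [[1.0, 0.0], [2.0, 2.0], [0.5, 3.0]]
query = [1.0, 1.0]

# Ranking by descending inner product on the original vectors.
by_inner_product = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))

# Ranking by ascending L2 distance on the augmented vectors.
augmented = augment_documents(docs)
by_l2 = sorted(
    range(len(docs)),
    key=lambda i: squared_l2(augment_query(query), augmented[i]),
)

assert by_l2 == by_inner_product
```

The rankings agree because the squared distance between the augmented vectors equals `||q||^2 + phi - 2 * dot(q, d)`, which decreases as the inner product grows. Cosine similarity needs none of this, only plain L2 normalization.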
Oh okay, now I know that I already implemented L2 normalisation, but we closed that PR for future reference. We can bring in some of these ideas now 🙂 You had one comment about L2 normalisation |
You are right @tholor, there will probably only be changes in the document store and not in the retriever. I didn't look at it properly before. I'll make sure to read through the referenced issues when implementing the feature. What do you think about the case where someone loads an existing FAISS index from a file? As far as I can see, no initialization parameters are saved alongside the file, so someone might load an index that was originally generated with cosine similarity and initialize it with a different similarity metric. This would produce incompatible vectors. Currently, I am not sure if there is a solution to this. On the other hand, you already leave the responsibility for checking that saved and loaded indices are compatible to the end user (e.g. here: haystack/haystack/document_store/faiss.py Line 454 in 07bd3c5
|
Good point. It might make sense to add the |
I didn't get to writing any code, but I managed to think a bit more about the issue of embedding mismatch when loading a faiss document store by calling its As far as I can see, the I came to the conclusion that the From an end user perspective, I do not like that I have to remember these arguments or store them somewhere else. Couldn't the What do you think about this @tholor ? |
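To make the idea of storing the initialization arguments next to the index concrete, here is a minimal sketch using only the standard library. The helper names `save_index_config` and `load_index_config`, the `.json` sidecar file, and the config keys are all hypothetical; this is not Haystack's API, just one possible shape for it.

```python
import json
import os
import tempfile


def save_index_config(index_path, similarity, vector_dim):
    """Write the init parameters to a JSON file next to the (hypothetical) index file."""
    config_path = index_path + ".json"
    with open(config_path, "w") as f:
        json.dump({"similarity": similarity, "vector_dim": vector_dim}, f)
    return config_path


def load_index_config(index_path, expected_similarity=None):
    """Read the parameters back and fail loudly if the caller asks for another metric."""
    with open(index_path + ".json") as f:
        config = json.load(f)
    if expected_similarity is not None and config["similarity"] != expected_similarity:
        raise ValueError(
            f"Index was built with similarity={config['similarity']!r}, "
            f"but {expected_similarity!r} was requested."
        )
    return config


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "my_index.faiss")
    save_index_config(index_path, similarity="cosine", vector_dim=768)

    # Loading without arguments simply restores the stored parameters.
    loaded = load_index_config(index_path)

    # Asking for a different metric than the one stored is rejected.
    try:
        load_index_config(index_path, expected_similarity="dot_product")
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

With something like this, the end user would not have to remember the original arguments, and an index built with cosine similarity could not silently be reopened with another metric.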
I have added a first draft of the proposed changes. It just contains the bare minimum code that will be needed to support cosine similarity. I'm having trouble getting the tests to run, though. It seems as if some requirements are missing in the |
Yes, that's absolutely right. We store the params in the attribute
Can you please share more details about the problem that you are facing (e.g. error messages)? If you want to run all tests locally, you will need to run a couple of document stores in the background. The simplest approach would be to launch the required ones via Docker (similar to what we do in our CI) before running the tests. |
closing this with #1352 |
Is your feature request related to a problem? Please describe.
FAISS is a really nice and fast indexing solution for dense vectors. Using it for semantic similarity search works very well. However, currently only dot product and L2 are supported when it comes to distance metrics for FAISS. In
haystack/haystack/document_store/faiss.py
Line 98 in 7569ab9
Describe the solution you'd like
As stated in the FAISS documentation (https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances#how-can-i-index-vectors-for-cosine-similarity), dot product is equivalent to cosine similarity when the indexed vectors and the query vector are L2-normalized. This feature could be implemented by normalizing all document embeddings that are written by a FAISS document store which was initialized with a cosine similarity metric. The same would have to be done for the query embedding on the retriever side.
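The equivalence can be checked with a few lines of plain Python. This is only an illustrative sketch with made-up vectors and hypothetical helper names, not the proposed implementation:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


doc = [3.0, 4.0, 0.0]
query = [1.0, 2.0, 2.0]

# Inner product of the normalized vectors ...
ip_after_norm = dot(l2_normalize(doc), l2_normalize(query))

# ... matches the cosine similarity of the original vectors.
assert math.isclose(ip_after_norm, cosine_similarity(doc, query))
```

For real embeddings stored as NumPy arrays, the FAISS wiki page linked above suggests `faiss.normalize_L2` for the normalization step before adding vectors to an inner-product index.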
Describe alternatives you've considered
Additional context
I could try and make a PR for this, if such a feature is desired. I'd also be glad if you had any hints on your preferred way of implementation.