
Add cosine similarity as similarity metric for FAISS document store #1337

Closed
mathislucka opened this issue Aug 11, 2021 · 10 comments

Comments

@mathislucka
Member

Is your feature request related to a problem? Please describe.
FAISS is a really nice and fast indexing solution for dense vectors. Using it for semantic similarity search works very well. However, currently only dot product and L2 are supported when it comes to distance metrics for FAISS. In

raise ValueError("The FAISS document store can currently only support dot_product similarity. "
you raise an error whenever any other distance metric should be used for FAISS indices. When looking at sentence embedding models which are currently available, only some of them are optimized for dot product similarity (mostly for asymmetric search) while most work well with cosine similarity (mostly for symmetric search) (see https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models ). It would be really nice if the FAISS document store could also support cosine similarity as a distance metric.

Describe the solution you'd like
As stated in the FAISS documentation (https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances#how-can-i-index-vectors-for-cosine-similarity), dot product is equivalent to cosine similarity when the indexed vectors and the query vector are L2-normalized. This feature could be implemented by normalizing all document embeddings written by a FAISS document store that was initialized with the cosine similarity metric. The same would have to be done for the query embeddings in the retriever.
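The normalization trick can be sketched like this (pure NumPy, independent of Haystack; the helper name is made up for illustration):

```python
import numpy as np

# Minimal sketch: after L2 normalization, the dot product of two vectors
# equals their cosine similarity, which is why an inner-product FAISS
# index is sufficient once all vectors are normalized.
def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = l2_normalize(a) @ l2_normalize(b)
# the two values agree up to floating-point error
```

FAISS also ships a helper for this (`faiss.normalize_L2`, which normalizes a float32 matrix in place), so the document store would not need its own implementation.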

Describe alternatives you've considered

Additional context
I could try and make a PR for this, if such a feature is desired. I'd also be glad if you had any hints on your preferred way of implementation.

@tholor
Member

tholor commented Aug 12, 2021

Hey @mathislucka ,
This would be a very helpful addition - we'd be happy to see a PR here :)

We'd probably need to add the normalization to write_documents (if docs already contain embeddings), update_embeddings, and query_by_embedding when the doc store was initialized with cosine similarity. I currently don't see the need to touch the retriever, but maybe I am missing something.
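The three hook points above could look roughly like this (a hypothetical sketch; the class is a stand-in, and the method names merely mirror the ones mentioned in the comment):

```python
import numpy as np

# Hypothetical sketch of where normalization could hook into the store.
# Not Haystack's actual implementation; embeddings are plain arrays here.
class CosineFAISSStoreSketch:
    def __init__(self, similarity: str = "dot_product"):
        self.similarity = similarity

    def _maybe_normalize(self, emb: np.ndarray) -> np.ndarray:
        if self.similarity == "cosine":
            return emb / np.linalg.norm(emb, axis=-1, keepdims=True)
        return emb  # dot_product / l2: store embeddings unchanged

    def write_documents(self, embeddings: np.ndarray) -> np.ndarray:
        # normalize incoming doc embeddings (if docs already contain them)
        return self._maybe_normalize(embeddings)

    def update_embeddings(self, embeddings: np.ndarray) -> np.ndarray:
        # same treatment when embeddings are (re)computed
        return self._maybe_normalize(embeddings)

    def query_by_embedding(self, query_emb: np.ndarray) -> np.ndarray:
        # the query vector must be normalized the same way
        return self._maybe_normalize(query_emb)
```

With this placement the retriever never needs to know about the metric, which matches the point that only the document store has to change.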

@lalitpagaria
Contributor

lalitpagaria commented Aug 12, 2021

@tholor we reverted this in #467. Refer to #422 for more discussion. The relevant comment:

After some further investigation, I think we can actually get rid of the phi normalization trick.
The HNSW Index in FAISS now also supports the inner product metric directly. While I had problems with initializing the index via the factory, the direct init seems to work well (facebookresearch/faiss#1434).
When comparing both approaches, it seemed that L2+ Phi normalization is a bit faster (0.009 vs. 0.006 sec/query), but accuracy was actually a bit worse on some random dummy vectors (~ 0.85 vs 0.54) .

@tholor
Member

tholor commented Aug 12, 2021

This was a different type of normalization. If I recall correctly, we used L2+Phi normalization back then to enable inner product in HNSW. This was later supported natively by FAISS, so the trick was no longer required. This issue, however, is about supporting cosine similarity (which just requires L2 normalization).

@lalitpagaria
Contributor

Oh okay, now I remember that I already implemented L2 normalisation, but we closed that PR and kept it for future reference. We can bring back some of those ideas now 🙂

You had one comment about L2 normalisation:
#385 (comment)

@mathislucka
Member Author

You are right @tholor, there will probably only be changes in the document store and not in the retriever. I didn't look at it properly before. I'll make sure to read through the referenced issues when implementing the feature.

What do you think about the case when someone loads an existing FAISS index from a file? As far as I can see, no initialization parameters are saved alongside the file, so someone might initialize an index that was originally generated with cosine similarity with another distance metric. This would produce incompatible vectors. Currently, I am not sure there is a solution to this. On the other hand, you already put the responsibility for checking that saved and loaded indices are compatible in the hands of the end user (e.g. here:

make sure to use the same SQL DB that you used when calling `save()`.

). So maybe it would be OK to place the burden on the user here.

@tholor
Member

tholor commented Aug 13, 2021

What do you think about the case when someone loads an existing FAISS index from a file?

Good point. It might make sense to add the similarity arg to the load method. Then it's still the user's responsibility to check that the right one is set, but at least it's possible to set it.
Optionally, we could also do a check after the index is loaded with similarity = cosine: we could verify that a sample of vectors has length 1 and was therefore normalized.
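That sanity check could be a one-liner along these lines (the function name is hypothetical; the tolerance accounts for float32 rounding in stored vectors):

```python
import numpy as np

# Sketch of the suggested post-load check: when the store is loaded with
# similarity="cosine", sample a few stored vectors and verify that they
# are approximately unit length, i.e. were L2-normalized before indexing.
def looks_l2_normalized(sample: np.ndarray, tol: float = 1e-3) -> bool:
    norms = np.linalg.norm(sample, axis=-1)
    return bool(np.allclose(norms, 1.0, atol=tol))
```

A failed check could then raise a warning that the loaded index was probably not built with cosine similarity.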

@mathislucka
Member Author

I didn't get to writing any code, but I managed to think a bit more about the issue of embedding mismatch when loading a FAISS document store by calling its load method:

As far as I can see, the load method does not take the same arguments as the __init__ method. Isn't this problematic, because the document store will always be initialized with the default values when using the load method? It's not just the similarity arg that would be missing; it's all the other arguments too.

I came to the conclusion that the load method should take the same args as __init__ and pass them on during initialization. I think this should be a separate PR, though.

From an end user perspective, I do not like that I have to remember these arguments or store them somewhere else. Couldn't the save method also save the current configuration of the document store alongside the actual FAISS index file? We could just store it as JSON or something. Maybe make this optional: the load method would either look for such a file when loading the FAISS index or fall back to the default args. This would also be non-breaking, as indices that were saved without a configuration would still load in the same way as before. This would also be a separate PR, I think.

What do you think about this @tholor ?

@mathislucka
Member Author

I have added a first draft of the proposed changes. It just contains the bare minimum code needed to support cosine similarity.

I'm having issues getting the tests to run, though. It seems as if some requirements are missing from requirements.txt and requirements-dev.txt. I could not find documentation on setting up the project and getting the tests to run. Could you give me some hints or point me to resources that I've possibly overlooked?

@tholor
Member

tholor commented Aug 18, 2021

From an end user perspective, I do not like that I have to remember these arguments or store them somewhere else. Couldn't the save method also save the current configuration of the document store alongside the actual faiss index file? We could just store it in a json or something

Yes, that's absolutely right. We store the params in the config attribute anyway, so it should be easy to store them as JSON and load them back when needed. Let's tackle it separately. Can you please create an issue?

I'm having issues getting the tests to run, though.

Can you please share more details about the problem you are facing (e.g. error msgs)? If you want to run all tests locally, you will need to run a couple of document stores in the background. The simplest way would be to launch the required ones via Docker (similar to our CI) before running the tests.
You can also disable a couple of tests via markers or an additional param, but I think we should document all of this much better. This is a nightmare for contributors. Created another issue for it: #1353

@mathislucka
Member Author

Closing this with #1352.
