
Add cosine similarity as similarity metric for FAISS document store #1337

Closed
mathislucka opened this issue Aug 11, 2021 · 10 comments

Comments

@mathislucka
Member

Is your feature request related to a problem? Please describe.
FAISS is a really nice and fast indexing solution for dense vectors. Using it for semantic similarity search works very well. However, currently only dot product and L2 are supported when it comes to distance metrics for FAISS. In

raise ValueError("The FAISS document store can currently only support dot_product similarity. "
you raise an error whenever any other distance metric should be used for FAISS indices. When looking at sentence embedding models which are currently available, only some of them are optimized for dot product similarity (mostly for asymmetric search) while most work well with cosine similarity (mostly for symmetric search) (see https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models ). It would be really nice if the FAISS document store could also support cosine similarity as a distance metric.

Describe the solution you'd like
As stated in the FAISS documentation (https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances#how-can-i-index-vectors-for-cosine-similarity), dot product is equivalent to cosine similarity when the indexed vectors and the query vector are L2-normalized. This feature could be implemented by normalizing all document embeddings written by a FAISS document store that was initialized with the cosine similarity metric. The same would have to be done for the query embeddings in the retriever.
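The normalization trick can be sketched like this (pure NumPy, independent of Haystack; the helper name is made up for illustration):

```python
import numpy as np

# Minimal sketch: after L2 normalization, the dot product of two vectors
# equals their cosine similarity, which is why an inner-product FAISS
# index is sufficient once all vectors are normalized.
def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = l2_normalize(a) @ l2_normalize(b)
# the two values agree up to floating-point error
```

FAISS also ships a helper for this (`faiss.normalize_L2`, which normalizes a float32 matrix in place), so the document store would not need its own implementation.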

Describe alternatives you've considered

Additional context
I could try and make a PR for this, if such a feature is desired. I'd also be glad if you had any hints on your preferred way of implementation.

@tholor
Member

tholor commented Aug 12, 2021

Hey @mathislucka ,
This would be a very helpful addition - we'd be happy to see a PR here :)

We'd probably need to add the normalization to write_documents (if docs already contain embeddings), update_embeddings, and query_by_embedding when the doc store was initialized with cosine similarity. I currently don't see the need to touch the retriever, but maybe I am missing something.
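The three hook points above could look roughly like this (a hypothetical sketch; the class is a stand-in, and the method names merely mirror the ones mentioned in the comment):

```python
import numpy as np

# Hypothetical sketch of where normalization could hook into the store.
# Not Haystack's actual implementation; embeddings are plain arrays here.
class CosineFAISSStoreSketch:
    def __init__(self, similarity: str = "dot_product"):
        self.similarity = similarity

    def _maybe_normalize(self, emb: np.ndarray) -> np.ndarray:
        if self.similarity == "cosine":
            return emb / np.linalg.norm(emb, axis=-1, keepdims=True)
        return emb  # dot_product / l2: store embeddings unchanged

    def write_documents(self, embeddings: np.ndarray) -> np.ndarray:
        # normalize incoming doc embeddings (if docs already contain them)
        return self._maybe_normalize(embeddings)

    def update_embeddings(self, embeddings: np.ndarray) -> np.ndarray:
        # same treatment when embeddings are (re)computed
        return self._maybe_normalize(embeddings)

    def query_by_embedding(self, query_emb: np.ndarray) -> np.ndarray:
        # the query vector must be normalized the same way
        return self._maybe_normalize(query_emb)
```

With this placement the retriever never needs to know about the metric, which matches the point that only the document store has to change.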

@lalitpagaria
Contributor

lalitpagaria commented Aug 12, 2021

@tholor we reverted this in #467. Refer to #422 for more discussion. The relevant comment:

After some further investigation, I think we can actually get rid of the phi normalization trick.
The HNSW Index in FAISS now also supports the inner product metric directly. While I had problems with initializing the index via the factory, the direct init seems to work well (facebookresearch/faiss#1434).
When comparing both approaches, it seemed that L2+ Phi normalization is a bit faster (0.009 vs. 0.006 sec/query), but accuracy was actually a bit worse on some random dummy vectors (~ 0.85 vs 0.54) .

@tholor
Member

tholor commented Aug 12, 2021

This was a different type of normalization. If I recall correctly, we used L2+Phi normalization back then to enable inner product in HNSW. This was later supported natively by FAISS, so the trick was no longer required. This issue, however, is about supporting cosine similarity (which just requires L2 normalization).

@lalitpagaria
Contributor

Oh okay, now I remember that I already implemented L2 normalisation, but we closed that PR and kept it for future reference. We can bring back some of those ideas now 🙂

You had one comment about L2 normalisation:
#385 (comment)

@mathislucka
Member Author

You are right @tholor, there will probably only be changes in the document store and not in the retriever. I didn't look at it properly before. I'll make sure to read through the referenced issues when implementing the feature.

What do you think about the case when someone loads an existing FAISS index from a file? As far as I can see, no initialization parameters are saved alongside the file, so someone might initialize an index that was originally generated with cosine similarity with another distance metric. This would produce incompatible vectors. Currently, I am not sure there is a solution to this. On the other hand, you already put the responsibility for checking that saved and loaded indices are compatible in the hands of the end user (e.g. here:

make sure to use the same SQL DB that you used when calling `save()`.

). So maybe it would be OK to place the burden on the user here.

@tholor
Member

tholor commented Aug 13, 2021

What do you think about the case when someone loads an existing FAISS index from a file?

Good point. It might make sense to add the similarity arg to the load method. Then it's still the user's responsibility to check that the right one is set, but at least it's possible to set it.
Optionally, we could also do a check after the index is loaded with similarity = cosine: we could verify that a sample of vectors has length 1 and was therefore normalized.
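That sanity check could be a one-liner along these lines (the function name is hypothetical; the tolerance accounts for float32 rounding in stored vectors):

```python
import numpy as np

# Sketch of the suggested post-load check: when the store is loaded with
# similarity="cosine", sample a few stored vectors and verify that they
# are approximately unit length, i.e. were L2-normalized before indexing.
def looks_l2_normalized(sample: np.ndarray, tol: float = 1e-3) -> bool:
    norms = np.linalg.norm(sample, axis=-1)
    return bool(np.allclose(norms, 1.0, atol=tol))
```

A failed check could then raise a warning that the loaded index was probably not built with cosine similarity.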

@mathislucka
Member Author

I didn't get to writing any code, but I managed to think a bit more about the issue of embedding mismatch when loading a FAISS document store by calling its load method:

As far as I can see, the load method does not take the same arguments as the __init__ method. Isn't this problematic, because the document store will always be initialized with the default values when using the load method? It's not just the similarity arg that would be missing; it's all the other arguments too.

I came to the conclusion that the load method should take the same args as __init__ and pass them on during initialization. I think this should be a separate PR, though.

From an end user perspective, I do not like that I have to remember these arguments or store them somewhere else. Couldn't the save method also save the current configuration of the document store alongside the actual FAISS index file? We could just store it as JSON or something. Maybe make this optional: the load method would either look for such a file when loading the FAISS index or fall back to the default args. This would also be non-breaking, as indices that were saved without a configuration would still load in the same way as before. This would also be a separate PR, I think.

What do you think about this @tholor ?

@mathislucka
Member Author

I have added a first draft of the proposed changes. It just contains the bare minimum code needed to support cosine similarity.

I'm having issues getting the tests to run, though. It seems as if some requirements are missing from requirements.txt and requirements-dev.txt. I could not find documentation on setting up the project and getting the tests to run. Could you give me some hints or point me to resources that I've possibly overlooked?

@tholor
Member

tholor commented Aug 18, 2021

From an end user perspective, I do not like that I have to remember these arguments or store them somewhere else. Couldn't the save method also save the current configuration of the document store alongside the actual faiss index file? We could just store it in a json or something

Yes, that's absolutely right. We store the params in the config attribute anyway, so it should be easy to store them as JSON and load them back when needed. Let's tackle it separately. Can you please create an issue?

I'm having issues getting the tests to run, though.

Can you please share more details about the problem you are facing (e.g. error msgs)? If you want to run all tests locally, you will need to run a couple of document stores in the background. The simplest way would be to launch the required ones via Docker (similar to our CI) before running the tests.
You can also disable a couple of tests via markers or an additional param, but I think we should document all of this much better. This is a nightmare for contributors. Created another issue for it: #1353

@mathislucka
Member Author

Closing this with #1352.
