Ilya Razenshteyn edited this page Dec 8, 2015 · 25 revisions

Frequently Asked Questions

  • Q: You say that FALCONN uses LSH. What is LSH anyway?
  • A: LSH stands for locality-sensitive hashing, a way to partition a metric space into nicely shaped cells so that nearby points tend to land in the same cell. See a brief LSH primer for an introduction.
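To make the idea concrete, here is a minimal sketch of one classic LSH family for angular distance (random hyperplanes, often called SimHash). This is only an illustration of the principle; it is not FALCONN's hash family or API, and all names below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
dim, num_bits = 32, 16
# Each row is the normal of a random hyperplane through the origin.
planes = rng.standard_normal((num_bits, dim))

def lsh_bucket(v):
    # Each bit records which side of a random hyperplane v falls on;
    # vectors at a small angle agree on most bits, so they tend to
    # land in the same (or a nearby) bucket.
    return tuple((planes @ v > 0).astype(int))

x = rng.standard_normal(dim)
near = x + 0.01 * rng.standard_normal(dim)   # small perturbation of x
far = rng.standard_normal(dim)               # unrelated random vector

def agree(a, b):
    # Number of hash bits on which two vectors agree.
    return sum(i == j for i, j in zip(lsh_bucket(a), lsh_bucket(b)))

print(agree(x, near), agree(x, far))
```

Running this, the near pair agrees on (almost) all bits while the unrelated pair agrees on roughly half of them, which is exactly the locality-sensitive property.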
  • Q: My favorite paper about nearest neighbor search claims that LSH requires lots of memory and is thus impractical. Is that true?
  • A: Yes and no. As originally defined, LSH can indeed require lots of memory, which may be infeasible for large datasets. However, in 2007 researchers came up with a simple and remarkably efficient way to reduce the space used by LSH-based data structures, called multiprobe LSH. FALCONN supports multiprobe LSH and works very well even when the whole data structure is as small as the dataset itself, or smaller.
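The intuition behind multiprobe LSH can be sketched as follows: instead of building many independent hash tables to catch near neighbors that landed in an adjacent bucket (which is what costs memory), a single table is queried at several buckets whose hash codes differ from the query's code in a few positions. The toy generator below illustrates this for a bit-based hash; it is not FALCONN's probing order, just the idea.

```python
from itertools import combinations

def probes(code, max_flips=2):
    # Yield the query's own bucket first, then buckets at Hamming
    # distance 1, 2, ... up to max_flips. Probing these extra buckets
    # recovers near neighbors without extra hash tables.
    n = len(code)
    yield code
    for flips in range(1, max_flips + 1):
        for idx in combinations(range(n), flips):
            probed = list(code)
            for i in idx:
                probed[i] ^= 1
            yield tuple(probed)

buckets = list(probes((1, 0, 1), max_flips=1))
print(buckets)
```

The first entry is the original bucket, followed by all buckets one bit-flip away; a real implementation orders the probes by how likely each bucket is to contain a near neighbor.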
  • Q: FALCONN provides LSH for cosine similarity. How come you claim it can be used for the general Euclidean distance?
  • A: LSH for cosine similarity is good enough if your dataset is reasonable. Formally, this means that the angle between two vectors is correlated with the Euclidean distance between them. In that case, you can partition the dataset using LSH for cosine similarity, and then use the true Euclidean distance to select the K nearest neighbors among the candidates. Before building the data structure, it is also a good idea to re-center the dataset so that its center of mass is at the origin: this often makes the dataset more isotropic and easier to partition. FALCONN provides both of these features.
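The two steps mentioned in the answer above, re-centering the data and re-ranking LSH candidates by true Euclidean distance, can be sketched in plain NumPy. This is a conceptual stand-in, not FALCONN code; the candidate indices are made up to show the re-ranking step.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 8)) + 5.0   # dataset whose center is far from the origin

# Step 1: re-center so the center of mass is at the origin. This
# preserves all pairwise Euclidean distances but tends to make angular
# partitioning more effective.
center = data.mean(axis=0)
centered = data - center

query = rng.standard_normal(8) + 5.0
q = query - center                            # the query must be shifted the same way

# Step 2: pretend an angular-LSH lookup returned these candidate indices
# (hypothetical values); re-rank them by genuine Euclidean distance.
candidates = [3, 17, 42, 99]
best = min(candidates, key=lambda i: np.linalg.norm(centered[i] - q))
print(best)
```

Note that because the query and the data are shifted by the same center, the Euclidean distances computed in the centered space equal the distances in the original space, so the re-ranking is exact.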