Similarity

Similarity refers to how much overlap there is between two things. Knowing how similar two things are is useful because, for instance, similar things can often be substituted for each other.

Word similarity

Using a thesaurus, you can find out how similar two words are with path length, Resnik similarity or Lesk similarity. Another option is edit distance, which compares the spelling of the two words rather than their meaning.
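As an illustration of edit distance, here is a minimal sketch that assumes the Levenshtein variant (single-character insertions, deletions and substitutions, each costing 1). The notes do not say which variant the course uses, and the function name `edit_distance` is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed variant, unit costs)."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("possum", "opossum"))  # 1
```

A lower value means the two spellings are closer; identical strings have distance 0.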

Vector similarity

There are many ways to calculate the similarity between two vectors or embeddings. Here, similarity is quantified by the distance between points in space: the closer two points are, the more similar they are. The measures discussed in the course are Jaccard distance, Euclidean distance and cosine similarity.

Jaccard is used to compare co-occurrence sets. Euclidean distance is the more intuitive distance measure, but it is strongly influenced by a single coordinate of one vector lying far from the corresponding coordinate of the other. Cosine is usually a better choice than Euclidean distance for embeddings, because it compares the direction of the vectors and so gives prominence to similarity in relative values rather than in absolute magnitudes.
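The three measures can be sketched in a few lines of plain Python. This is an illustrative sketch, not the course's implementation: the function names, the example sets and the example vectors are all invented here, and treating two empty sets as fully similar is an assumption.

```python
import math


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Invented co-occurrence sets: 2 shared words out of 4 distinct words.
print(jaccard_similarity({"pet", "fur", "tail"}, {"pet", "fur", "road"}))  # 0.5

# v is u scaled by 2: same relative values, different magnitudes.
u, v = [1, 2, 3], [2, 4, 6]
print(euclidean_distance(u, v))  # sqrt(14), so clearly non-zero
print(cosine_similarity(u, v))   # approximately 1.0
```

The last two lines show the point about relative values: scaling a vector changes its Euclidean distance to the original but leaves the cosine similarity at (approximately) 1. The Jaccard distance is one minus the Jaccard similarity.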

Cosine vs Euclidean

In the figure above, a word that is high on the y-axis occurs often with "pet", and a word that is high on the x-axis occurs often with "road". The absolute distance between possum and monocycle is very small, but the angle between them is large. The angle between cat and possum, by contrast, is small, which correctly reflects that these two words are similar.
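The same effect can be reproduced numerically. The coordinates below are hypothetical (road, pet) values chosen only to mimic the situation described; they are not read off the figure, and `angle_degrees` is a helper name made up for this sketch.

```python
import math

# Hypothetical (road, pet) coordinates, chosen for illustration only.
vectors = {
    "cat": (2.0, 8.0),
    "possum": (0.5, 2.0),
    "monocycle": (2.0, 0.5),
}


def angle_degrees(u, v):
    """Angle between two 2-D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    cosine = dot / (math.hypot(*u) * math.hypot(*v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


for other in ("monocycle", "cat"):
    distance = math.dist(vectors["possum"], vectors[other])
    angle = angle_degrees(vectors["possum"], vectors[other])
    print(f"possum vs {other}: distance={distance:.2f}, angle={angle:.1f} degrees")
```

With these made-up numbers, possum is closer to monocycle in Euclidean distance, yet its angle to cat is the smaller of the two, so cosine groups possum with cat.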

Here is a more fun image:

Distance measures