Added Function relative_cosine_similarity in keyedvectors.py #2307

Merged · 18 commits · Jan 15, 2019
29 changes: 29 additions & 0 deletions gensim/models/keyedvectors.py
@@ -1385,6 +1385,35 @@ def init_sims(self, replace=False):
logger.info("precomputing L2-norms of word weight vectors")
self.vectors_norm = _l2_norm(self.vectors, replace=replace)

def relative_cosine_similarity(self, wa, wb, topn=10):
"""Compute the relative cosine similarity between two words given their top-n similar words,
following `Artuur Leeuwenberg et al. <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_.

Relative cosine similarity is computed using equation (1) of the paper.
For WordNet synonyms, if rcs(topn=10) is greater than 0.10, then `wa` and `wb` are more likely
to be synonyms than an arbitrary word pair.

Parameters
----------
wa : str
Word for which the top-n similar words are retrieved.
wb : str
Word whose relative cosine similarity with `wa` is computed.
topn : int, optional
Number of top similar words of `wa` to consider.

Returns
-------
numpy.float64
Relative cosine similarity between wa and wb.

"""
sims = self.similar_by_word(wa, topn)
assert sims, "Failed code invariant: list of similar words must never be empty."
rcs = float(self.similarity(wa, wb)) / sum(sim for _, sim in sims)

return rcs
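As a standalone sketch of the computation (not the gensim API itself), equation (1) can be reproduced with NumPy over a toy vector set; all vectors and words below are hypothetical, chosen only for illustration:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy word vectors (hypothetical values, for illustration only).
vectors = {
    "good":  np.array([1.0, 0.2, 0.1]),
    "nice":  np.array([0.9, 0.3, 0.1]),
    "great": np.array([0.95, 0.25, 0.05]),
    "okay":  np.array([0.7, 0.4, 0.2]),
    "bad":   np.array([-0.8, 0.1, 0.3]),
}

def relative_cosine_similarity(wa, wb, topn=2):
    # Equation (1): cos(wa, wb) divided by the sum of the cosine
    # similarities of wa with its top-n most similar words.
    sims = sorted(
        (cosine(vectors[wa], v) for w, v in vectors.items() if w != wa),
        reverse=True,
    )[:topn]
    return cosine(vectors[wa], vectors[wb]) / sum(sims)

rcs = relative_cosine_similarity("good", "nice")
```

Since the denominator is the sum of the topn similarities, a synonym-like pair scores near 1/topn while an unrelated pair scores well below it; in the paper's setting topn=10, which is where the 0.10 reference threshold comes from.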


class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
"""Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
Expand Down
23 changes: 23 additions & 0 deletions gensim/test/test_keyedvectors.py
@@ -106,6 +106,29 @@ def test_most_similar_topn(self):
predicted = self.vectors.most_similar('war', topn=None)
self.assertEqual(len(predicted), len(self.vectors.vocab))

def test_relative_cosine_similarity(self):
"""Test that relative_cosine_similarity returns expected results for a given word pair and topn."""
wordnet_syn = [
'good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
'unspoiled', 'unspoilt', 'thoroughly', 'soundly'
] # synonyms for "good" as per wordnet
cos_sim = []
for word in wordnet_syn:
if word in self.vectors.vocab:
cos_sim.append(self.vectors.similarity("good", word))
cos_sim = sorted(cos_sim, reverse=True)  # cosine similarity of "good" with wordnet_syn, in decreasing order
# computing relative_cosine_similarity of two similar words
rcs_wordnet = self.vectors.similarity("good", "nice") / sum(cos_sim[i] for i in range(10))
Collaborator:

I'm not sure what this is calculating. It's kind of like the relative_cosine_similarity() formula, but now with only WordNet synonyms as contributors to the denominator. And, only those synonyms which happen to be in this vector-set. Are all those words in the euclidean_vectors.bin test vectors set? As a result, I'm not sure what the following asserts really test. Is this matching something in the paper?

Contributor Author (@rsdel2007, Dec 31, 2018):

Actually, this is the problem I found while writing the test. There is no precise claim or reference result in the paper, and I can't find any way to confirm the values on a corpus other than WordNet, so I think the best way is to compare the relative_cosine_similarity of WordNet synonyms and the most similar words under a threshold of 0.125.

Let me explain the insights of the paper's section on relative cosine similarity:

  1. Construct a set of the top 10 most (cosine) similar words for w1 (called topn in the paper).
  2. Calculate a normalized score for each word in the topn set, by dividing its cosine similarity by the sum of the topn cosine similarity scores.

They mostly wanted to know whether the most similar word of w1 was a synonym or not, as opposed to a hypernym etc. They expected that if the most (cosine) similar word is a lot more (cosine) similar than the other words in the topn set, it is more likely to be a synonym than if it is only slightly more similar. This is what the rcs takes into account.
So their conclusion, which is the only claim in the paper, is that if a word pair has an rcs greater than 0.10, it is more likely to be a synonym pair than an arbitrary pair.
0.10 can be used as a threshold, but this result is based on the WordNet corpus; on a small corpus it may be lower. The threshold is essentially the mean of the cosine similarities of the topn words, so on a small corpus it may be anything less than 0.10.

@gojomo, can you suggest a better way to test, given the above description?

I am looking forward to helping with the tests.
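The normalization in steps 1 and 2 above can be sketched standalone; the similarity scores below are hypothetical values chosen for illustration:

```python
# Toy cosine similarities of w1 with its top-10 most similar words
# (hypothetical values, sorted in decreasing order).
top10 = [0.82, 0.60, 0.58, 0.55, 0.54, 0.52, 0.51, 0.50, 0.49, 0.48]

total = sum(top10)
# Normalized score (relative cosine similarity) of each top-n word.
rcs_scores = [s / total for s in top10]

# If all ten neighbors were equally similar, every score would be
# exactly 1/10 = 0.10; a score clearly above 0.10 marks a nearest
# neighbor that stands out from the rest (a likely synonym).
nearest_stands_out = rcs_scores[0] > 0.10
```

Because the normalized scores sum to 1 by construction, 1/topn is the natural baseline, which is why 0.10 is the threshold when topn=10.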

Collaborator:

I don't know offhand what's in euclidean_vectors.bin - if there's any overlap with the wordnet words you've chosen. But if there's one or more word-and-nearest-neighbor pairs in that set-of-word-vectors (or some other available-at-unit-testing set-of-word-vectors) that the RCS measure successfully identifies as synonyms, and one or more other word-and-nearest-neighbor pairs that the RCS measure also successfully rejects as synonyms, then having the test method show that functionality would be useful as a demonstration/confirmation of the RCS functionality. (And, at least a little, a guard against any future regressions where that breaks due to other changes... which seems unlikely here, but is one of the reasons for ensuring this kind of test coverage.)

Maybe @viplexke, who originally suggested this in #2175, has some other application/test ideas?

rcs = self.vectors.relative_cosine_similarity("good", "nice", 10)
self.assertTrue(rcs_wordnet >= rcs)
self.assertTrue(np.allclose(rcs_wordnet, rcs, 0, 0.125))
# computing relative_cosine_similarity for two non-similar words
rcs = self.vectors.relative_cosine_similarity("good", "worst", 10)
self.assertTrue(rcs < 0.10)
Collaborator:

Is 0.10 an important threshold from the paper, or just chosen because it works? Is this sort of contrast – between a word good and a near-antonym worst – the sort of thing RCS is supposed to be good for?


def test_most_similar_raises_keyerror(self):
"""Test most_similar raises KeyError when input is out of vocab."""
with self.assertRaises(KeyError):
Expand Down