Added Function relative_cosine_similarity in keyedvectors.py #2307

Merged · 18 commits · Jan 15, 2019
Changes from 10 commits
34 changes: 34 additions & 0 deletions gensim/models/keyedvectors.py
@@ -195,6 +195,7 @@ class Vocab(object):
and for constructing binary trees (incl. both word leaves and inner nodes).

"""

Contributor:
Unrelated changes, please revert all of them (keep the PR compact).

Contributor (author):

Done.

def __init__(self, **kwargs):
self.count = 0
self.__dict__.update(kwargs)
@@ -209,6 +210,7 @@ def __str__(self):

class BaseKeyedVectors(utils.SaveLoad):
"""Abstract base class / interface for various types of word vectors."""

def __init__(self, vector_size):
self.vectors = zeros((0, vector_size))
self.vocab = {}
@@ -371,6 +373,7 @@ def rank(self, entity1, entity2):

class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
"""Class containing common methods for operations over word vectors."""

def __init__(self, vector_size):
super(WordEmbeddingsKeyedVectors, self).__init__(vector_size=vector_size)
self.vectors_norm = None
@@ -1384,12 +1387,42 @@ def init_sims(self, replace=False):
else:
self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

def relative_cosine_similarity(self, wa, wb, topn=10):
"""Compute the relative cosine similarity between two words given top-n similar words,
proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith
Contributor:

To make a proper link in the doc, please use

by `Artuur Leeuwenberg, ... <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_

Contributor (author):

Okay, done.

"A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
<https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>.

To calculate the relative cosine similarity between two words, equation (1) of the paper is used.
For WordNet synonyms, if rcs(topn=10) is greater than 0.10, then wa and wb are more similar than
an arbitrary pair of words.

Parameters
----------
wa: str
word for which we have to look top-n similar word.
Contributor:

Sentences should start with an uppercase letter.

Contributor (author):

Done.

wb: str
word for which we evaluating relative cosine similarity with wa.
topn: int, optional
Number of top-n similar words to look with respect to wa.
Returns
-------
numpy.float64
relative cosine similarity between wa and wb.
"""
sims = self.similar_by_word(wa, topn)
assert sims, "Cannot generate similar words"
rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
Contributor:

No need to wrap the left part in parentheses.

Owner (@piskvorky, Jan 11, 2019):

Actually, prepend float if this is meant to be a float division. Both to avoid potential errors due to integer operands in python2, and to make the intent clear.

Also, can you please unpack result into appropriately named variables, instead of writing result[1]?

Contributor (@menshikh-iv, Jan 11, 2019):

Suggested change:
- rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
+ rcs = float(self.similarity(wa, wb)) / sum(sim for _, sim in sims)

Owner (@piskvorky, Jan 11, 2019):

@menshikh-iv cool! You have to teach me how to do that :)


Contributor (author):

Done.


return rcs
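
For readers skimming the diff, here is a minimal standalone sketch of what equation (1) computes, written against gensim's public API (similar_by_word and similarity are the same calls the new method uses; the vector file path and the word pair below are placeholders, not part of this PR):

from gensim.models import KeyedVectors

def rcs_by_hand(kv, wa, wb, topn=10):
    """Equation (1): rcs_topn(wa, wb) = cos(wa, wb) / sum of cos(wa, w)
    over the topn words w most similar to wa."""
    top = kv.similar_by_word(wa, topn)        # [(word, cosine_sim), ...]
    denominator = sum(sim for _, sim in top)  # normalizing constant
    # float() makes the true division explicit; under Python 3 (or with
    # `from __future__ import division`) `/` is true division anyway.
    return float(kv.similarity(wa, wb)) / denominator

# Placeholder path and words; substitute any word2vec-format vectors.
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(rcs_by_hand(kv, 'good', 'nice'))  # should match the new method:
print(kv.relative_cosine_similarity('good', 'nice', topn=10))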


class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
"""Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

"""

def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None):
"""Store the input-hidden weight matrix in the same format used by the original
C word2vec-tool, for compatibility.
@@ -1895,6 +1928,7 @@ def int_index(self, index, doctags, max_rawint):

class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
"""Vectors and vocab for :class:`~gensim.models.fasttext.FastText`."""

def __init__(self, vector_size, min_n, max_n):
super(FastTextKeyedVectors, self).__init__(vector_size=vector_size)
self.vectors_vocab = None
21 changes: 21 additions & 0 deletions gensim/test/test_keyedvectors.py
@@ -104,6 +104,27 @@ def test_most_similar_topn(self):
predicted = self.vectors.most_similar('war', topn=None)
self.assertEqual(len(predicted), len(self.vectors.vocab))

def test_relative_cosine_similarity(self):
"""Test relative_cosine_similarity returns expected results with an input of a word pair and topn"""
wordnet_syn = ['good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
Contributor (@menshikh-iv, Jan 11, 2019):

Please format it properly, like:

wordnet_syn = [
    'good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
    'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
    'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
    'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
    'unspoiled', 'unspoilt', 'thoroughly', 'soundly',
]

Contributor (author):

Done.

'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
'unspoiled', 'unspoilt', 'thoroughly', 'soundly'] # synonyms for "good" as per wordnet
cos_sim = []
for i in range(len(wordnet_syn)):
if wordnet_syn[i] in self.vectors.vocab:
cos_sim.append(self.vectors.similarity("good", wordnet_syn[i]))
cos_sim = sorted(cos_sim, reverse=True) # cosine_similarity of "good" with wordnet_syn in decreasing order
# computing relative_cosine_similarity of two similar words
rcs_wordnet = self.vectors.similarity("good", "nice") / sum(cos_sim[i] for i in range(10))
Collaborator:

I'm not sure what this is calculating. It's kind of like the relative_cosine_similarity() formula, but now with only WordNet synonyms as contributors to the denominator. And only those synonyms which happen to be in this vector set. Are all those words in the euclidean_vectors.bin test vector set? As a result, I'm not sure what the following asserts really test. Is this matching something in the paper?

Contributor Author (@rsdel2007, Dec 31, 2018):

Actually, this is the problem I ran into while writing the test. The paper makes no hard claim and gives no exact expected result, and I can't find a way to confirm the numbers on a corpus other than WordNet, so I think the best approach is to compare the relative_cosine_similarity of WordNet synonyms against that of the most_similar words, under a threshold of 0.125.

Let me explain the insights of the paper's section on relative cosine similarity:

  1. Construct a set of the top 10 most (cosine) similar words for w1 (called topn in the paper).
  2. Calculate a normalized score for each word in the topn, by dividing its cosine similarity to w1 by the sum of the topn cosine similarity scores.

They mostly wanted to know whether the most similar word of w1 is a synonym, rather than a hypernym etc. They expected that if the most (cosine) similar word is a lot more (cosine) similar than the other words in the topn, it is more likely to be a synonym than if it is only slightly more similar. That is what the rcs takes into account.
So the one claim the paper does make is this: if a word pair has an rcs greater than 0.10, it is more likely to be a synonym pair than an arbitrary pair.
0.10 can be used as a threshold, but that figure is based on the WordNet corpus, and on a small corpus it may be lower. The threshold is simply the mean normalized score of the topn words: with topn=10, a word exactly as similar as the average top-10 neighbour gets an rcs of 1/10 = 0.10. So on a small corpus the right cutoff may be something below 0.10.

@gojomo, can you suggest a better way to test this, given the description above?

I am looking forward to helping improve the tests.
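
The decision rule described above reduces to a few lines; a sketch, with the helper name looks_like_synonym invented for illustration and the 0.10 cutoff taken from the paper:

def looks_like_synonym(kv, wa, wb, topn=10, threshold=0.10):
    # rcs > 1/topn means wb is more similar to wa than the average
    # top-n neighbour, which the paper takes as a cue for synonymy.
    return kv.relative_cosine_similarity(wa, wb, topn) > threshold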

Collaborator:

I don't know offhand what's in euclidean_vectors.bin - if there's any overlap with the wordnet words you've chosen. But if there's one or more word-and-nearest-neighbor pairs in that set-of-word-vectors (or some other available-at-unit-testing set-of-word-vectors) that the RCS measure successfully identifies as synonyms, and one or more other word-and-nearest-neighbor pairs that the RCS measure also successfully rejects as synonyms, then having the test method show that functionality would be useful as a demonstration/confirmation of the RCS functionality. (And, at least a little, a guard against any future regressions where that breaks due to other changes... which seems unlikely here, but is one of the reasons for ensuring this kind of test coverage.)

Maybe @viplexke, who originally suggested this in #2175, has some other application/test ideas?
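
A sketch of the kind of test suggested here, with hypothetical word pairs (any real pairs would first need to be verified against the vocabulary of the unit-test vector set):

def test_rcs_accepts_and_rejects_synonyms(self):
    # Hypothetical pairs; both words must exist in self.vectors.vocab.
    self.assertGreater(
        self.vectors.relative_cosine_similarity('good', 'nice', 10), 0.10)
    self.assertLess(
        self.vectors.relative_cosine_similarity('good', 'worst', 10), 0.10)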

rcs = self.vectors.relative_cosine_similarity("good", "nice", 10)
self.assertTrue(rcs_wordnet >= rcs)
self.assertTrue(np.allclose(rcs_wordnet, rcs, 0, 0.125))
# computing relative_cosine_similarity for two non-similar words
rcs = self.vectors.relative_cosine_similarity("good", "worst", 10)
self.assertTrue(rcs < 0.10)
Collaborator:

Is 0.10 an important threshold from the paper, or just chosen because it works? Is this sort of contrast – between a word good and a near-antonym worst – the sort of thing RCS is supposed to be good for?


def test_most_similar_raises_keyerror(self):
"""Test most_similar raises KeyError when input is out of vocab."""
with self.assertRaises(KeyError):