Added Function relative_cosine_similarity in keyedvectors.py #2307
keyedvectors.py
@@ -195,6 +195,7 @@ class Vocab(object):
    and for constructing binary trees (incl. both word leaves and inner nodes).

    """

    def __init__(self, **kwargs):
        self.count = 0
        self.__dict__.update(kwargs)
@@ -209,6 +210,7 @@ def __str__(self):

class BaseKeyedVectors(utils.SaveLoad):
    """Abstract base class / interface for various types of word vectors."""

    def __init__(self, vector_size):
        self.vectors = zeros((0, vector_size))
        self.vocab = {}
@@ -371,6 +373,7 @@ def rank(self, entity1, entity2):

class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
    """Class containing common methods for operations over word vectors."""

    def __init__(self, vector_size):
        super(WordEmbeddingsKeyedVectors, self).__init__(vector_size=vector_size)
        self.vectors_norm = None
@@ -1384,12 +1387,42 @@ def init_sims(self, replace=False):
        else:
            self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)
    def relative_cosine_similarity(self, wa, wb, topn=10):
        """Compute the relative cosine similarity between two words given top-n similar words,
        proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith in
        "A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
        <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>.

Review comment: to make a proper link in the doc, please use …
Reply: Okay, done.
        To calculate the relative cosine similarity between two words, equation (1) of the paper is used.

        For WordNet synonyms, if rcs(topn=10) is greater than 0.10, then wa and wb are more similar than
        an arbitrary pair of words.
        Parameters
        ----------
        wa : str
            Word for which the top-n similar words are looked up.

Review comment: The sentence should start with an uppercase letter.
Reply: Done.

        wb : str
            Word for which the relative cosine similarity with `wa` is evaluated.
        topn : int, optional
            Number of top-n similar words to consider with respect to `wa`.
        Returns
        -------
        numpy.float64
            Relative cosine similarity between `wa` and `wb`.

        """
        sims = self.similar_by_word(wa, topn)
        assert sims, "Cannot generate similar words"

        rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
Review comment: no need to wrap the left part with …
Review comment: Actually, prepend … Also, can you please unpack …
Suggested change: …
Reply: @menshikh-iv cool! You have to teach me how to do that :)
Reply: Done.
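One possible reading of the reviewer's unpacking request (the exact suggested change is not shown above, so this is only an illustrative sketch) is to unpack the (word, similarity) tuples returned by `similar_by_word` instead of indexing `result[1]`:

    # Illustrative sketch only -- not necessarily the exact suggestion from the review:
    # unpack each (word, similarity) pair from `sims` rather than using result[1].
    rcs = self.similarity(wa, wb) / sum(similarity for _, similarity in sims)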
        return rcs
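For reference, equation (1) of the paper defines rcs over the top-n neighbours of wa as cos(wa, wb) divided by the sum of cos(wa, wc) for each neighbour wc, which is what the method above computes. A minimal usage sketch (the pretrained dataset name and word pair are illustrative; any trained gensim KeyedVectors instance would do):

    import gensim.downloader as api

    # Assumption: any pretrained word vectors work here; this dataset name is only an example.
    kv = api.load('glove-wiki-gigaword-50')

    # Relative cosine similarity of "good" vs. "nice" over the 10 nearest neighbours of "good".
    rcs = kv.relative_cosine_similarity('good', 'nice', topn=10)

    # Equivalent manual computation, following equation (1) of the paper.
    sims = kv.similar_by_word('good', topn=10)
    manual = kv.similarity('good', 'nice') / sum(sim for _, sim in sims)
    assert abs(rcs - manual) < 1e-6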
class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
    """Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
    Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

    """

    def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None):
        """Store the input-hidden weight matrix in the same format used by the original
        C word2vec-tool, for compatibility.
@@ -1895,6 +1928,7 @@ def int_index(self, index, doctags, max_rawint):

class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
    """Vectors and vocab for :class:`~gensim.models.fasttext.FastText`."""

    def __init__(self, vector_size, min_n, max_n):
        super(FastTextKeyedVectors, self).__init__(vector_size=vector_size)
        self.vectors_vocab = None
Test file:
@@ -104,6 +104,27 @@ def test_most_similar_topn(self):
        predicted = self.vectors.most_similar('war', topn=None)
        self.assertEqual(len(predicted), len(self.vectors.vocab))
    def test_relative_cosine_similarity(self):
        """Test that relative_cosine_similarity returns expected results for an input word pair and topn."""
        wordnet_syn = ['good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
                       'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
                       'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
                       'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
                       'unspoiled', 'unspoilt', 'thoroughly', 'soundly']  # synonyms for "good" as per wordnet

Review comment: format it properly, please, like

    wordnet_syn = [
        'good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
        'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
        'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
        'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
        'unspoiled', 'unspoilt', 'thoroughly', 'soundly',
    ]

Reply: Done.
        cos_sim = []
        for i in range(len(wordnet_syn)):
            if wordnet_syn[i] in self.vectors.vocab:
                cos_sim.append(self.vectors.similarity("good", wordnet_syn[i]))
        cos_sim = sorted(cos_sim, reverse=True)  # cosine_similarity of "good" with wordnet_syn in decreasing order
        # computing relative_cosine_similarity of two similar words
        rcs_wordnet = self.vectors.similarity("good", "nice") / sum(cos_sim[i] for i in range(10))
Review comment: I'm not sure what this is calculating. It's kind of like the …

Reply: Actually, this is the problem I found while writing the test. There is no specific claim or exact result in the paper, and I can't find any way to confirm it on a corpus other than … Let me explain the insight of the paper's section on relative cosine similarity: the authors mostly wanted to know whether the most similar word of w1 is a synonym, as opposed to a hypernym etc. They expected that if the most (cosine) similar word is a lot more (cosine) similar than the other words in the topn, it is more likely to be a synonym than if it is only slightly more similar. This is what rcs takes into account. @gojomo, can you suggest a better way to test this, given the description above? I am looking forward to contributing the tests.

Review comment: I don't know offhand what's in … Maybe @viplexke, who originally suggested this in #2175, has some other application/test ideas?
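A small sketch of how that synonym-vs-neighbour intuition could be used in practice, applying the rcs(topn=10) > 0.10 rule of thumb quoted in the docstring (hedged: `kv` stands for any trained gensim KeyedVectors instance and the query word is illustrative):

    # Illustrative sketch only: keep the neighbours of "good" whose relative cosine
    # similarity clears the 0.10 threshold suggested for WordNet synonyms.
    candidates = [word for word, _ in kv.similar_by_word('good', topn=10)]
    likely_synonyms = [
        word for word in candidates
        if kv.relative_cosine_similarity('good', word, topn=10) > 0.10
    ]
    print(likely_synonyms)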
        rcs = self.vectors.relative_cosine_similarity("good", "nice", 10)
        self.assertTrue(rcs_wordnet >= rcs)
        self.assertTrue(np.allclose(rcs_wordnet, rcs, 0, 0.125))
        # computing relative_cosine_similarity for two non-similar words
        rcs = self.vectors.relative_cosine_similarity("good", "worst", 10)
        self.assertTrue(rcs < 0.10)
Review comment: Is …
    def test_most_similar_raises_keyerror(self):
        """Test most_similar raises KeyError when input is out of vocab."""
        with self.assertRaises(KeyError):
Review comment: unrelated changes, please revert all of them (stay PR compact).
Reply: Done.