[MRG] Wrapper for FastText #847

Merged: 62 commits, Jan 24, 2017

Commits:
55a4fc9
updated refactor
Aug 18, 2016
e916f7e
commit missed file
Aug 18, 2016
e5416ed
docstring added
Aug 18, 2016
e64766b
more refactoring
Aug 19, 2016
c34cf37
add missing docstring
Aug 19, 2016
c9b31f9
fix docstring format
Aug 19, 2016
a0329af
clearer docstring
droudy Aug 19, 2016
0c0e2fa
minor typo in word2vec wmdistance
jayantj Sep 2, 2016
cdefeb0
pyemd error in keyedvecs
jayantj Sep 8, 2016
1aec5a2
relative import of keyedvecs from word2vec fails
jayantj Sep 8, 2016
e7368a3
bug in init_sims in word2vec
jayantj Sep 8, 2016
fe283c2
property descriptors for syn0, syn0norm, index2word, vocab - fixes bu…
jayantj Sep 8, 2016
9b36bc4
tests for loading older word2vec models
jayantj Sep 9, 2016
dfe1893
backwards compatibility for loading older models
jayantj Sep 9, 2016
4a03f20
test for syn0norm not saved to file
jayantj Sep 9, 2016
09b6ebe
syn0norm not saved to file for KeyedVectors
jayantj Sep 9, 2016
7df4138
tests and fix for accuracy
jayantj Sep 9, 2016
4c54d9b
minor bug in finalized vocab check
jayantj Sep 9, 2016
a28f9f1
warnings for direct syn0/syn0norm access
jayantj Sep 9, 2016
bf1182e
fixes use of most_similar in accuracy
jayantj Sep 10, 2016
5a6b97b
changes logging level to ERROR in word2vec tests
jayantj Sep 10, 2016
cfb2e1c
renames kv to wv in word2vec
jayantj Sep 12, 2016
b002765
minor bugs with checking existence of syn0
jayantj Sep 12, 2016
27c0a14
replaces syn0 and syn0norm with wv.syn0 and wv.syn0norm in tests and …
jayantj Sep 12, 2016
81f8cbb
adds changelog
jayantj Sep 12, 2016
aa7e632
initial fastText wrapper class
jayantj Aug 29, 2016
c780b9b
fasttext load binary data + oov vectors
jayantj Aug 29, 2016
ccf5a47
tests for fasttext wrapper
jayantj Sep 9, 2016
708113b
reduced memory requirements for fasttext model
jayantj Sep 9, 2016
b7de266
annoy indexer tests for fasttext
jayantj Sep 12, 2016
4d3d251
adds changelog and documentation
jayantj Sep 12, 2016
f2d13ce
renames kv to wv in fasttext wrapper
jayantj Sep 12, 2016
3777423
refactors syn0 word vector lookup into method
jayantj Sep 12, 2016
6e20834
updates keyedvector load tests to use actual values
jayantj Dec 16, 2016
564ea0d
Merge branch 'develop' into fasttext
jayantj Dec 18, 2016
caeb275
updates word2vec load old models tests + test models
jayantj Dec 19, 2016
784ffbf
more fasttext wrapper tests
jayantj Dec 22, 2016
20fe6f2
refactoring of some fasttext and word2vec methods
jayantj Dec 22, 2016
3b9483b
refactors FastText to use subclass of KeyedVectors, updates tests
jayantj Dec 22, 2016
f5cdfb6
Merge branch 'develop' into fasttext
jayantj Dec 26, 2016
700dd26
changes setUp for fast text unittests to setUpClass to reduce time taken
jayantj Dec 26, 2016
d30ea56
adds normalized ngram vectors for fasttext model, tests
jayantj Dec 27, 2016
bb6e538
deletes training files after loading model, tests
jayantj Dec 27, 2016
c7a5d07
doesnt match with oov words, tests
jayantj Dec 27, 2016
734057b
more asserts while loading from fasttext model file, renames some var…
jayantj Dec 27, 2016
56d89e9
updates FastText __contains__ to return True for all words for which …
jayantj Dec 27, 2016
dc51096
updates docstrings, adds comments for fasttext wrapper and tests
jayantj Dec 27, 2016
bb48663
adds fasttext test models
jayantj Dec 27, 2016
b58dd53
changes setUpClass to setUp to allow python2.6 compatibility
jayantj Jan 3, 2017
461a6b4
updates word2vec test model files
jayantj Jan 4, 2017
9137090
python2.6 compatibility for fasttext tests
jayantj Jan 4, 2017
e5ae899
Revert "updates keyedvector load tests to use actual values"
jayantj Jan 4, 2017
b98b40f
Merge branch 'develop' into fasttext
jayantj Jan 4, 2017
5eb8f75
replaces all instances of vocab and syn0 being accessed directly thro…
jayantj Jan 4, 2017
27bec7b
adds fasttext tutorial notebook
jayantj Jan 6, 2017
ef0e1e2
minor doc updates
jayantj Jan 6, 2017
ab07ef9
removes direct vocab access in FastText
jayantj Jan 6, 2017
2f37b04
suppresses numpy overflow warning while computing fasttext hash
jayantj Jan 6, 2017
b2ff794
minor doc + pep8 updates
jayantj Jan 11, 2017
7b0874a
adds warning to doesnt_match if word vector is missing
jayantj Jan 11, 2017
a7bceb6
minor fixes to fasttext tutorial
jayantj Jan 11, 2017
dee9f97
Merge branch 'develop' into fasttext
tmylk Jan 24, 2017
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -3,7 +3,7 @@ Changes

Unreleased:

None
* FastText wrapper added, can be used for training FastText word representations and performing word2vec operations over it

0.13.4.1, 2017-01-04

607 changes: 607 additions & 0 deletions docs/notebooks/FastText_Tutorial.ipynb

Large diffs are not rendered by default.

58 changes: 35 additions & 23 deletions gensim/models/keyedvectors.py
@@ -46,6 +46,25 @@ def save(self, *args, **kwargs):
kwargs['ignore'] = kwargs.get('ignore', ['syn0norm'])
super(KeyedVectors, self).save(*args, **kwargs)

def word_vec(self, word, use_norm=False):
"""
Accept a single word as input.
Returns the word's representations in vector space, as a 1D numpy array.

Example::

>>> trained_model.word_vec('office', use_norm=True)
array([ -1.40128313e-02, ...])

"""
if word in self.vocab:
if use_norm:
return self.syn0norm[self.vocab[word].index]
else:
return self.syn0[self.vocab[word].index]
else:
raise KeyError("word '%s' not in vocabulary" % word)

def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
"""
Find the top-N most similar words. Positive words contribute positively towards the
@@ -90,11 +109,10 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, i
for word, weight in positive + negative:
if isinstance(word, ndarray):
mean.append(weight * word)
elif word in self.vocab:
mean.append(weight * self.syn0norm[self.vocab[word].index])
all_words.add(self.vocab[word].index)
else:
raise KeyError("word '%s' not in vocabulary" % word)
mean.append(weight * self.word_vec(word, use_norm=True))
if word in self.vocab:
Review by @piskvorky (Owner), Jan 11, 2017:
    Dead code test, can never reach here (the above line would throw a KeyError).

Reply by the author (Contributor):
    The KeyError has been removed.

Reply by @piskvorky (Owner):
    No, it's still there, on line 66.

Reply by the author (Contributor):
    That line raises a KeyError in case `word in self.vocab` is False. So in case it's True, line 115 would be executed. Also, word_vec has been overridden in the KeyedVectors subclass for FastText.

Reply by @piskvorky (Owner), Jan 12, 2017:
    Yes, my point is: isn't it always True? How could it be False, when that would raise an exception at the line above? The test seems superfluous. But if subclasses can make word_vec() behave differently (not raise for missing words), then it makes sense. Not sure what the general contract for word_vec() behaviour is.
all_words.add(self.vocab[word].index)
if not mean:
raise ValueError("cannot compute similarity with no input")
mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
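The review thread above turns on the contract of word_vec(): whether it may succeed for words outside self.vocab. A minimal sketch of that contract, using hypothetical stand-in classes (not gensim's actual implementation), shows why the `word in self.vocab` test is not dead code once a subclass overrides the lookup, as the FastText wrapper does via character ngrams:

```python
class BaseVectors:
    """Toy analogue of KeyedVectors: lookups raise for OOV words."""
    def __init__(self, vocab, vectors):
        self.vocab = vocab      # word -> row index
        self.vectors = vectors  # list of word vectors

    def word_vec(self, word):
        if word in self.vocab:
            return self.vectors[self.vocab[word]]
        raise KeyError("word '%s' not in vocabulary" % word)


class NgramVectors(BaseVectors):
    """Overrides word_vec() so OOV lookups succeed (FastText-style);
    the ngram combination itself is stubbed out here."""
    def word_vec(self, word):
        if word in self.vocab:
            return self.vectors[self.vocab[word]]
        return [0.0, 0.0]  # stand-in for a vector summed from char ngrams


kv = NgramVectors({"office": 0}, [[1.0, 2.0]])
vec = kv.word_vec("offices")        # succeeds for an OOV word...
in_vocab = "offices" in kv.vocab    # ...while the vocab test is still False
```

So after `word_vec()` returns, the caller still needs the vocab check to decide whether an index exists to add to all_words.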
@@ -230,22 +248,14 @@ def most_similar_cosmul(self, positive=[], negative=[], topn=10):
# allow calls like most_similar_cosmul('dog'), as a shorthand for most_similar_cosmul(['dog'])
positive = [positive]

all_words = set()

def word_vec(word):
if isinstance(word, ndarray):
return word
elif word in self.vocab:
all_words.add(self.vocab[word].index)
return self.syn0norm[self.vocab[word].index]
else:
raise KeyError("word '%s' not in vocabulary" % word)

positive = [word_vec(word) for word in positive]
negative = [word_vec(word) for word in negative]
positive = [self.word_vec(word, use_norm=True) for word in positive]
negative = [self.word_vec(word, use_norm=True) for word in negative]
if not positive:
raise ValueError("cannot compute similarity with no input")

all_words = set([self.vocab[word].index for word in positive+negative if word in self.vocab])
Review by @piskvorky (Owner):
    What is the all_words created above for?

Reply by the author (Contributor):
    To remove the input words from the returned most_similar words.

Reply by @piskvorky (Owner), Jan 11, 2017:
    Eh, never mind: the review snippet showed me the code for all_words from most_similar above, so I thought it was the same function. Disregard my comment. Square brackets [ ] are not needed inside the set().
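On the reviewer's bracket note: `set([...])` first materializes a throwaway list, while a set comprehension (or a bare generator expression passed to `set()`) yields the same set without the intermediate list. A tiny self-contained illustration with a toy vocab (not gensim's):

```python
vocab = {"dog": 0, "cat": 1}    # word -> index, stand-in for self.vocab
words = ["dog", "fish", "cat"]

with_list = set([vocab[w] for w in words if w in vocab])  # builds a list first
with_comp = {vocab[w] for w in words if w in vocab}       # set comprehension
with_gen = set(vocab[w] for w in words if w in vocab)     # generator, no brackets

# all three produce the same set of indices
same = with_list == with_comp == with_gen
```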


# equation (4) of Levy & Goldberg "Linguistic Regularities...",
# with distances shifted to [0,1] per footnote (7)
pos_dists = [((1 + dot(self.syn0norm, term)) / 2) for term in positive]
@@ -311,14 +321,16 @@ def doesnt_match(self, words):
"""
self.init_sims()

words = [word for word in words if word in self.vocab] # filter out OOV words
logger.debug("using words %s" % words)
if not words:
used_words = [word for word in words if word in self]
if len(used_words) != len(words):
ignored_words = set(words) - set(used_words)
logger.warning("vectors for words %s are not present in the model, ignoring these words", ignored_words)
if not used_words:
raise ValueError("cannot select a word from an empty list")
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)
mean = matutils.unitvec(vectors.mean(axis=0)).astype(REAL)
dists = dot(vectors, mean)
return sorted(zip(dists, words))[0][1]
return sorted(zip(dists, used_words))[0][1]
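The hunk above changes doesnt_match() to warn about and skip words with no vector in the model, rather than filtering them with only a debug message. A standalone sketch of just that filtering step, using a hypothetical filter_oov helper (not part of gensim):

```python
import logging

def filter_oov(words, known):
    """Keep only words present in `known`; warn about the rest,
    mirroring the filtering step added to doesnt_match()."""
    used_words = [w for w in words if w in known]
    if len(used_words) != len(words):
        ignored = set(words) - set(used_words)
        logging.warning(
            "vectors for words %s are not present in the model, "
            "ignoring these words", ignored)
    if not used_words:
        raise ValueError("cannot select a word from an empty list")
    return used_words

print(filter_oov(["breakfast", "lunch", "qwerty"],
                 {"breakfast", "lunch", "dinner"}))  # -> ['breakfast', 'lunch']
```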

def __getitem__(self, words):

@@ -345,9 +357,9 @@ def __getitem__(self, words):
"""
if isinstance(words, string_types):
# allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
return self.syn0[self.vocab[words].index]
return self.word_vec(words)

return vstack([self.syn0[self.vocab[word].index] for word in words])
return vstack([self.word_vec(word) for word in words])

def __contains__(self, word):
return word in self.vocab
5 changes: 4 additions & 1 deletion gensim/models/word2vec.py
@@ -435,7 +435,7 @@ def __init__(
else:
logger.debug('Fast version of {0} is being used'.format(__name__))

self.wv = KeyedVectors() # wv --> KeyedVectors
self.initialize_word_vectors()
self.sg = int(sg)
self.cum_table = None # for negative sampling
self.vector_size = int(size)
@@ -469,6 +469,9 @@ def __init__(
self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)

def initialize_word_vectors(self):
self.wv = KeyedVectors()
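The new initialize_word_vectors() hook is a small template method: __init__ delegates the creation of self.wv, so a subclass can substitute its own KeyedVectors variant without copying the whole constructor. A minimal sketch with stand-in class names (FastTextKeyedVectors here is hypothetical, not this PR's actual class):

```python
class KeyedVectors:
    pass

class FastTextKeyedVectors(KeyedVectors):  # hypothetical subclass
    pass

class Word2Vec:
    def __init__(self):
        # template method: subclasses override the hook, not __init__
        self.initialize_word_vectors()

    def initialize_word_vectors(self):
        self.wv = KeyedVectors()

class FastText(Word2Vec):
    def initialize_word_vectors(self):
        self.wv = FastTextKeyedVectors()
```

With this shape, FastText() gets a FastTextKeyedVectors instance in self.wv while reusing everything else in Word2Vec's constructor.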

def make_cum_table(self, power=0.75, domain=2**31 - 1):
"""
Create a cumulative-distribution table using stored vocabulary word counts for
3 changes: 2 additions & 1 deletion gensim/models/wrappers/__init__.py
@@ -5,4 +5,5 @@
from .ldamallet import LdaMallet
from .dtmmodel import DtmModel
from .ldavowpalwabbit import LdaVowpalWabbit
from .wordrank import Wordrank
from .fasttext import FastText
from .wordrank import Wordrank