Doc2vec fails to train when using build_vocab_from_freq() #2083

Open · alexandry-augustin opened this issue Jun 5, 2018 · 3 comments
Labels: bug (Issue described a bug) · difficulty medium (Medium issue: requires good gensim understanding & python skills)

alexandry-augustin commented Jun 5, 2018

Description

I have a Doc2Vec model whose vocabulary is built with the build_vocab_from_freq() function, so that I can manually include a <PAD> token at index 0. This token does not appear in the original dataset, but is needed further down my program.

Steps/Code/Corpus to Reproduce

Here is a simple example of what I am trying to achieve:

import collections, sys

import gensim
from gensim.models.doc2vec import TaggedDocument

lines = [u'It is a truth universally acknowledged',
         u'This was invitation enough.',
         u'An invitation to dinner was soon afterwards dispatched']
words = [line.split() for line in lines]
doc_labels = [u'text0', u'text1', u'text2']
word_freq = collections.Counter([w for line in words for w in line])
word_freq['<PAD>'] = sys.maxint  # ensures the <PAD> token gets index 0 in gensim's vocabulary (Python 2; use sys.maxsize on Python 3)

class DocIterator(object):
    def __init__(self, docs, labels):
        self.docs = docs
        self.labels = labels
    def __iter__(self):
        for idx, doc in enumerate(self.docs):
            yield TaggedDocument(words=doc, tags=[self.labels[idx]])

doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab_from_freq(word_freq)
model.train(doc_it, total_examples=len(lines), epochs=10)

Expected Results

The expected value of model.docvecs.count is 3 (not 0).

Actual Results

The actual value of model.docvecs.count is 0:

print(model.docvecs.count)  # -> 0

Versions

Linux-3.19.0-82-generic-x86_64-with-Ubuntu-15.04-vivid
('Python', '2.7.9 (default, Apr 2 2015, 15:33:21) \n[GCC 4.9.2]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)

Now my questions are:

  • What is the correct way of using build_vocab_from_freq() to get a valid model?
  • Failing that, what is the best way to force gensim to include an unseen token at a specific index in the vocabulary?
gojomo (Collaborator) commented Jun 13, 2018

Doc2Vec requires the 'build vocab' preparation to also discover all corpus tags and allocate/initialize their vectors before training... but this (new, inherited-from-a-shared-superclass) build_vocab_from_freq() method doesn't do everything Doc2Vec needs, only what Word2Vec needs. It'd need to be overridden or marked as unsupported in Doc2Vec, or a complementary method added to help set up Doc2Vec state. (@manneshiva?)

Until that's done you could look at what build_vocab() does extra in the Doc2Vec case and do that to your model manually. But, since you have a full corpus iterator, could you just use the standard build_vocab() anyway, as a workaround?
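
For reference, a minimal sketch of that workaround, reusing the DocIterator and data from the original report. With the standard build_vocab() the document tags are discovered and the model trains as expected (note this does not place <PAD> at index 0, which the follow-up below comes back to):

# Workaround sketch: the standard build_vocab() scans the corpus itself,
# so it discovers the document tags that build_vocab_from_freq() misses.
doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab(doc_it)
model.train(doc_it, total_examples=model.corpus_count, epochs=10)
print(model.docvecs.count)  # 3, as expected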

alexandry-augustin (Author) commented

Thank you. In the scenario where I would be using build_vocab():

  • if the <PAD> token is not in the original dataset, I cannot add it later using build_vocab(..., update=True); this segfaults because Doc2Vec does not support vocabulary expansion (see Segmentation fault using build_vocab(..., update=True) for Doc2Vec #1019).

  • if the <PAD> token is in the original dataset as an additional document (so as not to interfere with the normal computation of the document vectors), I cannot easily guarantee that it will be assigned index 0. Since vocabulary indices are assigned in order of frequency (illustrated below), I would have to create an entire document filled with <PAD> tokens with a higher count than any other word in the dataset. This is highly undesirable since (i) I do not know the word frequencies in advance, and (ii) it fills up memory for no real reason.
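
To illustrate the frequency-ordering behaviour (a sketch, assuming gensim's default sorted_vocab=1 and the corpus objects from the report):

# Sketch: gensim sorts the vocabulary by descending frequency during
# build_vocab(), so index 0 always goes to the most frequent word.
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab(DocIterator(words, doc_labels))
print(model.wv.index2word[0])  # the most frequent word in the corpus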

Another option I investigated is the use of null_word. However, build_vocab() overrides any pre-existing vocabulary:

doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)

model.vocabulary.add_null_word(model.wv)  # add the null word to the (still empty) vocabulary
assert model.wv.vocab['\0'].index == 0

model.build_vocab(doc_it)  # wipes the pre-existing vocabulary

model.wv.vocab['\0']  # KeyError: '\x00'
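
For completeness, a hedged sketch of the reverse ordering, based on my reading of Word2VecVocab.add_null_word() in gensim 3.4 (it appends the token with index len(wv.vocab)): calling it after build_vocab() keeps the token, but at the last index rather than 0, and outside the weight matrices that build_vocab() has already allocated, so it is not a usable workaround either.

# Sketch, assuming gensim 3.4 internals: '\0' is appended at the end of
# the vocabulary, not at index 0, and the weight matrices sized by
# build_vocab() do not cover it.
model.build_vocab(doc_it)
model.vocabulary.add_null_word(model.wv)
print(model.wv.vocab['\0'].index)  # == len(model.wv.vocab) - 1, not 0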

menshikh-iv (Contributor) commented

@alexandry-augustin thanks for the report; this looks like a bug to me. There are several options for a fix:

  • disable this method for doc2vec (and maybe for fasttext; all of these cases need testing), see the sketch after this list
  • make this method "similar" to build_vocab (i.e. end-to-end)
  • fix the method for d2v (cover with tests + check how it works for the other *2vec models)
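
A minimal sketch of the first option, as a hypothetical patch (not gensim's actual fix): override the inherited method inside the Doc2Vec class so misuse fails loudly instead of silently producing a model with zero document vectors.

# Hypothetical patch sketch, to be placed in the Doc2Vec class:
def build_vocab_from_freq(self, *args, **kwargs):
    raise NotImplementedError(
        "Doc2Vec does not support build_vocab_from_freq(): it would not "
        "discover document tags; use build_vocab() with a corpus iterator."
    )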

menshikh-iv added the bug and difficulty medium labels on Aug 2, 2018