Doc2vec fails to train when using build_vocab_from_freq() #2083

Open · alexandry-augustin opened this issue Jun 5, 2018 · 3 comments
Labels: bug (Issue described a bug) · difficulty medium (Medium issue: requires good gensim understanding & python skills)

alexandry-augustin commented Jun 5, 2018

Description

I have a Doc2Vec model whose vocabulary is built with the build_vocab_from_freq() function, so that I can manually include a <PAD> token at index 0. This token does not appear in the original dataset, but is needed further down my program.

Steps/Code/Corpus to Reproduce

Here is a simple example of what I am trying to achieve:

import collections, sys

import gensim
from gensim.models.doc2vec import TaggedDocument

lines = [u'It is a truth universally acknowledged',
         u'This was invitation enough.',
         u'An invitation to dinner was soon afterwards dispatched']
words = [line.split() for line in lines]
doc_labels = [u'text0', u'text1', u'text2']
word_freq = collections.Counter([w for line in words for w in line])
word_freq['<PAD>'] = sys.maxint  # ensures the <PAD> token gets index 0 in gensim's vocabulary (Python 2; use sys.maxsize on Python 3)

class DocIterator(object):
    def __init__(self, docs, labels):
        self.docs = docs
        self.labels = labels
    def __iter__(self):
        for idx, doc in enumerate(self.docs):
            yield TaggedDocument(words=doc, tags=[self.labels[idx]])

doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab_from_freq(word_freq)
model.train(doc_it, total_examples=len(lines), epochs=10)

Expected Results

The expected value of model.docvecs.count is 3 (not 0).

Actual Results

The actual value of model.docvecs.count is 0:

print(model.docvecs.count)  # -> 0

Versions

Linux-3.19.0-82-generic-x86_64-with-Ubuntu-15.04-vivid
('Python', '2.7.9 (default, Apr 2 2015, 15:33:21) \n[GCC 4.9.2]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 1)

Now my questions are:

  • What is the correct way of using build_vocab_from_freq() to get a valid model?
  • Failing that, what is the best way to force gensim to include an unseen token at a specific index in the vocabulary?
gojomo (Collaborator) commented Jun 13, 2018

Doc2Vec requires the 'build vocab' preparation to also discover all corpus tags and allocate/initialize their vectors before training... but this (new, inherited-from-a-shared-superclass) build_vocab_from_freq() method doesn't do everything Doc2Vec needs, only what Word2Vec needs. It'd need to be overridden or marked as unsupported in Doc2Vec, or a complementary method added to help set up Doc2Vec state. (@manneshiva?)

Until that's done you could look at what build_vocab() does extra in the Doc2Vec case and do that to your model manually. But, since you have a full corpus iterator, could you just use the standard build_vocab() anyway, as a workaround?
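
For reference, a minimal sketch of that workaround, reusing the DocIterator and data from the original report. With the standard build_vocab() the document tags are discovered and the model trains as expected (note this does not place <PAD> at index 0, which the follow-up below comes back to):

# Workaround sketch: the standard build_vocab() scans the corpus itself,
# so it discovers the document tags that build_vocab_from_freq() misses.
doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab(doc_it)
model.train(doc_it, total_examples=model.corpus_count, epochs=10)
print(model.docvecs.count)  # 3, as expected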

alexandry-augustin (Author) commented

Thank you. In the scenario where I would be using build_vocab():

  • if the <PAD> token is not in the original dataset, I cannot add it later using build_vocab(..., update=True); this segfaults because Doc2Vec does not support vocabulary expansion (see Segmentation fault using build_vocab(..., update=True) for Doc2Vec #1019).

  • if the <PAD> token is in the original dataset as an additional document (so as not to interfere with the normal computation of the document vectors), I cannot easily guarantee that it will be assigned index 0. Since vocabulary indices are assigned in order of frequency (illustrated below), I would have to create an entire document filled with <PAD> tokens with a higher count than any other word in the dataset. This is highly undesirable since (i) I do not know the word frequencies in advance, and (ii) it fills up memory for no real reason.
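
To illustrate the frequency-ordering behaviour (a sketch, assuming gensim's default sorted_vocab=1 and the corpus objects from the report):

# Sketch: gensim sorts the vocabulary by descending frequency during
# build_vocab(), so index 0 always goes to the most frequent word.
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)
model.build_vocab(DocIterator(words, doc_labels))
print(model.wv.index2word[0])  # the most frequent word in the corpus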

Another option I investigated is the use of null_word. However, build_vocab() overrides any pre-existing vocabulary:

doc_it = DocIterator(words, doc_labels)
model = gensim.models.Doc2Vec(vector_size=100, min_count=0)

model.vocabulary.add_null_word(model.wv)  # add the null word to the (still empty) vocabulary
assert model.wv.vocab['\0'].index == 0

model.build_vocab(doc_it)  # wipes the pre-existing vocabulary

model.wv.vocab['\0']  # KeyError: '\x00'
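
For completeness, a hedged sketch of the reverse ordering, based on my reading of Word2VecVocab.add_null_word() in gensim 3.4 (it appends the token with index len(wv.vocab)): calling it after build_vocab() keeps the token, but at the last index rather than 0, and outside the weight matrices that build_vocab() has already allocated, so it is not a usable workaround either.

# Sketch, assuming gensim 3.4 internals: '\0' is appended at the end of
# the vocabulary, not at index 0, and the weight matrices sized by
# build_vocab() do not cover it.
model.build_vocab(doc_it)
model.vocabulary.add_null_word(model.wv)
print(model.wv.vocab['\0'].index)  # == len(model.wv.vocab) - 1, not 0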

menshikh-iv (Contributor) commented

@alexandry-augustin thanks for the report; this looks like a bug to me. There are several options for a fix:

  • disable this method for doc2vec (and maybe for fasttext; all of these cases need testing), see the sketch after this list
  • make this method "similar" to build_vocab (i.e. end-to-end)
  • fix the method for d2v (cover with tests + check how it works for the other *2vec models)
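
A minimal sketch of the first option, as a hypothetical patch (not gensim's actual fix): override the inherited method inside the Doc2Vec class so misuse fails loudly instead of silently producing a model with zero document vectors.

# Hypothetical patch sketch, to be placed in the Doc2Vec class:
def build_vocab_from_freq(self, *args, **kwargs):
    raise NotImplementedError(
        "Doc2Vec does not support build_vocab_from_freq(): it would not "
        "discover document tags; use build_vocab() with a corpus iterator."
    )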

menshikh-iv added the bug and difficulty medium labels on Aug 2, 2018