Doc2Vec: when we have string tags, build_vocab with update removes previous index #3162

Closed
espdev opened this issue Jun 4, 2021 · 13 comments

Comments

@espdev

espdev commented Jun 4, 2021

Problem description

I'm trying to resume training my Doc2Vec model with string tags, but model.build_vocab removes all previously existing keys from model.dv.

Steps/code/corpus to reproduce

A simple example to reproduce this:

import string

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [tag]) for tag, doc in zip(string.ascii_lowercase, common_texts)]
documents1 = documents[:6]
documents2 = documents[6:]

model = Doc2Vec(vector_size=5, window=2, min_count=1)

model.build_vocab(documents1)
model.train(documents1, total_examples=len(documents1), epochs=5)

model.save('model')
model = Doc2Vec.load('model')

print('Vector count after train:', len(model.dv))
print('Keys:', model.dv.index_to_key)

model.build_vocab(documents2, update=True)
model.train(documents2, total_examples=model.corpus_count, epochs=model.epochs)

print('Vector count after update:', len(model.dv))
print('Keys:', model.dv.index_to_key)

model.save('model')
model = Doc2Vec.load('model')

print('Vector count after load:', len(model.dv))
print('Keys:', model.dv.index_to_key)

Output:

Vector count after train: 6
Keys: ['a', 'b', 'c', 'd', 'e', 'f']
Vector count after update: 3
Keys: ['g', 'h', 'i']
Vector count after load: 3
Keys: ['g', 'h', 'i']

And we have an interesting behavior:

print('b' in model.dv)
# True
print(model.dv['b'])
# [ 0.00524729 -0.19762747 -0.10339681 -0.19433555  0.04022206]

The tag still seems to exist in the model after updating, but len() and index_to_key do not show it.

At the same time, the same code with int tags appears to work correctly:

documents = [TaggedDocument(doc, [tag]) for tag, doc in enumerate(common_texts)]
documents1 = documents[:6]
documents2 = documents[6:]
...
Vector count after train: 6
Keys: [0, 1, 2, 3, 4, 5]
Vector count after update: 9
Keys: [0, 1, 2, 3, 4, 5, 6, 7, 8]
Vector count after load: 9
Keys: [0, 1, 2, 3, 4, 5, 6, 7, 8]

Versions

Windows-10-10.0.19041-SP0
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]
Bits 64
NumPy 1.20.3
SciPy 1.6.1
gensim 4.0.1
FAST_VERSION 0
@gojomo
Collaborator

gojomo commented Jun 4, 2021

I don't think build_vocab(... update=True) has ever reliably worked for Doc2Vec models.

So, it is a bug that people are tempted to try it, and that it doesn't do anything useful - but unless a new strong advocate/implementor emerges for this functionality (which hasn't happened in the 4.5 years #1019 has been open), it's as likely to be formally disabled (per suggestion here in 2017) as to be fixed.

Note that while expanding the model's set of known words or pre-trained doc-tags is the thorny possibility that's not been working, the simpler task of just calculating doc-vectors for new texts, within the known vocabulary, is possible via infer_vector().
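
For illustration, a minimal sketch of that infer_vector() path, reusing a trained Doc2Vec model like the one from the reproduction script above (the example words are arbitrary tokens from common_texts):

# Infer a vector for a new, unseen text with the frozen model;
# words outside the trained vocabulary are silently ignored.
new_vector = model.infer_vector(['human', 'computer', 'interface'])

# The inferred vector can then be compared against the trained doc-vectors.
print(model.dv.most_similar([new_vector], topn=3))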

@espdev
Author

espdev commented Jun 5, 2021

@gojomo,

Thanks for your answer. I think that, for now, disabling the update flag for Doc2Vec models and fixing the documentation would be a reasonable action.

However, this functionality is important, for my use case as well. Currently, I have almost 10M documents in the index, and almost 1M documents will be added every year (new document batches arrive every week). I need to add these docs to the index so I can search for similar documents across all documents in our database. Without updating the existing model, I need to retrain the whole model every week or month. That is expensive for us and our infrastructure.

I know about infer_vector and I use it. But I cannot use it to add new documents to the index; for that I would need a separate index, for example faiss or hnswlib.

@piskvorky
Owner

piskvorky commented Jun 5, 2021

I have almost 10M documents in the index and every year almost 1M documents will be added
It is expensive for us and our infrastructure.

Would you be able to quantify this? What are your expenses: CPU times, wallclock, $$$…

I'm only asking because the training is pretty fast, so having hard numbers from a real-world use-case will help motivate any work here. 10M documents doesn't strike me as a particularly large corpus. That's on the order of the English Wikipedia, which trains in hours, IIRC.

@gojomo
Collaborator

gojomo commented Jun 5, 2021

I'm only asking because the training is pretty fast, so having hard numbers from a real-world use-case will help motivate any work here. 10M documents doesn't strike me as a particularly large corpus. That's on the order of the English Wikipedia, which trains in hours, IIRC.

Indeed, I believe people tend to spend a lot of effort trying to improvise/debug an incremental-update process, which then has never-quantified impacts on overall results quality. (Any time/compute savings you get from not re-presenting old texts includes with it a risk that the model's understanding of those old texts will be subtly diluted/erased by training only on a tiny number of different newer examples.)

That is often a premature optimization, when less effort could instead be spent on setting up a process for low-effort/low-cost automated reindexing on convenient intervals. For example, scheduling a 12hr from-scratch training to run overnight, or a 48hr from-scratch training to happen over a weekend, may involve just tens of dollars of compute costs and (after initial creation) essentially no marginal R&D effort, and can run well for months/years.

Also: if you are in fact using Gensim's big sets of vectors – such as a Doc2Vec model's .dv set of doc-tag-named vectors – as your actual search/retrieval backend:

  • you can either split such searches over multiple groups of vectors (then merge the results), or (with a little effort) merge all the candidates into one large set - so you don't need build_vocab(..., update=True) style re-training of a model just to add new inferred vectors into the candidate set. (It'd only be needed if the model must learn new words, or adjust the impact of older words, based on newer data.) See for example the add_vectors() method of KeyedVectors (of which the Doc2Vec.dv object is an example), and the sketch after this list - but keep in mind each add requires a re-allocation equal to the old-plus-new size, so doing it in large batches, with plenty of free RAM, is better than lots of one-at-a-time additions.
  • you'll always be paying the cost of a full-scan of all candidate matches, then ranking all results, in most_similar() - something that will hit limits given the inherent delays accessing large amounts of RAM, or when the dataset outgrows one system's RAM, but can then be parallelized/sharded across multiple machines if needed
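
A rough sketch of that add-new-inferred-vectors path, assuming gensim 4.x (where model.dv is a KeyedVectors); the new_docs tags and words here are hypothetical:

import numpy as np

# Hypothetical new documents (tag, words) that arrived after training.
new_docs = [
    ('new_doc_1', ['human', 'computer', 'interface']),
    ('new_doc_2', ['graph', 'trees', 'survey']),
]

# Infer a vector for each new text from the frozen model...
new_keys = [tag for tag, _ in new_docs]
new_vectors = np.vstack([model.infer_vector(words) for _, words in new_docs])

# ...and add them to the existing candidate set in one batch
# (each call re-allocates an old-plus-new array, so batch rather than add one-by-one).
model.dv.add_vectors(new_keys, new_vectors)

# The merged set can then be searched as usual.
print(model.dv.most_similar('new_doc_1', topn=3))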

@espdev
Author

espdev commented Jun 10, 2021

@piskvorky @gojomo

Sorry for not replying to you earlier. Thanks for your replies and the explanation.
You are probably right; premature optimization is mostly not needed.

Currently, our model trains for about 8 hours on 12 CPU cores, with 30 epochs and dim 100. I guess we can afford to do it every week. And I agree, a model fully trained from scratch will be better than an incrementally trained one... probably.

@gojomo
Collaborator

gojomo commented Jun 11, 2021

Thanks for the extra context! A few other things that often help speed such jobs, if you haven't tried these already:

  • if using an iterable corpus, ensure your iterable isn't repeating expensive steps (like regex-based preprocessing) on each iteration - and even if you have 12 cores, test fewer than 12 workers (the fastest training throughput in that mode often uses fewer threads)
  • or alternatively, use corpus_file mode (if you can live with auto-serial-number single tags per doc), as it will use all cores more effectively - see the sketch after this list
  • use a more-aggressive (larger) min_count or more-aggressive (smaller) sample - each shrinks effective corpus size
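
A sketch of the corpus_file route, with illustrative settings and a hypothetical file path; in this mode each line of the plain-text file is one pre-tokenized document, and its line number becomes its single tag:

from gensim.models.doc2vec import Doc2Vec

# corpus.txt: one pre-tokenized document per line, words separated by whitespace.
# Each document gets its line number as its only tag, so string IDs would have
# to be mapped to line numbers separately.
model = Doc2Vec(
    corpus_file='corpus.txt',
    vector_size=100,
    min_count=5,
    epochs=30,
    workers=12,  # corpus_file mode scales across many cores better than an iterable corpus
)
model.save('doc2vec_corpus_file.model')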

@espdev
Author

espdev commented Jun 11, 2021

@gojomo

Currently, we use an iterable corpus because we use string identifiers for the documents, but the iterator is just a generator with minimal overhead that reads a prepared/normalized TXT (or txt.gz) file, something like this:

ID1 word11 word12 word1N
ID2 word21 word22 word2N
...
from pathlib import Path
from typing import Iterable

from gensim.models.doc2vec import TaggedDocument

# PathType and patch_pathlib are project-specific helpers (not shown here).

def text_tagged_corpus(doc_corpus: PathType) -> Iterable[TaggedDocument]:
    with patch_pathlib():
        with Path(doc_corpus).open('r', encoding='utf-8') as fp:
            for line in fp:
                # The first whitespace-separated token is the document ID (tag),
                # the rest are the pre-tokenized words.
                doc_id, *words = line.strip().split()
                yield TaggedDocument(words, tags=[doc_id])


class TextTaggedCorpus:
    def __init__(self, doc_corpus: PathType):
        self.doc_corpus = Path(doc_corpus)

    def __iter__(self):
        return text_tagged_corpus(self.doc_corpus)

use a more-aggressive (larger) min_count or more-aggressive (smaller) sample - each shrinks effective corpus size

Yes, I am also trying to tune these parameters.

@gojomo
Collaborator

gojomo commented Jun 11, 2021

OK, you're already doing the key things. One last consideration: if your doc_corpus is a path to anything that reads slowly - a remote volume, a spinning disk, compression that decompresses slowly - then adjusting that has a chance of a noticeable speedup. (For example, moving the corpus to a local SSD, testing against an uncompressed or alternatively-compressed file, or even bringing the full corpus into RAM if the corpus & machine RAM allow it.)
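
For example, a minimal sketch of the bring-it-into-RAM option, reusing the TextTaggedCorpus class from the earlier comment (the path and training settings are hypothetical):

from gensim.models.doc2vec import Doc2Vec

# Materialize the corpus once so every epoch re-reads RAM instead of disk/gzip.
documents_in_ram = list(TextTaggedCorpus('corpus.txt.gz'))

model = Doc2Vec(vector_size=100, min_count=5, workers=10, epochs=30)
model.build_vocab(documents_in_ram)
model.train(documents_in_ram, total_examples=model.corpus_count, epochs=model.epochs)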

@jexterliangsufe

I don't think build_vocab(... update=True) has ever reliably worked for Doc2Vec models.

So, it is a bug that people are tempted to try it, and that it doesn't do anything useful - but unless a new strong advocate/implementor emerges for this functionality (which hasn't happened in the 4.5 years #1019 has been open), it's as likely to be formally disabled (per suggestion here in 2017) as to be fixed.

Note that while expanding the model's set of known words or pre-trained doc-tags is the thorny possibility that's not been working, the simpler task of just calculating doc-vectors for new texts, within the known vocabulary, is possible via infer_vector().

I'm new to using Doc2Vec and have a question about this. If I can't train my model on the total data, and instead each time I input 10% of the data and build a completely new vocabulary for it, will that work?

@gojomo
Collaborator

gojomo commented Aug 4, 2021

I'm new to using Doc2Vec and have a question about this. If I can't train my model on the total data, and instead each time I input 10% of the data and build a completely new vocabulary for it, will that work?

The supported way of using it, derived from published work, is to (1) do initial vocabulary-discovery & training on as much relevant data as possible, after which the set of known-words is frozen; (2) if new texts arrive later, infer vectors for them from the frozen model, with the limitation that any all-new words will be ignored. That works.

Any other mode of use would be an ad hoc improvisation - whether it "works" for any purpose would depend on exactly what you're doing. I've not seen any writeups or documentation showing how vocabulary-expansion might work - and indeed per my comment above, for most (or maybe all) of the period that .build_vocab() on Doc2Vec has accepted an update=True option, followup training has had a regular crashing bug - which implies even when lucky runs don't crash, they might be doing the wrong thing, corrupting results.

@jexterliangsufe

I'm new to using Doc2Vec and have a question about this. If I can't train my model on the total data, and instead each time I input 10% of the data and build a completely new vocabulary for it, will that work?

The supported way of using it, derived from published work, is to (1) do initial vocabulary-discovery & training on as much relevant data as possible, after which the set of known-words is frozen; (2) if new texts arrive later, infer vectors for them from the frozen model, with the limitation that any all-new words will be ignored. That works.

Any other mode of use would be an ad hoc improvisation - whether it "works" for any purpose would depend on exactly what you're doing. I've not seen any writeups or documentation showing how vocabulary-expansion might work - and indeed per my comment above, for most (or maybe all) of the period that .build_vocab() on Doc2Vec has accepted an update=True option, followup training has had a regular crashing bug - which implies even when lucky runs don't crash, they might be doing the wrong thing, corrupting results.

Thank you very much! Actually I used TaggedLineDocument to solve my problem after I commented. But I still have a small question about the second point of what you said. If my goal is to compute the similarity of different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector? I have not looked at the code behind infer_vector and just assume it can't take a new doc's structure, word order, and so on into account. Maybe I need to spend more time on the source code? Thanks again!

@gojomo
Collaborator

gojomo commented Aug 4, 2021

But I still have a small question about the second point of what you said. If my goal is to compute the similarity of different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector?

Looking in model.docvecs[tag] can only give you a vector for a tag that was part of original training. For new texts, you'd need to use .infer_vector() - which analyzes new texts in exactly the same way as training-texts were analyzed, except with the whole model (except for the vector to be returned for the new text) frozen against changes. (You can also use .infer_vector() on training texts, again - and should generally get a vector close to the one left over for the same text from training. Which is better is something you should evaluate for yourself, in your training-setup and tasks.)
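
As an illustration, a small sketch (reusing the toy model and documents from the reproduction script at the top of this issue) of re-inferring a training text and checking how close it lands to the vectors learned during training:

# Re-infer a vector for a text that was part of training...
inferred = model.infer_vector(documents[0].words)

# ...and rank it against the trained doc-vectors; the text's own tag
# should usually appear near the top of the results.
print(model.dv.most_similar([inferred], topn=3))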

I have not looked at the code behind infer_vector and just assume it can't take a new doc's structure, word order, and so on into account. Maybe I need to spend more time on the source code? Thanks again!

It's all there if you need to look! But also, questions & discussions not about a known-bug or suggested-improvement are better pursued on the project discussion list: https://groups.google.com/g/gensim

@jexterliangsufe

But I still have a small question about the second point of what you said. If my goal is to compute the similarity of different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector?

Looking in model.docvecs[tag] can only give you a vector for a tag that was part of original training. For new texts, you'd need to use .infer_vector() - which analyzes new texts in exactly the same way as training-texts were analyzed, except with the whole model (except for the vector to be returned for the new text) frozen against changes. (You can also use .infer_vector() on training texts, again - and should generally get a vector close to the one left over for the same text from training. Which is better is something you should evaluate for yourself, in your training-setup and tasks.)

I have not looked at the code behind infer_vector and just assume it can't take a new doc's structure, word order, and so on into account. Maybe I need to spend more time on the source code? Thanks again!

It's all there if you need to look! But also, questions & discussions not about a known-bug or suggested-improvement are better pursued on the project discussion list: https://groups.google.com/g/gensim

Thank you very much!
