Doc2Vec: when we have string tags, build_vocab with update removes previous index #3162

Closed
espdev opened this issue Jun 4, 2021 · 13 comments

Comments

@espdev

espdev commented Jun 4, 2021

Problem description

I'm trying to resume training my Doc2Vec model with string tags, but model.build_vocab removes all previously existing keys from model.dv.

Steps/code/corpus to reproduce

A simple example to reproduce this:

import string

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [tag]) for tag, doc in zip(string.ascii_lowercase, common_texts)]
documents1 = documents[:6]
documents2 = documents[6:]

model = Doc2Vec(vector_size=5, window=2, min_count=1)

model.build_vocab(documents1)
model.train(documents1, total_examples=len(documents1), epochs=5)

model.save('model')
model = Doc2Vec.load('model')

print('Vector count after train:', len(model.dv))
print('Keys:', model.dv.index_to_key)

model.build_vocab(documents2, update=True)
model.train(documents2, total_examples=model.corpus_count, epochs=model.epochs)

print('Vector count after update:', len(model.dv))
print('Keys:', model.dv.index_to_key)

model.save('model')
model = Doc2Vec.load('model')

print('Vector count after load:', len(model.dv))
print('Keys:', model.dv.index_to_key)

Output:

Vector count after train: 6
Keys: ['a', 'b', 'c', 'd', 'e', 'f']
Vector count after update: 3
Keys: ['g', 'h', 'i']
Vector count after load: 3
Keys: ['g', 'h', 'i']

And we have an interesting behavior:

print('b' in model.dv)
# True
print(model.dv['b'])
# [ 0.00524729 -0.19762747 -0.10339681 -0.19433555  0.04022206]

The tag still seems to exist in the model after updating, but len() and index_to_key do not show it.

At the same time, the same code with int tags appears to work correctly:

documents = [TaggedDocument(doc, [tag]) for tag, doc in enumerate(common_texts)]
documents1 = documents[:6]
documents2 = documents[6:]
...
Vector count after train: 6
Keys: [0, 1, 2, 3, 4, 5]
Vector count after update: 9
Keys: [0, 1, 2, 3, 4, 5, 6, 7, 8]
Vector count after load: 9
Keys: [0, 1, 2, 3, 4, 5, 6, 7, 8]

Versions

Windows-10-10.0.19041-SP0
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]
Bits 64
NumPy 1.20.3
SciPy 1.6.1
gensim 4.0.1
FAST_VERSION 0
@gojomo
Collaborator

gojomo commented Jun 4, 2021

I don't think build_vocab(... update=True) has ever reliably worked for Doc2Vec models.

So, it is a bug that people are tempted to try it, and that it doesn't do anything useful - but unless a new strong advocate/implementor emerges for this functionality (which hasn't happened in the 4.5 years #1019 has been open), it's as likely to be formally disabled (per suggestion here in 2017) as to be fixed.

Note that while expanding the model's set of known words or pre-trained doc-tags is the thorny possibility that's not been working, the simpler task of just calculating doc-vectors for new texts, within the known vocabulary, is possible via infer_vector().
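
For illustration, a minimal sketch of that infer_vector() path, reusing a trained Doc2Vec model like the one from the reproduction script above (the example words are arbitrary tokens from common_texts):

# Infer a vector for a new, unseen text with the frozen model;
# words outside the trained vocabulary are silently ignored.
new_vector = model.infer_vector(['human', 'computer', 'interface'])

# The inferred vector can then be compared against the trained doc-vectors.
print(model.dv.most_similar([new_vector], topn=3))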

@espdev
Author

espdev commented Jun 5, 2021

@gojomo,

Thanks for your answer. I think that, for now, disabling the update flag for Doc2Vec models and fixing the documentation would be a reasonable action.

However, this functionality is important, for my use case as well. Currently, I have almost 10M documents in the index, and almost 1M documents will be added every year (new document batches arrive every week). I need to add these docs to the index so I can search for similar documents across all documents in our database. Without updating the existing model, I need to retrain the whole model every week or month. That is expensive for us and our infrastructure.

I know about infer_vector and I use it. But I cannot use it to add new documents to the index; for that I would need a separate index, for example faiss or hnswlib.

@piskvorky
Owner

piskvorky commented Jun 5, 2021

I have almost 10M documents in the index and every year almost 1M documents will be added
It is expensive for us and our infrastructure.

Would you be able to quantify this? What are your expenses: CPU times, wallclock, $$$…

I'm only asking because the training is pretty fast, so having hard numbers from a real-world use-case will help motivate any work here. 10M documents doesn't strike me as a particularly large corpus. That's on the order of the English Wikipedia, which trains in hours, IIRC.

@gojomo
Collaborator

gojomo commented Jun 5, 2021

I'm only asking because the training is pretty fast, so having hard numbers from a real-world use-case will help motivate any work here. 10M documents doesn't strike me as a particularly large corpus. That's on the order of the English Wikipedia, which trains in hours, IIRC.

Indeed, I believe people tend to spend a lot of effort trying to improvise/debug an incremental-update process, which then has never-quantified impacts on overall results quality. (Any time/compute savings you get from not re-presenting old texts includes with it a risk that the model's understanding of those old texts will be subtly diluted/erased by training only on a tiny number of different newer examples.)

That is often a premature optimization, when less effort could instead be spent on setting up a process for low-effort/low-cost automated reindexing on convenient intervals. For example, scheduling a 12hr from-scratch training to run overnight, or a 48hr from-scratch training to happen over a weekend, may involve just tens of dollars of compute costs and (after initial creation) essentially no marginal R&D effort, and can run well for months/years.

Also: if you are in fact using Gensim's big sets of vectors – such as a Doc2Vec model's .dv set of doc-tag-named vectors – as your actual search/retrieval backend:

  • you can either split such searches over multiple groups of vectors (then merge the results), or (with a little effort) merge all the candidates into one large set - so you don't need build_vocab(..., update=True) style re-training of a model just to add new inferred vectors into the candidate set. (It'd only be needed if the model must learn new words, or adjust the impact of older words, based on newer data.) See for example the add_vectors() method of KeyedVectors (of which the Doc2Vec.dv object is an example), and the sketch after this list - but keep in mind each add requires a re-allocation equal to the old-plus-new size, so doing it in large batches, with plenty of free RAM, is better than lots of one-at-a-time additions.
  • you'll always be paying the cost of a full-scan of all candidate matches, then ranking all results, in most_similar() - something that will hit limits given the inherent delays accessing large amounts of RAM, or when the dataset outgrows one system's RAM, but can then be parallelized/sharded across multiple machines if needed
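
A rough sketch of that add-new-inferred-vectors path, assuming gensim 4.x (where model.dv is a KeyedVectors); the new_docs tags and words here are hypothetical:

import numpy as np

# Hypothetical new documents (tag, words) that arrived after training.
new_docs = [
    ('new_doc_1', ['human', 'computer', 'interface']),
    ('new_doc_2', ['graph', 'trees', 'survey']),
]

# Infer a vector for each new text from the frozen model...
new_keys = [tag for tag, _ in new_docs]
new_vectors = np.vstack([model.infer_vector(words) for _, words in new_docs])

# ...and add them to the existing candidate set in one batch
# (each call re-allocates an old-plus-new array, so batch rather than add one-by-one).
model.dv.add_vectors(new_keys, new_vectors)

# The merged set can then be searched as usual.
print(model.dv.most_similar('new_doc_1', topn=3))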

@espdev
Author

espdev commented Jun 10, 2021

@piskvorky @gojomo

Sorry for not replying to you earlier. Thanks for your replies and the explanation.
You are probably right; premature optimization is mostly not needed.

Currently, our model trains for about 8 hours on 12 CPU cores, with 30 epochs and dim 100. I guess we can afford to do it every week. And I agree, a model fully trained from scratch will be better than an incrementally trained one... probably.

@gojomo
Collaborator

gojomo commented Jun 11, 2021

Thanks for the extra context! A few other things that often help speed such jobs, if you haven't tried these already:

  • if using an iterable corpus, ensure your iterable isn't repeating expensive steps (like regex-based preprocessing) on each iteration - and even if you have 12 cores, test fewer than 12 workers (the fastest training throughput in that mode often uses fewer threads)
  • or alternatively, use corpus_file mode (if you can live with auto-serial-number single tags per doc), as it will use all cores more effectively - see the sketch after this list
  • use a more-aggressive (larger) min_count or more-aggressive (smaller) sample - each shrinks effective corpus size
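
A sketch of the corpus_file route, with illustrative settings and a hypothetical file path; in this mode each line of the plain-text file is one pre-tokenized document, and its line number becomes its single tag:

from gensim.models.doc2vec import Doc2Vec

# corpus.txt: one pre-tokenized document per line, words separated by whitespace.
# Each document gets its line number as its only tag, so string IDs would have
# to be mapped to line numbers separately.
model = Doc2Vec(
    corpus_file='corpus.txt',
    vector_size=100,
    min_count=5,
    epochs=30,
    workers=12,  # corpus_file mode scales across many cores better than an iterable corpus
)
model.save('doc2vec_corpus_file.model')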

@espdev
Author

espdev commented Jun 11, 2021

@gojomo

Currently, we use an iterable corpus because we use string identifiers for the documents, but the iterator is just a generator with minimal overhead that reads a prepared/normalized TXT (or txt.gz) file, something like this:

ID1 word11 word12 word1N
ID2 word21 word22 word2N
...
from pathlib import Path
from typing import Iterable

from gensim.models.doc2vec import TaggedDocument

# PathType and patch_pathlib are project-specific helpers (not shown here).

def text_tagged_corpus(doc_corpus: PathType) -> Iterable[TaggedDocument]:
    with patch_pathlib():
        with Path(doc_corpus).open('r', encoding='utf-8') as fp:
            for line in fp:
                # The first whitespace-separated token is the document ID (tag),
                # the rest are the pre-tokenized words.
                doc_id, *words = line.strip().split()
                yield TaggedDocument(words, tags=[doc_id])


class TextTaggedCorpus:
    def __init__(self, doc_corpus: PathType):
        self.doc_corpus = Path(doc_corpus)

    def __iter__(self):
        return text_tagged_corpus(self.doc_corpus)

use a more-aggressive (larger) min_count or more-aggressive (smaller) sample - each shrinks effective corpus size

Yes, I am also trying to tune these parameters.

@gojomo
Collaborator

gojomo commented Jun 11, 2021

OK, you're already doing the key things. One last consideration: if your doc_corpus is a path to anything that reads slowly - a remote volume, a spinning disk, compression that decompresses slowly - then adjusting that has a chance of a noticeable speedup. (For example, moving the corpus to a local SSD, testing against an uncompressed or alternatively-compressed file, or even bringing the full corpus into RAM if the corpus & machine RAM allow it.)
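
For example, a minimal sketch of the bring-it-into-RAM option, reusing the TextTaggedCorpus class from the earlier comment (the path and training settings are hypothetical):

from gensim.models.doc2vec import Doc2Vec

# Materialize the corpus once so every epoch re-reads RAM instead of disk/gzip.
documents_in_ram = list(TextTaggedCorpus('corpus.txt.gz'))

model = Doc2Vec(vector_size=100, min_count=5, workers=10, epochs=30)
model.build_vocab(documents_in_ram)
model.train(documents_in_ram, total_examples=model.corpus_count, epochs=model.epochs)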

@jexterliangsufe

I don't think build_vocab(... update=True) has ever reliably worked for Doc2Vec models.

So, it is a bug that people are tempted to try it, and that it doesn't do anything useful - but unless a new strong advocate/implementor emerges for this functionality (which hasn't happened in the 4.5 years #1019 has been open), it's as likely to be formally disabled (per suggestion here in 2017) as to be fixed.

Note that while expanding the model's set of known words or pre-trained doc-tags is the thorny possibility that's not been working, the simpler task of just calculating doc-vectors for new texts, within the known vocabulary, is possible via infer_vector().

I'm new to using Doc2Vec and have a question about this. If I can't train my model on the total data, and instead each time I input 10% of the data and build a completely new vocabulary for it, will that work?

@gojomo
Collaborator

gojomo commented Aug 4, 2021

I'm new to using Doc2Vec and have a question about this. If I can't train my model on the total data, and instead each time I input 10% of the data and build a completely new vocabulary for it, will that work?

The supported way of using it, derived from published work, is to (1) do initial vocabulary-discovery & training on as much relevant data as possible, after which the set of known-words is frozen; (2) if new texts arrive later, infer vectors for them from the frozen model, with the limitation that any all-new words will be ignored. That works.

Any other mode of use would be an ad hoc improvisation - whether it "works" for any purpose would depend on exactly what you're doing. I've not seen any writeups or documentation showing how vocabulary-expansion might work - and indeed per my comment above, for most (or maybe all) of the period that .build_vocab() on Doc2Vec has accepted an update=True option, followup training has had a regular crashing bug - which implies even when lucky runs don't crash, they might be doing the wrong thing, corrupting results.

@jexterliangsufe

I'm new to using Doc2Vec and have a question about this. If I can't train my model on the total data, and instead each time I input 10% of the data and build a completely new vocabulary for it, will that work?

The supported way of using it, derived from published work, is to (1) do initial vocabulary-discovery & training on as much relevant data as possible, after which the set of known-words is frozen; (2) if new texts arrive later, infer vectors for them from the frozen model, with the limitation that any all-new words will be ignored. That works.

Any other mode of use would be an ad hoc improvisation - whether it "works" for any purpose would depend on exactly what you're doing. I've not seen any writeups or documentation showing how vocabulary-expansion might work - and indeed per my comment above, for most (or maybe all) of the period that .build_vocab() on Doc2Vec has accepted an update=True option, followup training has had a regular crashing bug - which implies even when lucky runs don't crash, they might be doing the wrong thing, corrupting results.

Thank you very much! Actually I used TaggedLineDocument to solve my problem after I commented. But I still have a small question about the second point of what you said. If my goal is to compute the similarity of different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector? I have not looked at the code behind infer_vector and just assume it can't take a new doc's structure, word order, and so on into account. Maybe I need to spend more time on the source code? Thanks again!

@gojomo
Collaborator

gojomo commented Aug 4, 2021

But I still have a small question about the second point of what you said. If my goal is to compute the similarity of different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector?

Looking in model.docvecs[tag] can only give you a vector for a tag that was part of original training. For new texts, you'd need to use .infer_vector() - which analyzes new texts in exactly the same way as training-texts were analyzed, except with the whole model (except for the vector to be returned for the new text) frozen against changes. (You can also use .infer_vector() on training texts, again - and should generally get a vector close to the one left over for the same text from training. Which is better is something you should evaluate for yourself, in your training-setup and tasks.)
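
As an illustration, a small sketch (reusing the toy model and documents from the reproduction script at the top of this issue) of re-inferring a training text and checking how close it lands to the vectors learned during training:

# Re-infer a vector for a text that was part of training...
inferred = model.infer_vector(documents[0].words)

# ...and rank it against the trained doc-vectors; the text's own tag
# should usually appear near the top of the results.
print(model.dv.most_similar([inferred], topn=3))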

I have not looked at the code behind infer_vector and just assume it can't take a new doc's structure, word order, and so on into account. Maybe I need to spend more time on the source code? Thanks again!

It's all there if you need to look! But also, questions & discussions not about a known-bug or suggested-improvement are better pursued on the project discussion list: https://groups.google.com/g/gensim

@jexterliangsufe

But I still have a small question about the second point of what you said. If my goal is to compute the similarity of different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector?

Looking in model.docvecs[tag] can only give you a vector for a tag that was part of original training. For new texts, you'd need to use .infer_vector() - which analyzes new texts in exactly the same way as training-texts were analyzed, except with the whole model (except for the vector to be returned for the new text) frozen against changes. (You can also use .infer_vector() on training texts, again - and should generally get a vector close to the one left over for the same text from training. Which is better is something you should evaluate for yourself, in your training-setup and tasks.)

I have not looked at the code behind infer_vector and just assume it can't take a new doc's structure, word order, and so on into account. Maybe I need to spend more time on the source code? Thanks again!

It's all there if you need to look! But also, questions & discussions not about a known-bug or suggested-improvement are better pursued on the project discussion list: https://groups.google.com/g/gensim

Thank you very much!
