Doc2Vec: when we have string tags, build_vocab with update removes previous index #3162
Comments
I don't think build_vocab(..., update=True) has ever worked properly for Doc2Vec models.
So, it is a bug that people are tempted to try it, and that it doesn't do anything useful - but unless a new strong advocate/implementor emerges for this functionality (which hasn't happened in the 4.5 years #1019 has been open), it's as likely to be formally disabled (per the suggestion there in 2017) as to be fixed. Note that while expanding the model's set of known words or pre-trained doc-tags is the thorny possibility that's not been working, the simpler task of just calculating doc-vectors for new texts, within the known vocabulary, is possible via infer_vector().
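As a minimal sketch of that supported path, assuming an already trained and saved model (the file name and token list here are illustrative, not from this issue):

    from gensim.models.doc2vec import Doc2Vec

    # Load a previously trained model (path is an assumption for illustration).
    model = Doc2Vec.load("doc2vec.model")

    # Infer a vector for a brand-new text; only words already in the model's
    # vocabulary contribute, unknown words are silently ignored.
    new_vector = model.infer_vector(["search", "similar", "documents"])

    # The inferred vector can be compared against the trained doc-vectors.
    print(model.dv.most_similar([new_vector], topn=5))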
Thanks for your answer. I think that, for the moment, disabling the update flag for Doc2Vec models and fixing the documentation is a reasonable action. However, this functionality is important, and for my use case too. Currently, I have almost 10M documents in the index, and almost 1M documents will be added every year (new document batches are added every week). I need to add these docs to the index for searching similar documents across all documents in our database. Without updating the existing model, I would need to retrain the whole model every week or month, which is expensive for us and our infrastructure. I know about infer_vector and I use it, but for new documents I cannot use it to add the docs to the index - or I need to use another index, faiss or hnswlib for example.
Would you be able to quantify this? What are your expenses: CPU times, wallclock, $$$… I'm only asking because the training is pretty fast, so having hard numbers from a real-world use-case will help motivate any work here. 10M documents doesn't strike me as a particularly large corpus. That's on the order of the English Wikipedia, which trains in hours, IIRC.
Indeed, I believe people tend to spend a lot of effort trying to improvise/debug an incremental-update process, which then has never-quantified impacts on overall results quality. (Any time/compute savings you get from not re-presenting old texts carries with it a risk that the model's understanding of those old texts will be subtly diluted/erased by training only on a small number of different, newer examples.) That is often a premature optimization, when less effort could instead be spent on setting up a process for low-effort/low-cost automated reindexing on convenient intervals. For example, scheduling a 12hr from-scratch training to run overnight, or a 48hr from-scratch training to happen over a weekend, may involve just tens-of-dollars of compute costs and (after initial creation) essentially no marginal R&D effort, and run well for months/years. Also: if you are in fact using the Gensim big-sets-of-vectors, such as a KeyedVectors instance, for your similarity lookups, new vectors (for example, obtained via infer_vector()) can be added to such a set without retraining the underlying model.
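If that route is of interest, here is a small sketch of the idea, keeping a separate, growable KeyedVectors index next to a frozen Doc2Vec model (the file name, tag, and tokens are assumptions):

    from gensim.models import KeyedVectors
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("doc2vec.model")          # frozen, fully trained model

    # Start a standalone index seeded with the trained doc-vectors.
    index = KeyedVectors(vector_size=model.vector_size)
    index.add_vectors(model.dv.index_to_key, model.dv.vectors)

    # Later, as new documents arrive, infer and append their vectors
    # without touching the trained model itself.
    new_id, new_words = "doc_2021_0001", ["new", "incoming", "document", "text"]
    index.add_vector(new_id, model.infer_vector(new_words))

    print(index.most_similar(new_id, topn=5))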
Sorry for not replying to you earlier. Thanks for your replies and the explanation. Currently, our model trains for about 8 hours on 12 CPU cores, with 30 epochs and dim 100. I guess we can afford to do that every week. And I agree, a model fully trained from scratch will be better than an incrementally trained one... probably.
Thanks for the extra context! A few other things that often help speed such jobs, if you haven't tried them already: a larger min_count (which shrinks the vocabulary), a more aggressive sample value to downsample very frequent words, making sure workers matches your physical cores, and checking whether fewer epochs still give acceptable quality on your task.
Currently, we use a streaming corpus iterator over a one-document-per-line file, roughly like this:
from pathlib import Path
from typing import Iterable, Union

from gensim.models.doc2vec import TaggedDocument

PathType = Union[str, Path]  # assumed alias; defined elsewhere in the original code


def text_tagged_corpus(doc_corpus: PathType) -> Iterable[TaggedDocument]:
    # patch_pathlib() is a project-specific helper from the original codebase.
    with patch_pathlib():
        with Path(doc_corpus).open('r', encoding='utf-8') as fp:
            for line in fp:
                doc_id, *words = line.strip().split()  # first token = tag, rest = words
                yield TaggedDocument(words, tags=[doc_id])


class TextTaggedCorpus:
    def __init__(self, doc_corpus: PathType):
        self.doc_corpus = Path(doc_corpus)

    def __iter__(self):  # restartable, so Doc2Vec can stream it once per epoch
        return text_tagged_corpus(self.doc_corpus)
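For reference, such an iterable is typically handed straight to the Doc2Vec constructor; the parameter values below simply echo the numbers mentioned earlier in the thread (dim 100, 12 cores, 30 epochs), and the file name is an assumption:

    from gensim.models.doc2vec import Doc2Vec

    corpus = TextTaggedCorpus('corpus.txt')
    model = Doc2Vec(documents=corpus, vector_size=100, workers=12, epochs=30)
    model.save('doc2vec.model')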
Yes, I have tried playing with these parameters as well.
OK, you're already doing the key things. One last consideration: if your corpus is already one document per line with whitespace-delimited tokens, the corpus_file training mode can scale across many more cores than a Python iterable can, though it only supports plain int tags (each document's line number).
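A sketch of that corpus_file mode, under the assumption that the tag column is dropped and documents are identified by line number (file name is illustrative):

    from gensim.models.doc2vec import Doc2Vec

    # words.txt: one document per line, tokens separated by spaces, no tag column.
    model = Doc2Vec(corpus_file='words.txt', vector_size=100, workers=12, epochs=30)

    # In this mode each document's tag is simply its 0-based line number.
    print(model.dv[0][:5])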
I'm new to Doc2Vec and have a question about this. If I can't train my model on the total data at once, can I instead feed in 10% of the data each time and build a completely new vocabulary for it? Will that work?
The supported way of using it, derived from published work, is to (1) do initial vocabulary-discovery & training on as much relevant data as possible, after which the set of known-words is frozen; (2) if new texts arrive later, infer vectors for them from the frozen model, with the limitation that any all-new words will be ignored. That works. Any other mode of use would be an ad hoc improvisation - whether it "works" for any purpose would depend on exactly what you're doing. I've not seen any writeups or documentation showing how vocabulary-expansion might work - and indeed, per my comment above, for most (or maybe all) of the period that build_vocab(..., update=True) has nominally been accepted by Doc2Vec, it hasn't actually worked.
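A minimal sketch of that two-step pattern, showing that words outside the frozen vocabulary simply drop out at inference time (the toy corpus, tags, and parameter values are assumptions):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Step 1: discover the vocabulary and train on as much data as possible.
    train_docs = [
        TaggedDocument(["human", "computer", "interface"], ["d0"]),
        TaggedDocument(["graph", "of", "trees"], ["d1"]),
    ]
    model = Doc2Vec(train_docs, vector_size=20, min_count=1, epochs=40)

    # Step 2: new texts are only inferred; "blockchain" is not in the vocabulary,
    # so it is silently ignored and contributes nothing to the inferred vector.
    vec = model.infer_vector(["computer", "graph", "blockchain"])
    print(model.dv.most_similar([vec], topn=2))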
Thank you very much! Actually, I used TaggedLineDocument to solve my problem after I commented. But I still have a small question about the second point you made. If my goal is to compute similarity between different docs, will training on the new docs and using Doc2Vec.docvecs[tag] work better than using infer_vector? I haven't looked at the code behind infer_vector and just assumed it can't take a new doc's structure, word order, or anything else into account. Maybe I need to spend more time on the source code? Thanks again!
Looking in the source, infer_vector() reuses the same training routine as bulk training, just run against a single new text with the rest of the model frozen, so it takes the text's words (and, in PV-DM mode, their context windows) into account the same way training does.
It's all there if you need to look! But also, questions & discussions not about a known-bug or suggested-improvement are better pursued on the project discussion list: https://groups.google.com/g/gensim
Thank you very much!
Problem description
I'm trying to resume training my Doc2Vec model with string tags, but model.build_vocab(..., update=True) removes all the previous index entries from model.dv.
Steps/code/corpus to reproduce
A simple example to reproduce this:
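The snippet below is a minimal sketch of this kind of reproduction; the toy corpus, string tags, and parameter values are illustrative assumptions, not the reporter's exact code:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(["human", "interface", "computer"], ["doc_a"]),
        TaggedDocument(["survey", "user", "computer", "system"], ["doc_b"]),
    ]
    model = Doc2Vec(vector_size=10, min_count=1, epochs=5)
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    print(len(model.dv), model.dv.index_to_key)   # 2 ['doc_a', 'doc_b']

    new_docs = [TaggedDocument(["graph", "trees", "minors"], ["doc_c"])]
    model.build_vocab(new_docs, update=True)
    print(len(model.dv), model.dv.index_to_key)   # per this report, the earlier
                                                  # string tags are gone here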
The output shows an interesting behavior: the tag still seems to exist in the model after updating, but len and index_to_key no longer show it. At the same time, the equivalent code with int tags works correctly (it seems to me):
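Again as a hedged sketch rather than the reporter's exact code, the int-tag variant would look like this:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    int_docs = [
        TaggedDocument(["human", "interface", "computer"], [0]),
        TaggedDocument(["survey", "user", "computer", "system"], [1]),
    ]
    model = Doc2Vec(vector_size=10, min_count=1, epochs=5)
    model.build_vocab(int_docs)
    model.train(int_docs, total_examples=model.corpus_count, epochs=model.epochs)

    model.build_vocab([TaggedDocument(["graph", "trees", "minors"], [2])], update=True)
    print(len(model.dv), model.dv.index_to_key)   # reportedly keeps all three int tags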
Versions