
Adding new tags in doctag_vectors #3262

Open
raccoon-science opened this issue Oct 29, 2021 · 1 comment

Comments

@raccoon-science

Hello!

I am training a doc2vec model on a tagged docset.
I need to update it on new sets that contain new tags. Is there a way to update docvectors in gensim.doc2vec? How can I do it?

There is an old issue #1019 on the same topic, but it didn't help me as there were many changes in gensim. Maybe there is another way?

@gojomo
Collaborator

gojomo commented Oct 30, 2021

Expanding the set of known doctags has never been supported; the work allowing expansion of the Word2Vec vocabulary (via build_vocab(..., update=True)) was never tested/completed for Doc2Vec, with intermittent crashing bugs like #1019.
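For reference, the Word2Vec-side path mentioned above looks roughly like this – a minimal sketch with toy data, & (per the above) the equivalent path is not reliable for Doc2Vec:

```python
# Vocabulary expansion as it exists for Word2Vec: build_vocab(update=True)
# adds the new batch's words, then train() continues training on the batch.
from gensim.models import Word2Vec

sentences = [["first", "batch", "of", "tokens"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

more_sentences = [["later", "batch", "with", "new", "tokens"]]
model.build_vocab(more_sentences, update=True)  # expand the known vocabulary
model.train(more_sentences, total_examples=len(more_sentences),
            epochs=model.epochs)
```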

Note that even if supported, such incremental expansions of a model are fraught with difficult tradeoffs. To the extent a new batch contains a different mix of words, word-senses, & topics than earlier data – & if it didn't, why bother with more training? – it will only "drag" parts of the model towards new weights, leaving others untouched, which risks degrading its overall usefulness unless you're carefully considering the mixes/balances between older & newer training data, & monitoring for ill-effects. (You can't assume incremental batches of new training are always improving things.)

The surest way to ensure balance between all training data is to re-train everything in one session. That is, when new data arrives, add it to the full corpus, & train again on the full corpus, & use the later model's values instead of any earlier model (with which the later model's coordinates may not be compatible).
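For concreteness, a minimal sketch of that full-retrain flow (toy, illustrative corpus contents):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

old_corpus = [TaggedDocument(["original", "tokens"], ["doc_0"])]
new_batch = [TaggedDocument(["newly", "arrived", "tokens"], ["doc_1"])]

# Train a fresh model on everything; its vectors replace the old model's.
full_corpus = old_corpus + new_batch
model = Doc2Vec(full_corpus, vector_size=50, min_count=1, epochs=20)
# Don't mix these vectors with the earlier model's - the two models'
# coordinate spaces aren't compatible.
```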

But if you really do need to make just smaller updates, other options could include:

  • Using infer_vector() to obtain vectors for new docs from the frozen vocabulary/weights of the prior model. No new words would be learned, nor tags inserted into the model's set of known tag-vectors, but you could collect these new doc-vectors, and potentially also merge them with the original set of tag-vectors into some new, outside-the-model combined structure for searching them all (a sketch follows this list).
  • Pre-reserving some tags for expected later batch training. E.g.: if your initial training contains 100,000 docs, & you know another 50,000 docs will appear later, you could include another 50,000 dummy docs with pre-reserved tags in your initial training - their vectors would be random junk at first. But calling train() later with these pre-reserved tags would improve those vectors (also sketched below), albeit with the same relative-balance issues I mentioned above. (Without interleaved re-presentation of the original 100,000 docs, the model might get arbitrarily well-customized to the new docs, and tag-vectors/words would drift further out of comparability with the earlier docs.)
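A minimal sketch of the first option - all model/corpus contents here are illustrative toy data, & KeyedVectors is used as the "outside-the-model" combined structure:

```python
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in for your already-trained model.
train_docs = [TaggedDocument(["some", "training", "tokens"], ["old_doc_0"]),
              TaggedDocument(["more", "training", "tokens"], ["old_doc_1"])]
model = Doc2Vec(train_docs, vector_size=50, min_count=1, epochs=20)

# Infer vectors for new docs using the frozen vocabulary/weights;
# nothing is added to the model itself.
new_docs = {"new_doc_0": ["some", "new", "tokens"]}
inferred = {tag: model.infer_vector(words) for tag, words in new_docs.items()}

# Merge the original tag-vectors & the new inferred vectors into one
# outside-the-model KeyedVectors index, searchable as a whole.
combined = KeyedVectors(model.vector_size)
combined.add_vectors(model.dv.index_to_key, model.dv.vectors)
combined.add_vectors(list(inferred), list(inferred.values()))

print(combined.most_similar("new_doc_0", topn=2))
```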
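And a sketch of the pre-reserved-tags option (again toy data, with 3 reserved tags standing in for the 50,000):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

initial_texts = [["some", "initial", "tokens"], ["more", "initial", "tokens"]]
initial_docs = [TaggedDocument(words=t, tags=[f"doc_{i}"])
                for i, t in enumerate(initial_texts)]

# Dummy placeholders whose tag-vectors start as random junk.
reserved_docs = [TaggedDocument(words=["placeholder"], tags=[f"reserved_{i}"])
                 for i in range(3)]

model = Doc2Vec(initial_docs + reserved_docs, vector_size=50,
                min_count=1, epochs=20)

# Later, train the real docs under the reserved tags. Note the vocabulary
# stays frozen: words unseen in the initial training are simply ignored.
later_texts = [["tokens", "arriving", "later"]]
later_docs = [TaggedDocument(words=t, tags=[f"reserved_{i}"])
              for i, t in enumerate(later_texts)]
model.train(later_docs, total_examples=len(later_docs), epochs=model.epochs)
```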

(There might be other options, depending on the details of how you're using the model/doc-vectors for downstream.)
