Fix pure python implementation of doc2vec (w/online-learning). Partial fix #1019 #1856
Conversation
Of course, no problem @menshikh-iv, but here I read:
which will make this PR useless in the first place. By the way, are "online learning" and vocabulary expansion taken into consideration in the new *2vec architecture?
Great, but I'd like to stress that online learning without vocab expansion is useless for many (all?) practical purposes. I don't particularly care about my code, as long as, post-#1777, we implement these two features.
So, I don't agree with you. This is needed for training on really large corpora (when the training process takes several days/weeks). Vocab expansion is really hard to implement well, IMO.
I think we mean different things by "online learning": for me it means "being able to update the model on a completely new corpus, to improve/fine-tune it"; for you it means "checkpointing/snapshotting the model during a long training run so that you can save/restart it". Your approach requires doing … Thanks.
@mino98 not quite. "Online" for me means that I don't need the full training data at one moment (I can pass it sentence by sentence). Also, I can save the model, load it, and continue training with any data. You are talking about changing the dictionary at training time, am I right (i.e. you pass fresh data and add words that the model has never seen during training)?
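For context, a minimal sketch of this save/load/continue-training workflow under the current gensim Doc2Vec API (the tiny corpora and file name below are hypothetical placeholders, not from this PR):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical placeholder corpora.
first_batch = [TaggedDocument(words=["human", "interface", "computer"], tags=["doc0"])]
later_batch = [TaggedDocument(words=["survey", "of", "user", "interface"], tags=["doc1"])]

# Initial training.
model = Doc2Vec(vector_size=50, min_count=1, epochs=10)
model.build_vocab(first_batch)
model.train(first_batch, total_examples=model.corpus_count, epochs=model.epochs)

# Checkpoint to disk, then resume later (possibly in another process).
model.save("d2v.model")
model = Doc2Vec.load("d2v.model")

# Continue training batch by batch; in this style of online learning,
# words not already in the vocabulary are simply ignored.
model.train(later_batch, total_examples=len(later_batch), epochs=model.epochs)
```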
Correct: I'm referring to training with documents that include words not present in the set of previously seen documents. In real-world experiments this happens all the time. Imagine, for example:

1. You train a model on an initial corpus of 100 books.
2. Over time you collect 20 new books of fresh content.
3. You update the existing model with those new documents.
Once in a while you need to repeat step 3. Say, for example, every week. Or every 12 hours. Or every time you get a decent amount of fresh user content... You clearly don't want to (or cannot) re-train on the whole corpus (i.e., 100+20+... books) from zero every time. Those "20 new books" will almost certainly contain new words not present in the initial 100 books, which will force you to extend the vocabulary, right? (PS: thanks for this discussion. Maybe I'm misunderstanding how you do online learning in gensim?)
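A minimal sketch of what step 3 could look like: gensim exposes vocabulary expansion via `build_vocab(..., update=True)`, which is the code path this PR touches on the pure-Python side (the corpora below are hypothetical placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical placeholder corpora standing in for the "100 books" and
# the "20 new books" from the example above.
initial_books = [TaggedDocument(words=["an", "old", "story"], tags=["book_0"])]
new_books = [TaggedDocument(words=["brand", "new", "vocabulary"], tags=["book_100"])]

# Steps 1-2: train on the initial corpus.
model = Doc2Vec(vector_size=50, min_count=1, epochs=10)
model.build_vocab(initial_books)
model.train(initial_books, total_examples=model.corpus_count, epochs=model.epochs)

# Step 3: fold the new books in without retraining from scratch.
# update=True extends the existing vocabulary with previously unseen words.
model.build_vocab(new_books, update=True)
model.train(new_books, total_examples=len(new_books), epochs=model.epochs)
```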
ping @mino98, are you planning to finish this PR?
Sorry for the delay @menshikh-iv; realistically I won't have time for this for a while. By the way, I assumed that #1777 was going to drop the pure-Python implementations completely, so this PR would become useless in the first place?
@mino98 no, we still support pure Python (until the next major release), but it's completely up to you. If you can't finish this, please close the PR.
Anyway, thanks for your good work @mino98 👍, sorry for this situation.
Don't worry @menshikh-iv, and thanks, Gensim is a great project. However, it's not worth investing more effort to untangle this PR: the Python-only implementation is too slow for some practical purposes and is going to be dropped at the next major release anyway, so...
@mino98 this PR can be used as the base for the Cython version 👍
As per #1019, this fixes the slow (pure-Python) implementation of Doc2Vec and enables online learning on new documents.
Tested on this. Please make sure that you are using the "slow version" before testing (to do so, I delete `gensim/models/doc2vec_inner.c` before building/installing; maybe there's a simpler method 😄).
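One quick check, assuming the `FAST_VERSION` convention used by gensim's *2vec modules, where -1 indicates the optimized Cython extension failed to import:

```python
# If FAST_VERSION is -1, the doc2vec_inner extension did not load and the
# pure-Python ("slow") training code paths are in use.
from gensim.models import doc2vec

assert doc2vec.FAST_VERSION == -1, "Cython path still active; slow version not in use"
```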