
Fix pure python implementation of doc2vec (w/online-learning). Partial fix #1019 #1856

Closed
wants to merge 1 commit into from

Conversation


mino98 commented Jan 24, 2018

As per #1019, this fixes the slow (pure Python) implementation of Doc2Vec and enables online learning on new documents.

Tested on this. Please make sure that you are using the "slow version" before testing (to do so, I delete gensim/models/doc2vec_inner.c before building/installing; maybe there's a simpler method 😄).

@menshikh-iv (Contributor)

#1777 should be merged first. @mino98, sorry for the wait.


mino98 commented Jan 26, 2018

Of course, no problem @menshikh-iv, but here I read:

> Drops pure Python implementations and FastText wrapper.

which would make this PR useless in the first place. By the way, are "online learning" and vocabulary expansion taken into consideration in the new *2vec architecture?

@menshikh-iv (Contributor)

@mino98 About dropping pure-Python support: yes, we'll do this, but slightly later (after #1777). Anyway, your code will only live for a short time, sorry.

About the new architecture: online learning, yes; vocab expansion, possibly.


mino98 commented Jan 26, 2018

> About the new architecture: online learning, yes; vocab expansion, possibly.

Great, but I'd like to stress that online learning without vocab expansion is useless for many (all?) practical purposes.

I don't particularly care for my code, as long as post-#1777 we implement these two features.

@menshikh-iv (Contributor)

> Great, but I'd like to stress that online learning without vocab expansion is useless for many (all?) practical purposes.

Well, I don't agree with you. This is needed for training on really large corpora (when the training process takes several days/weeks).

Vocab expansion is really hard to implement well IMO.


mino98 commented Jan 26, 2018

I think we mean different things by online learning: for me it means being able to update the model on a completely new corpus to improve/fine-tune it; for you, it means checkpointing/snapshotting the model during a long training run so that you can save and restart it.

Your approach requires running build_vocab() on the whole huge corpus before the multi-day training, right?

Thanks.

@menshikh-iv (Contributor)

@mino98 Not quite: "online" for me means that I don't need the full training data at once (I can pass it sentence by sentence). Also, I can save the model, load it, and continue training with any data.

You're talking about changing the dictionary at training time, am I right (i.e. you pass fresh data and add words the model has never seen)?
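To make the "streaming" reading of online learning concrete, here is a minimal self-contained Python sketch: a toy stand-in model (none of this is gensim's actual implementation; all names and internals are illustrative). The vocabulary is fixed after build_vocab(), but training consumes sentences one at a time and can be checkpointed and resumed.

```python
import os
import pickle
import random
import tempfile

class ToyModel:
    """Toy stand-in for a *2vec model, illustrating 'streaming' online
    learning: sentences are consumed one at a time, and training can be
    checkpointed and resumed. Not gensim code; internals are made up."""

    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vocab = {}          # word -> vector; fixed after build_vocab()
        self.seen_sentences = 0

    def build_vocab(self, corpus):
        # One full pass over the corpus is still required up front.
        for sentence in corpus:
            for word in sentence:
                if word not in self.vocab:
                    self.vocab[word] = [self.rng.uniform(-1, 1)
                                        for _ in range(self.dim)]

    def train(self, corpus):
        # Sentences are processed one by one; the whole corpus never
        # needs to be in memory at once.
        for sentence in corpus:
            for word in sentence:
                if word in self.vocab:   # out-of-vocab words are skipped
                    vec = self.vocab[word]
                    for i in range(self.dim):
                        vec[i] *= 0.99   # placeholder "update" step
            self.seen_sentences += 1

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

# Streaming workflow: train, checkpoint, resume with fresh data.
model = ToyModel()
model.build_vocab([["hello", "world"]])
model.train([["hello", "world"]])
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
model.save(path)
resumed = ToyModel.load(path)
resumed.train([["hello", "again"]])  # "again" is out-of-vocab: skipped
```

Note that in this sense of "online", training can be continued indefinitely, but words never seen during build_vocab() are silently dropped; that is exactly the limitation the vocabulary-expansion discussion below is about.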


mino98 commented Jan 26, 2018

> You're talking about changing the dictionary at training time, am I right (i.e. you pass fresh data and add words the model has never seen)?

Correct: I'm referring to training with documents that include words not present in the set of previously seen documents.

In real world experiments, this happens all the time. Imagine for example:

  1. start with a new model (i.e., all initial weights are random)
  2. build_vocab() + train() on a corpus of 100 books. Once completed, save() the model to storage.
  3. some time later, you get 20 new books that you want to include in the model. So you load() the model from disk, do build_vocab(update=True) on the new corpus, then train() on it and save() it back to disk.

Once in a while you need to repeat step 3. Say, for example, every week. Or every 12 hours. Or every time you get a decent amount of fresh user content... You clearly don't want (or cannot) re-train on the whole corpus (i.e., 100+20+... books) from zero every time.

Those "20 new books" will for sure contain new words not present in the initial 100 books, which will force you to extend the vocabulary, right?

(ps: thanks for this discussion. Maybe I'm misunderstanding how you do online learning in gensim?)
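The vocabulary-expansion step (3) above can be sketched with a self-contained toy. The build_vocab(corpus, update=True) call mirrors the shape of gensim's Doc2Vec API, but everything below is invented for illustration, not gensim's implementation:

```python
import random

class ToyVocabModel:
    """Toy illustration of vocabulary expansion. The build_vocab(...,
    update=True) signature mimics gensim's API shape; internals are
    made up for illustration."""

    def __init__(self, dim=4, seed=42):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vectors = {}  # word -> vector

    def build_vocab(self, corpus, update=False):
        if not update:
            self.vectors = {}  # fresh model: start from scratch
        for document in corpus:
            for word in document:
                if word not in self.vectors:
                    # New words get freshly initialised vectors; with
                    # update=True, vectors of known words are untouched.
                    self.vectors[word] = [self.rng.uniform(-1, 1)
                                          for _ in range(self.dim)]

# Initial corpus (the "100 books"):
model = ToyVocabModel()
model.build_vocab([["moby", "dick"], ["war", "and", "peace"]])
old_moby = list(model.vectors["moby"])

# Later, the "20 new books" arrive, containing unseen words:
model.build_vocab([["moby", "dick", "sequel"]], update=True)
```

The design point is that expansion must add rows for new words while leaving already-trained vectors intact, which is why it is harder to do well in an optimized implementation than plain resumable training.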

@menshikh-iv menshikh-iv changed the title Fixes the "slow" pure Python implementation of doc2vec (w/online-learning) Fix pure python implementation of doc2vec (w/online-learning). Partial fix 1019 Feb 1, 2018
@menshikh-iv menshikh-iv changed the title Fix pure python implementation of doc2vec (w/online-learning). Partial fix 1019 Fix pure python implementation of doc2vec (w/online-learning). Partial fix #1019 Feb 1, 2018
@menshikh-iv (Contributor)

@mino98 #1777 merged successfully, please resolve the merge conflict.

@menshikh-iv (Contributor)

Ping @mino98, are you planning to finish this PR?


mino98 commented Feb 14, 2018

Sorry for the delay @menshikh-iv; realistically, I won't have time for this for a while.

By the way, I assumed that #1777 was going to drop the pure-Python implementations completely, so wouldn't this PR become useless in the first place?

@menshikh-iv (Contributor)

@mino98 No, we still support pure Python (until the next major release), but it's completely up to you. If you can't finish this, please close the PR.

@mino98 mino98 closed this Feb 14, 2018
@menshikh-iv (Contributor)

Anyway, thanks for your good work @mino98 👍, and sorry for this situation.


mino98 commented Feb 14, 2018

Don't worry @menshikh-iv and thanks, Gensim is a great project.

However, it's not worth investing more effort to untangle this PR: the Python-only implementation is too slow for most practical purposes and is going to be dropped at the next major release anyway, so...

@menshikh-iv (Contributor)

@mino98 this PR can be used as the base for a Cython version 👍
