
Fix pure python implementation of doc2vec (w/online-learning). Partial fix #1019 #1856

Closed
wants to merge 1 commit into from

Conversation


mino98 commented Jan 24, 2018

As per #1019, this fixes the slow (pure Python) implementation of Doc2Vec and enables online learning on new documents.

Tested on this. Please make sure that you are using the "slow version" before testing (to do so, I delete gensim/models/doc2vec_inner.c before building/installing; maybe there's a simpler method 😄).

@menshikh-iv (Contributor)

#1777 should be merged first. @mino98, sorry for the wait.


mino98 commented Jan 26, 2018

Of course, no problem @menshikh-iv, but here I read:

> Drops pure Python implementations and FastText wrapper.

which would make this PR useless in the first place. By the way, are "online learning" and vocabulary expansion taken into consideration in the new *2vec architecture?

@menshikh-iv (Contributor)

@mino98 About dropping pure-Python support: yes, we'll do this, but slightly later (after #1777). Anyway, your code will only live for a short time, sorry.

About the new architecture: online learning, yes; vocab expansion, possibly.


mino98 commented Jan 26, 2018

> About the new architecture: online learning, yes; vocab expansion, possibly.

Great, but I'd like to stress that online learning without vocab expansion is useless for many (all?) practical purposes.

I don't particularly care for my code, as long as post-#1777 we implement these two features.

@menshikh-iv (Contributor)

> Great, but I'd like to stress that online learning without vocab expansion is useless for many (all?) practical purposes.

Well, I don't agree with you. This is needed for training on really large corpora (when the training process takes several days/weeks).

Vocab expansion is really hard to implement well IMO.


mino98 commented Jan 26, 2018

I think we mean different things by online learning: for me it means being able to update the model on a completely new corpus to improve/fine-tune it; for you, it means checkpointing/snapshotting the model during a long training run so that you can save and restart it.

Your approach requires running build_vocab() on the whole huge corpus before the multi-day training, right?

Thanks.

@menshikh-iv (Contributor)

@mino98 Not quite: "online" for me means that I don't need the full training data at once (I can pass it sentence by sentence). Also, I can save the model, load it, and continue training with any data.

You're talking about changing the dictionary at training time, am I right (i.e. you pass fresh data and add words the model has never seen)?
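To make the "streaming" reading of online learning concrete, here is a minimal self-contained Python sketch: a toy stand-in model (none of this is gensim's actual implementation; all names and internals are illustrative). The vocabulary is fixed after build_vocab(), but training consumes sentences one at a time and can be checkpointed and resumed.

```python
import os
import pickle
import random
import tempfile

class ToyModel:
    """Toy stand-in for a *2vec model, illustrating 'streaming' online
    learning: sentences are consumed one at a time, and training can be
    checkpointed and resumed. Not gensim code; internals are made up."""

    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vocab = {}          # word -> vector; fixed after build_vocab()
        self.seen_sentences = 0

    def build_vocab(self, corpus):
        # One full pass over the corpus is still required up front.
        for sentence in corpus:
            for word in sentence:
                if word not in self.vocab:
                    self.vocab[word] = [self.rng.uniform(-1, 1)
                                        for _ in range(self.dim)]

    def train(self, corpus):
        # Sentences are processed one by one; the whole corpus never
        # needs to be in memory at once.
        for sentence in corpus:
            for word in sentence:
                if word in self.vocab:   # out-of-vocab words are skipped
                    vec = self.vocab[word]
                    for i in range(self.dim):
                        vec[i] *= 0.99   # placeholder "update" step
            self.seen_sentences += 1

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

# Streaming workflow: train, checkpoint, resume with fresh data.
model = ToyModel()
model.build_vocab([["hello", "world"]])
model.train([["hello", "world"]])
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
model.save(path)
resumed = ToyModel.load(path)
resumed.train([["hello", "again"]])  # "again" is out-of-vocab: skipped
```

Note that in this sense of "online", training can be continued indefinitely, but words never seen during build_vocab() are silently dropped; that is exactly the limitation the vocabulary-expansion discussion below is about.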


mino98 commented Jan 26, 2018

> You're talking about changing the dictionary at training time, am I right (i.e. you pass fresh data and add words the model has never seen)?

Correct: I'm referring to training with documents that include words not present in the set of previously seen documents.

In real world experiments, this happens all the time. Imagine for example:

  1. start with a new model (i.e., all initial weights are random)
  2. build_vocab() + train() on a corpus of 100 books. Once completed, save() the model to storage.
  3. some time later, you get 20 new books that you want to include in the model. So you load() the model from disk, do build_vocab(update=True) on the new corpus, then train() on it and save() it back to disk.

Once in a while you need to repeat step 3. Say, for example, every week. Or every 12 hours. Or every time you get a decent amount of fresh user content... You clearly don't want (or cannot) re-train on the whole corpus (i.e., 100+20+... books) from zero every time.

Those "20 new books" will for sure contain new words not present in the initial 100 books, which will force you to extend the vocabulary, right?

(ps: thanks for this discussion. Maybe I'm misunderstanding how you do online learning in gensim?)
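The vocabulary-expansion step (3) above can be sketched with a self-contained toy. The build_vocab(corpus, update=True) call mirrors the shape of gensim's Doc2Vec API, but everything below is invented for illustration, not gensim's implementation:

```python
import random

class ToyVocabModel:
    """Toy illustration of vocabulary expansion. The build_vocab(...,
    update=True) signature mimics gensim's API shape; internals are
    made up for illustration."""

    def __init__(self, dim=4, seed=42):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vectors = {}  # word -> vector

    def build_vocab(self, corpus, update=False):
        if not update:
            self.vectors = {}  # fresh model: start from scratch
        for document in corpus:
            for word in document:
                if word not in self.vectors:
                    # New words get freshly initialised vectors; with
                    # update=True, vectors of known words are untouched.
                    self.vectors[word] = [self.rng.uniform(-1, 1)
                                          for _ in range(self.dim)]

# Initial corpus (the "100 books"):
model = ToyVocabModel()
model.build_vocab([["moby", "dick"], ["war", "and", "peace"]])
old_moby = list(model.vectors["moby"])

# Later, the "20 new books" arrive, containing unseen words:
model.build_vocab([["moby", "dick", "sequel"]], update=True)
```

The design point is that expansion must add rows for new words while leaving already-trained vectors intact, which is why it is harder to do well in an optimized implementation than plain resumable training.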

@menshikh-iv menshikh-iv changed the title Fixes the "slow" pure Python implementation of doc2vec (w/online-learning) Fix pure python implementation of doc2vec (w/online-learning). Partial fix 1019 Feb 1, 2018
@menshikh-iv menshikh-iv changed the title Fix pure python implementation of doc2vec (w/online-learning). Partial fix 1019 Fix pure python implementation of doc2vec (w/online-learning). Partial fix #1019 Feb 1, 2018
@menshikh-iv (Contributor)

@mino98 #1777 merged successfully, please resolve the merge conflict.

@menshikh-iv (Contributor)

Ping @mino98, are you planning to finish this PR?


mino98 commented Feb 14, 2018

Sorry for the delay @menshikh-iv; realistically, I won't have time for this for a while.

By the way, I assumed that #1777 was going to drop the pure-Python implementations completely, so wouldn't this PR become useless in the first place?

@menshikh-iv (Contributor)

@mino98 No, we still support pure Python (until the next major release), but it's completely up to you. If you can't finish this, please close the PR.

@mino98 mino98 closed this Feb 14, 2018
@menshikh-iv (Contributor)

Anyway, thanks for your good work @mino98 👍, and sorry for this situation.


mino98 commented Feb 14, 2018

Don't worry @menshikh-iv and thanks, Gensim is a great project.

However, it's not worth investing more effort to untangle this PR: the Python-only implementation is too slow for most practical purposes and is going to be dropped at the next major release anyway, so...

@menshikh-iv (Contributor)

@mino98 this PR can be used as the base for a Cython version 👍
