Update sklearn API for Gensim models #1473
Changes from 12 commits
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2011 Radim Rehurek <[email protected]>
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
+"""Scikit learn wrapper for gensim.
+Contains various gensim based implementations which match with scikit-learn standards.
+See [1] for complete set of conventions.
+[1] http://scikit-learn.org/stable/developers/
+"""
+
+from .basemodel import BaseTransformer  # noqa: F401
+from .ldamodel import LdaTransformer  # noqa: F401
Review thread on the `# noqa: F401` comments:

What is all that `# noqa: F401` about?

A line having `# noqa: F401` tells flake8 to skip its F401 check ("imported but unused") for that line.

It feels wrong to litter the gensim code with such constructs -- the gensim code is correct, this is essentially working around some idiosyncrasy (bug?) of an unrelated library. By the way, how come we don't get these errors from all the other `__init__` imports?

@piskvorky Because flake8 analyzes only the diff each time; if you change the same lines in any other PR, the same warnings will show up there too.

OK, so the fake warnings will go away immediately? Why did we add these comments then? Let's remove them.
+from .lsimodel import LsiTransformer  # noqa: F401
+from .rpmodel import RpTransformer  # noqa: F401
+from .ldaseqmodel import LdaSeqTransformer  # noqa: F401
+from .w2vmodel import W2VTransformer  # noqa: F401
+from .atmodel import AuthorTopicTransformer  # noqa: F401
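For readers unfamiliar with the flake8 code discussed in the thread above, here is a minimal illustration (hypothetical package and module names, not gensim code) of the F401 warning these comments suppress, plus an `__all__`-based alternative that avoids the inline comments:

# mypackage/__init__.py  -- hypothetical re-export module, not gensim code
from .lda import LdaTransformer  # noqa: F401  (silences "imported but unused")

# Without the trailing comment, `flake8 mypackage/__init__.py` reports roughly:
#   mypackage/__init__.py:1:1: F401 '.lda.LdaTransformer' imported but unused
# because the name is only re-exported, never used inside this module.

# Alternative that avoids per-line comments: declare the public API explicitly;
# pyflakes treats names listed in __all__ as used.
__all__ = ["LdaTransformer"]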
@@ -11,33 +11,16 @@
from abc import ABCMeta, abstractmethod


-class BaseSklearnWrapper(object):
+class BaseTransformer(object):
Review thread on the base class:

With this refactoring... I'm again wondering, is this class even helpful? Its ...

@gojomo scikit-learn indeed does not impose such a hierarchy among its classes. However, as you stated above, in addition to not having to implement the `get_params` and `set_params` methods in every wrapper, ...

But, the existing ... And then, the only benefit of this shared superclass is forcing an error as a reminder of what to implement. But if sklearn doesn't do that itself, is it that important to do so? It's also over-enforcing: it is not a strict requirement that all transformers support `partial_fit`.

In my head I expect all sklearn models to have ... Also ...

OK, providing a place to include an error/warning is a potential reason to implement `partial_fit`. But with ...

Hi, jumping in the conversation a little late here, but I do agree that the old ...
A related thread on `partial_fit`:

I would struggle to find a gensim algorithm for which I haven't seen a "is it possible to update the model post-training?" question on the mailing list...

Right, it's always asked, but not always supported. The scikit-learn convention for not supporting incremental training is to simply not implement `partial_fit` at all.

I agree with @gojomo's point of view here. The point of these adapters is to assume the "philosophy" of another package, whatever structure or idiosyncrasies that may have. The least amount of surprises and novelty for users of that package. It is not the place of the adapters to introduce their own philosophy on class structures, missing methods or whatever.

I have removed the ...
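To make the discussion above concrete, here is a minimal sketch (a toy class, not the gensim code) of what sklearn's own BaseEstimator already provides, which is why per-wrapper get_params/set_params implementations become unnecessary:

from sklearn.base import BaseEstimator, TransformerMixin

class ToyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, size=100, alpha=0.025):
        # BaseEstimator.get_params() discovers these names by inspecting the
        # __init__ signature, so no manual parameter bookkeeping is needed.
        self.size = size
        self.alpha = alpha

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

t = ToyTransformer(size=50)
print(t.get_params())      # {'alpha': 0.025, 'size': 50}
t.set_params(alpha=0.05)   # set_params() is likewise inherited from BaseEstimator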
    """
    Base sklearn wrapper module
    """
    __metaclass__ = ABCMeta

-    @abstractmethod
-    def get_params(self, deep=True):
-        pass
-
-    @abstractmethod
-    def set_params(self, **parameters):
-        """
-        Set all parameters.
-        """
-        for parameter, value in parameters.items():
-            setattr(self, parameter, value)
-        return self
-
-    @abstractmethod
-    def fit(self, X, y=None):
-        pass
-
-    @abstractmethod
-    def transform(self, docs, minimum_probability=None):
-        pass
-
-    @abstractmethod
-    def partial_fit(self, X):
-        pass
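As background for the "forcing an error as a reminder" and "over-enforcing" points in the thread above, a small sketch (toy classes, not gensim code) of what the abstract-method pattern being removed here enforces. Note that the `__metaclass__ = ABCMeta` class attribute only takes effect on Python 2; on Python 3 the metaclass has to be passed in the class statement as shown:

from abc import ABCMeta, abstractmethod

class Base(metaclass=ABCMeta):    # Python 3 spelling of __metaclass__ = ABCMeta
    @abstractmethod
    def partial_fit(self, X):
        pass

class Incomplete(Base):           # forgets to override partial_fit
    pass

try:
    Incomplete()
except TypeError as err:
    # e.g. "Can't instantiate abstract class Incomplete with abstract method partial_fit"
    print(err)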
@@ -15,10 +15,10 @@
from sklearn.exceptions import NotFittedError

from gensim import models
-from gensim.sklearn_integration import BaseSklearnWrapper
+from gensim.sklearn_api import BaseTransformer


-class SklW2VModel(BaseSklearnWrapper, TransformerMixin, BaseEstimator):
+class W2VTransformer(BaseTransformer, TransformerMixin, BaseEstimator):
""" | ||
Base Word2Vec module | ||
""" | ||
|
@@ -28,7 +28,7 @@ def __init__(self, size=100, alpha=0.025, window=5, min_count=5,
                 sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=10000):
        """
-        Sklearn wrapper for Word2Vec model. Class derived from gensim.models.Word2Vec
+        Sklearn wrapper for Word2Vec model. See gensim.models.Word2Vec for parameter details.
        """
        self.gensim_model = None
        self.size = size
@@ -51,24 +51,6 @@ def __init__(self, size=100, alpha=0.025, window=5, min_count=5,
        self.sorted_vocab = sorted_vocab
        self.batch_words = batch_words

-    def get_params(self, deep=True):
-        """
-        Returns all parameters as dictionary.
-        """
-        return {"size": self.size, "alpha": self.alpha, "window": self.window, "min_count": self.min_count,
-                "max_vocab_size": self.max_vocab_size, "sample": self.sample, "seed": self.seed,
-                "workers": self.workers, "min_alpha": self.min_alpha, "sg": self.sg, "hs": self.hs,
-                "negative": self.negative, "cbow_mean": self.cbow_mean, "hashfxn": self.hashfxn,
-                "iter": self.iter, "null_word": self.null_word, "trim_rule": self.trim_rule,
-                "sorted_vocab": self.sorted_vocab, "batch_words": self.batch_words}
-
-    def set_params(self, **parameters):
-        """
-        Set all parameters.
-        """
-        super(SklW2VModel, self).set_params(**parameters)
-        return self
-
    def fit(self, X, y=None):
        """
        Fit the model according to the given training data.

@@ -101,4 +83,5 @@ def transform(self, words):
        return np.reshape(np.array(X), (len(words), self.size))

    def partial_fit(self, X):
-        raise NotImplementedError("'partial_fit' has not been implemented for SklW2VModel")
+        raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since 'update()' function for Word2Vec class is experimental. "
+            "See usage and documentation of Word2Vec's 'update()' function if you need to train your Word2Vec model incrementally.")

Review thread on this error message:

Is this correct? Why would the `update()` function be experimental?

Yes, it's correct.

I don't even see any `update()` function on the Word2Vec class...

@piskvorky The original error message was not correct (since there is no `update()` function) ...

@chinmayapancholi13 Thanks for the clarification. So the concern is about initializing the vocabulary. That's a separate concern from incremental training, which is fully supported and not experimental (in fact, it's the same code as non-incremental training). I don't think ... So I see two directions (both non-critical, nice-to-have) here: ...

Ah, I see. So the issue is actually "updating the vocab" (and not "training") being experimental. I have updated the error message in ...
@chinmayapancholi13 please add yourself as the author. This is your baby :)
Hehe. Ok! 😄