[MRG] Poly trans: Issue #347 #367
Conversation
Thanks for this. Sorry for not laying this out in the issue, but I had a (hopefully) easier way of doing this in mind. This is entirely untested, so it may not end up working. In my mind, our transformer would use a kind of dummy fit. So our fit would look like

```python
def fit(self, X, y=None):
    self._transformer = sklearn.preprocessing.PolynomialFeatures()
    X_sample = ...
    self._transformer.fit(X_sample)
```

Then our transform is hopefully something simple like

```python
def transform(self, X, y=None):
    # for array
    X.map_blocks(self._transformer.fit_transform)
    # for dataframe
    X.map_partitions(self._transformer.fit_transform)
```

Again, I haven't really tested that, but hopefully it will work.
Hopefully this implementation is robust to that.
Again, not tested, but hopefully correct.
I wouldn't worry about this until it becomes a problem. |
No, I have to apologize for not having invested more time in the dask fundamentals. Anyway, thanks for your valuable comments. I'm really happy to learn stuff. |
Short Update: Following the lines of your suggestion works for dask arrays. |
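For reference, a minimal runnable sketch of that dummy-fit approach for dask arrays (the degree, shapes, and chunk handling here are illustrative, not the PR's actual code):

```python
import dask.array as da
import numpy as np
import sklearn.preprocessing

X = da.ones((6, 2), chunks=(3, 2))

# Fit on a tiny dummy sample so the learned attributes
# (e.g. n_output_features_) exist without touching the real data.
est = sklearn.preprocessing.PolynomialFeatures(degree=2)
est.fit(np.ones((1, X.shape[1]), dtype=X.dtype))

# Apply the fitted transform block-wise; chunks/dtype tell dask the
# output shape, since transform changes the number of columns.
n_out = est.n_output_features_
XP = X.map_blocks(
    est.transform, chunks=(X.chunks[0], (n_out,)), dtype=X.dtype
)
print(XP.shape)  # (6, 6): bias + 2 linear + 3 quadratic features
```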
Looks like there are some linting errors. You can check locally with `flake8 dask_ml/preprocessing tests/preprocessing`.
```python
    __doc__ = skdata.PolynomialFeatures.__doc__

    def __init__(self, degree=2, interaction_only=False, include_bias=True):
        super(PolynomialFeatures, self).__init__(degree, interaction_only, include_bias)
```
This `__init__` can just be removed if we aren't doing anything.
dask_ml/preprocessing/data.py
Outdated
```python
        super(PolynomialFeatures, self).__init__(degree, interaction_only, include_bias)

    def fit(self, X, y=None):
        """
```
We can just inherit the docstring from scikit-learn I think.
dask_ml/preprocessing/data.py
Outdated
```python
        return self

    def transform(self, X, y=None):
        """Transform data to polynomial features
```
Same: just inherit.
```python
        if isinstance(X, da.Array):
            X_sample = np.ones((1, X.shape[1]), dtype=X.dtype)

        self._transformer.fit(X_sample)
```
After we fit, we'll want to copy over the learned attributes. I think that `dask_ml._utils.copy_learned_attributes(self._transformer, self)` will work here.
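For intuition, a hypothetical re-implementation of what such a helper does (scikit-learn stores fitted state in trailing-underscore attributes; the helper name and exact behavior here are assumptions, not dask-ml's actual code):

```python
import numpy as np
import sklearn.preprocessing

def copy_learned_attributes(from_estimator, to_estimator):
    # Fitted state lives in instance attributes ending with "_",
    # e.g. n_output_features_; copy those onto the wrapper.
    for name, value in vars(from_estimator).items():
        if name.endswith("_") and not name.startswith("_"):
            setattr(to_estimator, name, value)

inner = sklearn.preprocessing.PolynomialFeatures(degree=2)
inner.fit(np.ones((1, 3)))
outer = sklearn.preprocessing.PolynomialFeatures(degree=2)
copy_learned_attributes(inner, outer)
print(outer.n_output_features_)  # 10 = 1 bias + 3 linear + 6 quadratic
```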
dask_ml/preprocessing/data.py
Outdated
```python
class PolynomialFeatures(skdata.PolynomialFeatures):

    __doc__ = skdata.PolynomialFeatures.__doc__
```
Since we inherit, this should be unnecessary (I may be wrong though).
I just copied the approach from the additional transformers. However,

```python
class C(skdata.PolynomialFeatures):
    pass

print(C.__doc__)
# None
```

shows that the docstring is not inherited for sklearn-based transformers (that inherit from BaseEstimator). I also checked that for standard Python classes the docstring is not passed to "children".
Hmm, it seems like only one of my local commits appears here. Anyway, the code seems to work: the transformed arrays coincide (even after starting with data frames). TODOs
Hope to commit tomorrow around the same time |
dask_ml/preprocessing/data.py
Outdated
```python
        meta = pd.DataFrame(
            columns=self._transformer.get_feature_names(X.columns), dtype=float
        )
        data = X.map_partitions(self._transformer.transform, meta=meta)
```
Hmm, slight issue here I think. `_transformer.transform` is going to (AFAICT) always return an ndarray. So we have a choice:

- mimic scikit-learn, and return a dask array here (so just `XP = X.map_partitions(self._transformer.transform)`).
- add a keyword like `preserve_dask_dataframe=True` to control whether we should convert the dask array back to a dask dataframe.

Right now, I think that just returning a dask array is fine here.
I'll have to think about this. In the other approach I have, I didn't use meta; there, map_partitions returned a dask dataframe. I'll have a deeper look into this. Maybe it's just too late ;)
Maybe make sure you're on a recent version of dask. Returning a dask array from `DataFrame.map_partitions` is somewhat new.
Unfortunately, I was too optimistic, probably some artifacts in jupyter.
Now
throws - in jupyter - an AttributeError
However
provides an output, and the arrays seem to be correct. However, the index isn't. Any ideas? Besides: I'd assume there should be some documentation added, right? Any wishes on what to put where? |
The AttributeError indicates that something is wrong with the metadata. Haven't looked to see if it's a dask bug or not. However, given my earlier comment we may not need to worry about it. If you don't pass |
For documentation you can add it to the list of "scikit-learn clones" in |
It seems that falling back to
Additionally, I realized that map_blocks is (of course) not predicting the shape correctly (should add a test here as well). However, I just discovered that |
Okay, so I investigated a bit and two (maybe three) options unfold.
I'm not sure what is computationally more expensive. Set Up
Essentially, the issue is that the concatenation of
For a dask array, we may provide the correct chunk sizes in the map_blocks function, and we obtain the correct output shapes. However, we also need to pass a dtype!
Finally, the suggestion and issues for data frames:
We see that there is no index information for the map_partitions approach (a compute shows duplicate index values). Moreover, a third approach could be to use the dataframe constructor, but we would need to calculate the divisions. So: which approach do you prefer? What is more expensive, resetting the index or calculating the number of rows? The casting to float also may not be really optimal. One could do this a bit more intelligently, and just choose the "biggest" (in terms of subclass) of bool, int and float depending on which types appear. Another issue is the blocks for the dask arrays. If one wants to remove this assumption, it's probably best to rewrite the transformer completely from scratch.
Edit: I just realized that
I also realized that chunks of the form
So, I'll try to provide a working solution by the end of this weekend. |
Any comments on the last commit? |
dask_ml/preprocessing/data.py
Outdated
```python
        if isinstance(X, da.Array):
            n_cols = len(self._transformer.get_feature_names())
            # might raise Error if columns are separated by chunks
```
Can you explain this error to me? Is the issue something like

```python
X = da.random.uniform(size=(100, 10), chunks=(25, 5))
```

i.e. we have two blocks on the features?
Typically, we use dask_ml.utils.check_array to handle this. There's an accept_multiple_blocks keyword.
I think that we want to raise an exception in this case, since the transform would need to be different for the different blocks.
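Spelled out, such a guard could look like this (the function name and message are illustrative; dask-ml's real check lives in its check_array helper):

```python
import dask.array as da

def raise_on_multiple_column_blocks(X):
    # PolynomialFeatures mixes columns, so applying the transform per
    # block would be wrong if the features are split across blocks.
    if X.ndim == 2 and len(X.chunks[1]) > 1:
        raise ValueError(
            "X is chunked along the feature axis; rechunk with "
            "X.rechunk({1: -1}) before transforming."
        )

X = da.random.uniform(size=(100, 10), chunks=(25, 5))
try:
    raise_on_multiple_column_blocks(X)
except ValueError as exc:
    print("rejected:", exc)

raise_on_multiple_column_blocks(X.rechunk({1: -1}))  # fine: one column block
```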
Exactly. Sorry for not directly raising an error here. Thanks for the hint on the accept_multiple_blocks.
Okay, added check_array. Btw, should I also use check_array for the dask dataframe?
Sorry, I missed the update.
dask_ml/preprocessing/data.py
Outdated
```python
        elif isinstance(X, pd.DataFrame):
            data = X.pipe(self._transformer.transform)
            columns = self._transformer.get_feature_names(X.columns)
            XP = pd.DataFrame(data=data, columns=columns)
```
Slight preference to match scikit-learn exactly here, since this is a clone. If you want, you could add a keyword `preserve_dataframe` to control whether we should convert the array to a dataframe.
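A sketch of how such a keyword could behave for the pandas path (the function shape and the placeholder column names are assumptions, not the PR's code):

```python
import numpy as np
import pandas as pd
import sklearn.preprocessing

def transform(est, X, preserve_dataframe=False):
    # Default: mimic scikit-learn and return an ndarray.
    data = est.transform(X)
    if preserve_dataframe and isinstance(X, pd.DataFrame):
        cols = ["f%d" % i for i in range(data.shape[1])]  # placeholder names
        return pd.DataFrame(data, columns=cols, index=X.index)
    return data

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
est = sklearn.preprocessing.PolynomialFeatures(degree=2).fit(df)

arr = transform(est, df)                             # ndarray, like sklearn
frame = transform(est, df, preserve_dataframe=True)  # DataFrame, index kept
print(type(arr).__name__, frame.shape)  # ndarray (2, 6)
```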
Since this will force me to alter the docstrings, I won't be able to tackle the issue before the weekend. I think I'll have enough time on Sunday.
dask_ml/preprocessing/data.py
Outdated
```python
        elif isinstance(X, dd.DataFrame):
            data = X.map_partitions(self._transformer.transform)
            columns = self._transformer.get_feature_names(X.columns)
            XP = dd.from_dask_array(data, columns)
```
Same comment on `preserve_dataframe`.
Any idea what's currently wrong with the build process? Docstrings will be added soon. |
I think that @jrbourbeau is handling this now at #382 |
So what to do with the |
Another question: as per the contributing guidelines, dask transformers should have a columns keyword. Should this be added here as well? Maybe one could add a follow-up issue. |
No, I think that is unnecessary now that ColumnTransformer is in scikit-learn. |
Fixed a merge conflict and pushed. Hopefully the CI is passing now. |
tests/preprocessing/test_data.py
Outdated
```python
        b = spp.PolynomialFeatures()

        # pandas dataframe
        res_pdf = a.fit_transform(df.compute()).values
```
What's the behavior for a dask dataframe? Can you test that (if it isn't tested elsewhere)?
Ah, I see test_daskdf_transform below. Any reason not to include that here? We could parametrize the test by daskify:

```python
if not daskify:
    df = df.compute()
...
if daskify:
    assert dask.is_dask_collection(res_pdf)
assert_eq_ar(res_pdf, res_b)
```
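Spelled out, the parametrized test could look roughly like this (the body is a self-contained sketch using plain sklearn plus map_blocks; the real test uses the dask-ml estimator and helpers like assert_eq_ar):

```python
import dask
import dask.array as da
import numpy as np
import pandas as pd
import pytest
import sklearn.preprocessing as spp

@pytest.mark.parametrize("daskify", [False, True])
def test_polynomial_features(daskify):
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
    est = spp.PolynomialFeatures(degree=2)
    expected = est.fit_transform(df.to_numpy())

    if daskify:
        # One column block, as required (see the chunking discussion above).
        X = da.from_array(df.to_numpy(), chunks=(2, 2)).rechunk({1: -1})
        res = X.map_blocks(
            est.transform,
            chunks=(X.chunks[0], (expected.shape[1],)),
            dtype=float,
        )
        assert dask.is_dask_collection(res)
        res = res.compute()
    else:
        res = est.transform(df.to_numpy())

    np.testing.assert_array_equal(res, expected)
```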
Originally, I didn't combine it because of the dataframe comparison in test_daskdf_transform. But I think using daskify will solve the issue.
tests/preprocessing/test_data.py
Outdated
```python
        assert_eq_ar(trans_dask_df.compute().values, res_b)
        assert_eq_df(trans_dask_df.compute().reset_index(drop=True), res_pandas_df)

    def test_df_dont_preserve_df(self):
```
This test feels redundant. I believe that everything here is covered elsewhere.
It feels redundant, but actually it's not. So far I haven't considered a df as input with an array as output. An earlier implementation didn't have that, as I was always returning a dataframe if the input was also a dataframe. I'll just add those tests in test_df_transform if I manage to daskify that. But maybe it's too cautious.
I just got a Moreover, something like
seems to work, but inspecting
gives
I thought I resolved that error. Edit: I had something working in between; don't know why it is not in anymore. Seems to be related to the meta argument. Will try to fix that issue. Not being able to inspect the result is not acceptable from my perspective. I will add an additional test concerning that issue. |
Sorry I missed that. That's because you're using |
Sorry for mentioning that I'm aware of that. The next two days are a bit busy. I hope I'll find some time on Thursday or Friday to solve the last remaining issues and modify the tests accordingly. Moreover, I'll add a test that ensures that a dataframe can be inspected. |
I found the "bug" in dask. I'll check if dask already has an issue concerning that |
that seems fine. |
The upstream issue seems pretty solvable as well if anyone wants to take a
shot.
On Wed, Oct 10, 2018 at 2:37 PM Tom Augspurger wrote:

> So a hot fix is to ensure that the columns of the dataframe don't contain the 1 and maybe raise a warning. Other suggestions?
>
> that seems fine.
|
So I'd suggest that I try to solve the issue in dask, as @mrocklin suggested. If I'm not able to fix this quickly, I'd provide a solution with the hot fix. Would this be okay, or should we rather close this issue first with the hot fix and create a new task to remove the fix once dask has the "bug" fixed? |
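For context, the hot fix mentioned above could be as small as this sketch (the premise, taken from the discussion, is that the generated bias feature is named "1", so an input column with that name would collide; the function name and message are assumptions):

```python
import warnings

def warn_on_colliding_columns(columns):
    # PolynomialFeatures names its bias feature "1"; an input column
    # with the same name would clash with the generated feature names.
    if any(str(c) == "1" for c in columns):
        warnings.warn("Input column '1' collides with the bias feature name.")

warn_on_colliding_columns(["a", "b"])  # silent
warn_on_colliding_columns(["a", 1])    # emits a UserWarning
```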
Also works for numpy arrays. Tests are provided for dask arrays with known shape. Additionally, a failing test for the same array with unknown chunk size and shape is provided.
Dask dataframes are not working properly; I assume something about missing metadata. A failing test is provided; another example will be provided.
This fixes the failing test.
This reverts commit 6517840.
The shape of the output arrays/frames should be handled correctly.
Added entry to docs/source/modules/api.rst. Added entry to docs/source/preprocessing.rst.
We prefer to raise an error if the dask array has blocks not covering the full width, as passing them to sklearn will not produce the correct output.
Documentation still missing.
With the newest version of dask, the representation bug is gone. Condenses the tests and drops some duplicated tests.
Any hints on the coverage bug? |
I think it's not actually coverage, I think that it's a flake issue. I recommend running cc @TomAugspurger feedback on the black process ^^ It's not entirely clear to users what went wrong when things go wrong, or how to fix it. |
Hmm, I ran black and pushed the changes. Will check tomorrow morning again and push the update. Thanks for your comments. |
Damn, I rebased and did not run black again. Maybe it's too late today. Will push the changes tomorrow. Thanks again. |
Btw, flake8 was fine; the issue was only related to black. Hope everything is fine now. Coverage also seems to be OK/good enough. |
@TomAugspurger and @mrocklin I think this could be merged right now. Any further comments? With my latest addition in dask, the representation bug is also gone. So assuming one has the nightly dask, this will work without issues. Is there any suggestion how/if to make users aware of that? |
Looks good, thanks @datajanko. Dask-ML tends to require fairly recent versions of dask. We'll likely bump the required version of dask shortly after the next release. But since it's just a bug in the repr I don't think we're in a rush. |
At the current stage, the pull request implements PolynomialFeatures in dask-ml for dask arrays with known shape and chunk size.
Implements Issue #347