[MRG] Poly trans: Issue #347 #367
Conversation
Thanks for this. Sorry for not laying this out in the issue, but I had a (hopefully) easier way of doing this in mind. This is entirely untested, so it may not end up working. In my mind, our transformer would use a kind of dummy fit. So our fit would look like

```python
def fit(self, X, y=None):
    self._transformer = sklearn.preprocessing.PolynomialFeatures()
    X_sample = ...
    self._transformer.fit(X_sample)
```

Then our transform is hopefully something simple like

```python
def transform(self, X, y=None):
    # for array
    X.map_blocks(self._transformer.fit_transform)
    # for dataframe
    X.map_partitions(self._transformer.fit_transform)
```

Again, I haven't really tested that, but hopefully it will work.
Hopefully this implementation is robust to that.
Again, not tested, but hopefully correct.
I wouldn't worry about this until it becomes a problem. |
No, I have to apologize for not having invested more time in the dask fundamentals. Anyway, thanks for your valuable comments. I'm really happy to learn stuff. |
Short Update: Following the lines of your suggestion works for dask arrays. |
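For reference, a minimal runnable sketch of that dummy-fit approach for dask arrays (the degree, shapes, and chunk handling here are illustrative, not the PR's actual code):

```python
import dask.array as da
import numpy as np
import sklearn.preprocessing

X = da.ones((6, 2), chunks=(3, 2))

# Fit on a tiny dummy sample so the learned attributes
# (e.g. n_output_features_) exist without touching the real data.
est = sklearn.preprocessing.PolynomialFeatures(degree=2)
est.fit(np.ones((1, X.shape[1]), dtype=X.dtype))

# Apply the fitted transform block-wise; chunks/dtype tell dask the
# output shape, since transform changes the number of columns.
n_out = est.n_output_features_
XP = X.map_blocks(
    est.transform, chunks=(X.chunks[0], (n_out,)), dtype=X.dtype
)
print(XP.shape)  # (6, 6): bias + 2 linear + 3 quadratic features
```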
Looks like there are some linting errors. You can check locally with `flake8 dask_ml/preprocessing tests/preprocessing`.
```python
    __doc__ = skdata.PolynomialFeatures.__doc__

    def __init__(self, degree=2, interaction_only=False, include_bias=True):
        super(PolynomialFeatures, self).__init__(degree, interaction_only, include_bias)
```
This `__init__` can just be removed if we aren't doing anything.
dask_ml/preprocessing/data.py
Outdated
```python
        super(PolynomialFeatures, self).__init__(degree, interaction_only, include_bias)

    def fit(self, X, y=None):
        """
```
We can just inherit the docstring from scikit-learn I think.
dask_ml/preprocessing/data.py
Outdated
```python
        return self

    def transform(self, X, y=None):
        """Transform data to polynomial features
```
Same: just inherit.
```python
        if isinstance(X, da.Array):
            X_sample = np.ones((1, X.shape[1]), dtype=X.dtype)

        self._transformer.fit(X_sample)
```
After we fit, we'll want to copy over the learned attributes. I think that `dask_ml._utils.copy_learned_attributes(self._transformer, self)` will work here.
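For intuition, a hypothetical re-implementation of what such a helper does (scikit-learn stores fitted state in trailing-underscore attributes; the helper name and exact behavior here are assumptions, not dask-ml's actual code):

```python
import numpy as np
import sklearn.preprocessing

def copy_learned_attributes(from_estimator, to_estimator):
    # Fitted state lives in instance attributes ending with "_",
    # e.g. n_output_features_; copy those onto the wrapper.
    for name, value in vars(from_estimator).items():
        if name.endswith("_") and not name.startswith("_"):
            setattr(to_estimator, name, value)

inner = sklearn.preprocessing.PolynomialFeatures(degree=2)
inner.fit(np.ones((1, 3)))
outer = sklearn.preprocessing.PolynomialFeatures(degree=2)
copy_learned_attributes(inner, outer)
print(outer.n_output_features_)  # 10 = 1 bias + 3 linear + 6 quadratic
```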
dask_ml/preprocessing/data.py
Outdated
```python
class PolynomialFeatures(skdata.PolynomialFeatures):

    __doc__ = skdata.PolynomialFeatures.__doc__
```
Since we inherit, this should be unnecessary (I may be wrong though).
I just copied the approach from the additional transformers. However,

```python
class C(skdata.PolynomialFeatures):
    pass

print(C.__doc__)
# None
```

shows that the docstring is not inherited for sklearn-based transformers (that inherit from BaseEstimator). I also checked that for standard Python classes the docstring is not passed to "children".
Hmm, it seems like only one of my local commits appears here. Anyway, the code seems to work: the transformed arrays coincide (even after starting with data frames). TODOs
Hope to commit tomorrow around the same time |
dask_ml/preprocessing/data.py
Outdated
```python
        meta = pd.DataFrame(
            columns=self._transformer.get_feature_names(X.columns), dtype=float
        )
        data = X.map_partitions(self._transformer.transform, meta=meta)
```
Hmm, slight issue here I think. `_transformer.transform` is going to (AFAICT) always return an ndarray. So we have a choice:

- mimic scikit-learn, and return a dask array here (so just `XP = X.map_partitions(self._transformer.transform)`).
- add a keyword like `preserve_dask_dataframe=True` to control whether we should convert the dask array back to a dask dataframe.

Right now, I think that just returning a dask array is fine here.
I'll have to think about this. In the other approach I have, I didn't use meta; there, map_partitions returned a dask dataframe. I'll have a deeper look into this. Maybe it's just too late ;)
Maybe make sure you're on a recent version of dask. Returning a dask array from `DataFrame.map_partitions` is somewhat new.
Unfortunately, I was too optimistic, probably some artifacts in jupyter.
Now
throws - in jupyter - an AttributeError
However
provides an output, and the arrays seem to be correct. However, the index isn't. Any ideas? Besides: I'd assume there should be some documentation added, right? Any wishes on what to put where? |
The AttributeError indicates that something is wrong with the metadata. Haven't looked to see if it's a dask bug or not. However, given my earlier comment we may not need to worry about it. If you don't pass |
For documentation you can add it to the list of "scikit-learn clones" in |
It seems that falling back to
Additionally, I realized that map_blocks is (of course) not predicting the shape correctly (should add a test here as well). However, I just discovered that |
Okay, so I investigated a bit and two (maybe three) options unfold.
I'm not sure what is computationally more expensive. Set Up
Essentially, the issue is that the concatenation of
For a dask array, we may provide the correct chunk sizes in the map_blocks function, and we obtain the correct output shapes. However, we also need to pass a dtype!
Finally, the suggestion and issues for data frames:
We see that there is no index information for the map_partitions approach (a compute shows duplicate index values). Moreover, a third approach could be to use the dataframe constructor, but we would need to calculate the divisions. So: which approach do you prefer? What is more expensive, resetting the index or calculating the number of rows? The casting to float also may not be really optimal. One could do this a bit more intelligently, and just choose the "biggest" (in terms of subclass) of bool, int and float depending on which types appear. Another issue is the blocks for the dask arrays. If one wants to remove this assumption, it's probably best to rewrite the transformer completely from scratch.
Edit: I just realized that
I also realized that chunks of the form
So, I'll try to provide a working solution by the end of this weekend. |
Any comments on the last commit? |
dask_ml/preprocessing/data.py
Outdated
```python
        if isinstance(X, da.Array):
            n_cols = len(self._transformer.get_feature_names())
            # might raise Error if columns are separated by chunks
```
Can you explain this error to me? Is the issue something like

```python
X = da.random.uniform(size=(100, 10), chunks=(25, 5))
```

i.e. we have two blocks on the features?
Typically, we use dask_ml.utils.check_array to handle this. There's an accept_multiple_blocks keyword.
I think that we want to raise an exception in this case, since the transform would need to be different for the different blocks.
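Spelled out, such a guard could look like this (the function name and message are illustrative; dask-ml's real check lives in its check_array helper):

```python
import dask.array as da

def raise_on_multiple_column_blocks(X):
    # PolynomialFeatures mixes columns, so applying the transform per
    # block would be wrong if the features are split across blocks.
    if X.ndim == 2 and len(X.chunks[1]) > 1:
        raise ValueError(
            "X is chunked along the feature axis; rechunk with "
            "X.rechunk({1: -1}) before transforming."
        )

X = da.random.uniform(size=(100, 10), chunks=(25, 5))
try:
    raise_on_multiple_column_blocks(X)
except ValueError as exc:
    print("rejected:", exc)

raise_on_multiple_column_blocks(X.rechunk({1: -1}))  # fine: one column block
```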
Exactly. Sorry for not directly raising an error here. Thanks for the hint on the accept_multiple_blocks.
Okay, added check_array. Btw, should I also use check_array for the dask dataframe?
Sorry, I missed the update.
dask_ml/preprocessing/data.py
Outdated
```python
        elif isinstance(X, pd.DataFrame):
            data = X.pipe(self._transformer.transform)
            columns = self._transformer.get_feature_names(X.columns)
            XP = pd.DataFrame(data=data, columns=columns)
```
Slight preference to match scikit-learn exactly here, since this is a clone. If you want, you could add a keyword `preserve_dataframe` to control whether we should convert the array to a dataframe.
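A sketch of how such a keyword could behave for the pandas path (the function shape and the placeholder column names are assumptions, not the PR's code):

```python
import numpy as np
import pandas as pd
import sklearn.preprocessing

def transform(est, X, preserve_dataframe=False):
    # Default: mimic scikit-learn and return an ndarray.
    data = est.transform(X)
    if preserve_dataframe and isinstance(X, pd.DataFrame):
        cols = ["f%d" % i for i in range(data.shape[1])]  # placeholder names
        return pd.DataFrame(data, columns=cols, index=X.index)
    return data

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
est = sklearn.preprocessing.PolynomialFeatures(degree=2).fit(df)

arr = transform(est, df)                             # ndarray, like sklearn
frame = transform(est, df, preserve_dataframe=True)  # DataFrame, index kept
print(type(arr).__name__, frame.shape)  # ndarray (2, 6)
```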
Since this will force me to alter the docstrings, I won't be able to tackle the issue before the weekend. I think I'll have enough time on Sunday.
dask_ml/preprocessing/data.py
Outdated
```python
        elif isinstance(X, dd.DataFrame):
            data = X.map_partitions(self._transformer.transform)
            columns = self._transformer.get_feature_names(X.columns)
            XP = dd.from_dask_array(data, columns)
```
Same comment on `preserve_dataframe`.
Any idea what's currently wrong with the build process? Docstrings will be added soon. |
I think that @jrbourbeau is handling this now at #382 |
So what to do with the |
Another question: as per the contributing guidelines, dask transformers should have a columns keyword. Should this be added here as well? Maybe one could add a follow-up issue. |
No, I think that is unnecessary now that ColumnTransformer is in scikit-learn. |
Fixed a merge conflict and pushed. Hopefully the CI is passing now. |
tests/preprocessing/test_data.py
Outdated
```python
        b = spp.PolynomialFeatures()

        # pandas dataframe
        res_pdf = a.fit_transform(df.compute()).values
```
What's the behavior for a dask dataframe? Can you test that (if it isn't tested elsewhere)?
Ah, I see test_daskdf_transform below. Any reason not to include that here? We could parametrize the test by daskify:

```python
if not daskify:
    df = df.compute()
...
if daskify:
    assert dask.is_dask_collection(res_pdf)
assert_eq_ar(res_pdf, res_b)
```
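Spelled out, the parametrized test could look roughly like this (the body is a self-contained sketch using plain sklearn plus map_blocks; the real test uses the dask-ml estimator and helpers like assert_eq_ar):

```python
import dask
import dask.array as da
import numpy as np
import pandas as pd
import pytest
import sklearn.preprocessing as spp

@pytest.mark.parametrize("daskify", [False, True])
def test_polynomial_features(daskify):
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
    est = spp.PolynomialFeatures(degree=2)
    expected = est.fit_transform(df.to_numpy())

    if daskify:
        # One column block, as required (see the chunking discussion above).
        X = da.from_array(df.to_numpy(), chunks=(2, 2)).rechunk({1: -1})
        res = X.map_blocks(
            est.transform,
            chunks=(X.chunks[0], (expected.shape[1],)),
            dtype=float,
        )
        assert dask.is_dask_collection(res)
        res = res.compute()
    else:
        res = est.transform(df.to_numpy())

    np.testing.assert_array_equal(res, expected)
```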
Originally, I didn't combine it because of the dataframe comparison in test_daskdf_transform. But I think using daskify will solve the issue.
tests/preprocessing/test_data.py
Outdated
```python
        assert_eq_ar(trans_dask_df.compute().values, res_b)
        assert_eq_df(trans_dask_df.compute().reset_index(drop=True), res_pandas_df)

    def test_df_dont_preserve_df(self):
```
This test feels redundant. I believe that everything here is covered elsewhere.
It feels redundant, but actually it's not. So far I haven't considered a df as input with an array as output. An earlier implementation didn't have that, as I was always returning a dataframe if the input was also a dataframe. I'll just add those tests in test_df_transform if I manage to daskify that. But maybe it's too cautious.
I just got a Moreover, something like
seems to work, but inspecting
gives
I thought I resolved that error. Edit: I had something working in between; don't know why it is not in anymore. Seems to be related to the meta argument. Will try to fix that issue. Not being able to inspect the result is not acceptable from my perspective. I will add an additional test concerning that issue. |
Sorry I missed that. That's because you're using |
Sorry for mentioning that I'm aware of that. The next two days are a bit busy. I hope I'll find some time on Thursday or Friday to solve the last remaining issues and modify the tests accordingly. Moreover, I'll add a test that ensures that a dataframe can be inspected. |
I found the "bug" in dask. I'll check if dask already has an issue concerning that |
that seems fine. |
The upstream issue seems pretty solvable as well if anyone wants to take a
shot.
On Wed, Oct 10, 2018 at 2:37 PM Tom Augspurger wrote:

> So a hot fix is to ensure that the columns of the dataframe don't contain the 1 and maybe raise a warning. Other suggestions?
>
> that seems fine.
|
So I'd suggest that I try to solve the issue in dask, as @mrocklin suggested. If I'm not able to fix this quickly, I'd provide a solution with the hot fix. Would this be okay, or should we rather close this issue first with the hot fix and create a new task to remove the fix once dask has the "bug" fixed? |
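For context, the hot fix mentioned above could be as small as this sketch (the premise, taken from the discussion, is that the generated bias feature is named "1", so an input column with that name would collide; the function name and message are assumptions):

```python
import warnings

def warn_on_colliding_columns(columns):
    # PolynomialFeatures names its bias feature "1"; an input column
    # with the same name would clash with the generated feature names.
    if any(str(c) == "1" for c in columns):
        warnings.warn("Input column '1' collides with the bias feature name.")

warn_on_colliding_columns(["a", "b"])  # silent
warn_on_colliding_columns(["a", 1])    # emits a UserWarning
```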
Also works for numpy arrays. Tests are provided for dask arrays with known shape. Additionally, a failing test for the same array with unknown chunk size and shape is provided.
Dask dataframes are not working properly; I assume something about missing metadata. A failing test is provided; another example will be provided.
This fixes the failing test.
This reverts commit 6517840.
The shape of the output arrays/frames should be handled correctly.
Added entry to docs/source/modules/api.rst. Added entry to docs/source/preprocessing.rst.
We prefer to raise an error if the dask array has blocks not covering the full width, as passing them to sklearn will not produce the correct output.
Documentation still missing.
With the newest version of dask, the representation bug is gone. Condenses the tests and drops some duplicated tests.
Any hints on the coverage bug? |
I think it's not actually coverage, I think that it's a flake issue. I recommend running cc @TomAugspurger feedback on the black process ^^ It's not entirely clear to users what went wrong when things go wrong, or how to fix it. |
Hmm, I ran black and pushed the changes. Will check tomorrow morning again and push the update. Thanks for your comments. |
Damn, I rebased and did not run black again. Maybe it's too late today. Will push the changes tomorrow. Thanks again. |
Btw, flake8 was fine; the issue was only related to black. Hope everything is fine now. Coverage also seems to be OK/good enough. |
@TomAugspurger and @mrocklin I think this could be merged right now. Any further comments? With my latest addition in dask, the representation bug is also gone. So assuming one has the nightly dask, this will work without issues. Is there any suggestion how/if to make users aware of that? |
Looks good, thanks @datajanko. Dask-ML tends to require fairly recent versions of dask. We'll likely bump the required version of dask shortly after the next release. But since it's just a bug in the repr I don't think we're in a rush. |
At the current stage, the pull request implements PolynomialFeatures in dask-ml for dask arrays with known shape and chunk size.
Implements Issue #347