
[MRG] Poly trans: Issue #347 #367

Merged (16 commits) on Oct 14, 2018

Conversation

@datajanko (Contributor) commented Sep 16, 2018

At the current stage, the pull request implements PolynomialFeatures in dask-ml for dask arrays with known shape and chunk size.

Implements Issue #347

@TomAugspurger (Member)

Thanks for this.

Sorry for not laying this out in the issue, but I had a (hopefully) easier way of doing this in my mind. This is entirely untested, so it may not end up working.

In my mind, our transformer would use a kind of dummy sklearn.preprocessing.PolynomialFeatures internally, and we would use X.map_blocks or X.map_partitions, passing that dummy transformer's .transform method.

So our fit would look like:

def fit(self, X, y=None):
    self._transformer = sklearn.preprocessing.PolynomialFeatures()
    X_sample = ...
    self._transformer.fit(X_sample)
    return self

X_sample is the kind of tricky thing. For dask.dataframe we would use X._meta, which has the right shape and dtypes. For dask array, I think we'd be fine with something like np.ones((1, X.shape[1]), dtype=X.dtype) (untested).
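Spelled out, a minimal sketch of that X_sample construction (untested; the helper name _sample_for is made up for illustration):

import numpy as np
import dask.dataframe as dd

def _sample_for(X):
    if isinstance(X, dd.DataFrame):
        # an empty frame with the right columns and dtypes
        return X._meta
    # dask or numpy array: one dummy row with the right width and dtype
    return np.ones((1, X.shape[1]), dtype=X.dtype)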

Then our transform is hopefully something simple like:

def transform(self, X, y=None):
    if isinstance(X, da.Array):
        # for a dask array
        return X.map_blocks(self._transformer.transform)
    # for a dask dataframe
    return X.map_partitions(self._transformer.transform)

Again, I haven't really tested that, but hopefully it will work.

> What should I do with chunks of unknown size/shape? This can happen if you have a dask dataframe and call .values.

Hopefully this implementation is robust to that.

> Concerning DataFrames:

Again, not tested, but hopefully correct.

> Sparse matrices/frames.

I wouldn't worry about this until it becomes a problem.

@datajanko (Contributor, Author)

No, I have to apologize for not having invested more time in the dask fundamentals.
I was a bit distracted by realizing that the sklearn transformer is not stateless, which made me think that the "mapping" approach is not feasible.

Anyway, thanks for your valuable comments. I'm really happy to learn stuff.

@datajanko (Contributor, Author)

Short Update: Following the lines of your suggestion works for dask arrays.

@TomAugspurger (Member) left a comment:

Looks like there are some linting errors. You can check locally with flake8 dask_ml/preprocessing tests/preprocessing.

__doc__ = skdata.PolynomialFeatures.__doc__

def __init__(self, degree=2, interaction_only=False, include_bias=True):
    super(PolynomialFeatures, self).__init__(degree, interaction_only, include_bias)
@TomAugspurger (Member):

This __init__ can just be removed if we aren't doing anything.

    super(PolynomialFeatures, self).__init__(degree, interaction_only, include_bias)

def fit(self, X, y=None):
    """
@TomAugspurger (Member):

We can just inherit the docstring from scikit-learn I think.

    return self

def transform(self, X, y=None):
    """Transform data to polynomial features
@TomAugspurger (Member):

Same: just inherit.

if isinstance(X, da.Array):
    X_sample = np.ones((1, X.shape[1]), dtype=X.dtype)

self._transformer.fit(X_sample)
@TomAugspurger (Member):

After we fit, we'll want to copy over the learned attributes. I think that dask_ml._utils.copy_learned_attributes(self._transformer, self) will work here.
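For reference, a rough sketch of what copy_learned_attributes is assumed to do here: copy every fitted attribute (the trailing-underscore names scikit-learn uses) from one estimator onto another.

def copy_learned_attributes(from_estimator, to_estimator):
    # fitted attributes follow the scikit-learn convention of a trailing
    # underscore, e.g. n_input_features_, n_output_features_
    fitted = {k: v for k, v in vars(from_estimator).items() if k.endswith("_")}
    for k, v in fitted.items():
        setattr(to_estimator, k, v)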


class PolynomialFeatures(skdata.PolynomialFeatures):

    __doc__ = skdata.PolynomialFeatures.__doc__
@TomAugspurger (Member):

Since we inherit, this should be unnecessary (I may be wrong though).

@datajanko (Contributor, Author) commented Sep 18, 2018:

I just copied the approach from the additional transformers. However,

class C(skdata.PolynomialFeatures):
    pass

print(C.__doc__)  # prints None

shows that the docstring is not inherited for sklearn-based transformers (those that inherit from BaseEstimator). I also checked that for standard Python classes the docstring is not passed on to "children".
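The flip side is that assigning __doc__ explicitly does work, which is presumably why the other transformers do it:

class D(skdata.PolynomialFeatures):
    __doc__ = skdata.PolynomialFeatures.__doc__

print(D.__doc__ is None)  # False: the docstring is forwarded now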

@datajanko (Contributor, Author)

Hmm, it seems like only one of my local commits appears here. Anyway.

Code seems to work: The transformed arrays coincide (even after starting with data frames).

TODOs

  • Documentation
  • Code Cleanup
  • Further tests, e.g. checking that data frames (pandas and dask) give the same result; testing fit more extensively; and maybe some other ideas I'll have

Hope to commit tomorrow around the same time

meta = pd.DataFrame(
    columns=self._transformer.get_feature_names(X.columns), dtype=float
)
data = X.map_partitions(self._transformer.transform, meta=meta)
@TomAugspurger (Member):

Hmm, slight issue here I think.

_transformer.transform is going to (AFAICT) always return an ndarray. So we have a choice:

  1. Mimic scikit-learn and return a dask array here (so just XP = X.map_partitions(self._transformer.transform)).
  2. Add a keyword like preserve_dask_dataframe=True to control whether we should convert the dask array back to a dask dataframe.

Right now, I think that just returning a dask array is fine here.

@datajanko (Contributor, Author):

I'll have to think about this. In the other approach I had, I didn't use meta; there, map_partitions returned a dask dataframe. I'll have a deeper look into this. Maybe it's just too late ;)

@TomAugspurger (Member):

Maybe make sure you're on a recent version of dask. Returning a dask array from DataFrame.map_partitions is somewhat new.

@datajanko (Contributor, Author)

Unfortunately, I was too optimistic; probably some artifacts in Jupyter.
As the test indicates, I'm creating duplicate index entries.
An example to reproduce:

from dask import array as da
from dask_ml.preprocessing import PolynomialFeatures
from dask_ml.datasets import make_classification
import dask.dataframe as ddf
X, y = make_classification(chunks=50)
df = X.to_dask_dataframe().rename(columns=str)
p = PolynomialFeatures()

Now

p.fit_transform(df)

throws an AttributeError in Jupyter:

AttributeError: 'DataFrame' object has no attribute '_repr_data'

However

p.fit_transform(df).compute()

provides an output, and the arrays seem to be correct. However, the index isn't.
I was trying to add a meta keyword argument to map_partitions, but this didn't help either.

Any ideas?

Besides: I'd assume some documentation should be added, right? Any wishes on what to put where?

@TomAugspurger (Member)

The AttributeError indicates that something is wrong with the metadata. Haven't looked to see if it's a dask bug or not.

However, given my earlier comment we may not need to worry about it. If you don't pass meta=meta to the X.map_partitions, then we'll just return an ndarray and everything will be fine.

@TomAugspurger (Member)

For documentation you can add it to the list of "scikit-learn clones" in doc/source/preprocessing.rst and to doc/source/modules/api.rst, again under preprocessing.

@datajanko (Contributor, Author) commented Sep 21, 2018

It seems that map_partitions has some issues if the function creates more columns than the dataframe originally had. Even providing meta information did not help. I'll post some details later. Though map_partitions somehow shows some values, it seems one would always have to call reset_index() afterwards to avoid errors. This seems a bit odd and unsafe.

It seems that falling back to map_blocks for dask arrays is a solution. (Edit: I think this statement is false.)

Additionally, I realized that map_blocks is (of course) not predicting the shape correctly (should add a test here as well). However, I just discovered that apply_gufunc has been implemented recently in dask. I'll have a look at that.
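For what it's worth, an untested sketch of the apply_gufunc idea; the signature string and output_sizes here are my assumptions about how a fitted transformer spf (as in the snippets below) could be applied row-wise:

import dask.array as da

XP = da.apply_gufunc(
    spf.transform,       # fitted sklearn PolynomialFeatures
    "(i)->(j)",          # i input features per row map to j output features
    X,
    output_dtypes=float,
    output_sizes={"j": len(spf.get_feature_names())},
)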

@datajanko (Contributor, Author) commented Sep 21, 2018

Okay, so I investigated a bit and two (maybe three) options unfold:

  1. Use map_partitions followed by reset_index
  2. Use map_blocks but compute the number of rows

I'm not sure which is computationally more expensive.

Set Up
import dask
from sklearn.preprocessing.data import PolynomialFeatures

from dask import array as da
from dask_ml.datasets import make_classification
import dask.dataframe as ddf

X, y = make_classification(chunks=3, n_samples=6, n_features=2)
df = X.to_dask_dataframe().rename(columns=str)
X_from_df = df.to_dask_array()

spf = PolynomialFeatures()
spf.fit(X.compute())

Essentially, the issue is that the composition of to_dask_array and to_dask_dataframe is not the identity function, and thus the divisions cannot be inferred for data frames:

print(X)
print(df.shape, df.divisions)

dask.array<normal, shape=(6, 2), dtype=float64, chunksize=(3, 2)>
(Delayed('int-97a26258-e448-4f93-830e-72985a66cde2'), 2) (0, 3, 5)

print('X:', X)
print("'id'(X):", X_from_df)
print("df:", df.shape, df.divisions) 
print("id(df):", df.to_dask_array().to_dask_dataframe().shape,  df.to_dask_array().to_dask_dataframe().divisions)

X: dask.array<normal, shape=(6, 2), dtype=float64, chunksize=(3, 2)>
'id'(X): dask.array<array, shape=(nan, 2), dtype=float64, chunksize=(nan, 2)>
df: (Delayed('int-756aa071-eaf0-4883-9bcf-383c630872a5'), 2) (0, 3, 5)
id(df): (Delayed('int-d7b1c932-daf6-44c6-943c-5de1ddd49297'), 2) (None, None, None)

For a dask array, we may provide the correct chunk sizes in the map_blocks function, and we obtain the correct output shapes. However, we also need to pass a dtype!

n_cols = len(spf.get_feature_names())
row_chunk_size = X.chunksize[0]
print(n_cols, row_chunk_size)

6 3

# wrong dimensions; need to set dtype
print(X.map_blocks(spf.transform, dtype='float'))
# the only accurate solution: necessary to know the row count!
print(X.map_blocks(spf.transform, dtype='float', chunks=(row_chunk_size, n_cols)))
# nan rows
print(X_from_df.map_blocks(spf.transform, dtype='float'))
# a random guess is also wrong
print(X_from_df.map_blocks(spf.transform, dtype='float', chunks=(10, n_cols)))
# shape: only computing reveals the correct shape!
print(X_from_df.map_blocks(spf.transform, dtype='float').compute().shape)

dask.array<transform, shape=(6, 2), dtype=float64, chunksize=(3, 2)>
dask.array<transform, shape=(6, 6), dtype=float64, chunksize=(3, 6)>
dask.array<transform, shape=(nan, 2), dtype=float64, chunksize=(nan, 2)>
dask.array<transform, shape=(20, 6), dtype=float64, chunksize=(10, 6)>
(6, 6)

Finally, the suggestion and issues for data frames:

# duplicate index -> reset_index will solve this
df_t = df.map_partitions(spf.transform).to_dask_dataframe(columns=spf.get_feature_names())
# Avoiding reset_index
X_t = X_from_df.map_blocks(spf.transform, dtype='float', chunks=(row_chunk_size, n_cols)).to_dask_dataframe(columns=spf.get_feature_names())
# Could use the dask graph and set the divisions manually, might be hard to do

print(df_t.index)
print('---------')
print(X_t.index)

Dask Index Structure:
npartitions=2
    int64
      ...
      ...
dtype: int64
Dask Name: from-dask, 12 tasks
---------
Dask Index Structure:
npartitions=2
0    int64
3      ...
5      ...
dtype: int64
Dask Name: from-dask, 14 tasks

We see that there is no index information for the map_partitions approach (a compute shows duplicate index values). Moreover, a third approach could be to use the dataframe constructor, but we would need to calculate the divisions.

So: which approach do you prefer? What is more expensive, resetting the index or calculating the number of rows? The casting to float also may not be optimal. One could do this a bit more intelligently and just choose the "biggest" (in terms of subclass) of bool, int and float, depending on which types appear.

Another issue is the assumption that the blocks of the dask arrays span all columns. If one wants to remove this assumption, it's probably best to rewrite the transformer completely from scratch.

Edit:

I just realized that reset_index will not provide a monotonically increasing index, see here. So a reset_index will only help after a call to compute. This means that I have to modify my tests when comparing to a pandas dataframe!

I also realized that chunks of the form (np.nan, n_cols) for the dask arrays provides the correct column shape, which should be sufficient. (I don't know why I didn't see this yesterday.)
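A quick sketch of that realization, reusing the names from the setup above (untested):

import numpy as np

XP = X_from_df.map_blocks(
    spf.transform, dtype="float", chunks=(np.nan, n_cols)
)
# shape (nan, 6): unknown row count, but the correct number of columns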

So, I'll try to provide a working solution by the end of this weekend.
Apologies for this extensive post ;)

@datajanko datajanko changed the title [WIP] Poly trans: Issue #347 [MRG] Poly trans: Issue #347 Sep 23, 2018
@datajanko (Contributor, Author)

Any comments on the last commit?


if isinstance(X, da.Array):
    n_cols = len(self._transformer.get_feature_names())
    # might raise an error if columns are separated by chunks
@TomAugspurger (Member):

Can you explain this error to me? Is the issue something like

X = da.random.uniform(size=(100, 10), chunks=(25, 5))

i.e. we have two blocks on the features?

Typically, we use dask_ml.utils.check_array to handle this. There's an accept_multiple_blocks keyword.

I think that we want to raise an exception in this case, since the transform would need to be different for the different blocks.
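Something like the following hedged sketch, with the keyword from the comment above (I haven't verified the full check_array signature):

from dask_ml.utils import check_array

def fit(self, X, y=None):
    # raises if the columns of a dask array are split across blocks
    X = check_array(X, accept_multiple_blocks=False)
    ...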

@datajanko (Contributor, Author):

Exactly; sorry for not raising an error here directly. Thanks for the hint on accept_multiple_blocks.

@datajanko (Contributor, Author):

Okay, added check_array. Btw, should I also use check_array for the dask dataframe?

@TomAugspurger (Member) left a comment:

Sorry, I missed the update.

elif isinstance(X, pd.DataFrame):
    data = X.pipe(self._transformer.transform)
    columns = self._transformer.get_feature_names(X.columns)
    XP = pd.DataFrame(data=data, columns=columns)
@TomAugspurger (Member):

Slight preference to match scikit-learn exactly here, since this is a clone.

If you want, you could add a keyword preserve_dataframe to control whether we should convert the array to a dataframe.
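For illustration, a sketch of how that keyword might thread through transform (everything except the preserve_dataframe name is an assumption):

def transform(self, X, y=None):
    if isinstance(X, pd.DataFrame):
        data = self._transformer.transform(X)
        if self.preserve_dataframe:
            columns = self._transformer.get_feature_names(X.columns)
            return pd.DataFrame(data=data, columns=columns, index=X.index)
        return data  # match scikit-learn: plain ndarray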

@datajanko (Contributor, Author):

Since this will force me to alter the doc-strings, I won't be able to tackle the issue before the weekend. I think I'll have enough time on Sunday.

elif isinstance(X, dd.DataFrame):
    data = X.map_partitions(self._transformer.transform)
    columns = self._transformer.get_feature_names(X.columns)
    XP = dd.from_dask_array(data, columns)
@TomAugspurger (Member):

Same comment on preserve_dataframe.

@datajanko datajanko changed the title [MRG] Poly trans: Issue #347 [WIP] Poly trans: Issue #347 Oct 3, 2018
@datajanko (Contributor, Author)

Any idea what's currently wrong with the build process? Doc-strings will be added soon.

@mrocklin (Member) commented Oct 4, 2018

I think that @jrbourbeau is handling this now at #382

@datajanko datajanko changed the title [WIP] Poly trans: Issue #347 [MRG] Poly trans: Issue #347 Oct 7, 2018
@datajanko (Contributor, Author) commented Oct 7, 2018

So what to do with the sklearn_dev error? Besides that, I think this should be mergeable right now. I already performed the rebase containing the fixes from PR #382.

@datajanko (Contributor, Author)

Another question: according to the contributing guidelines, dask transformers should have a columns keyword. Should this be added here as well? Maybe one could add a follow-up issue.

@TomAugspurger (Member)

> According to the contributing guidelines, dask transformers should have a columns keyword. Should this be added here as well?

No, I think that is unnecessary now that ColumnTransformer is in scikit-learn.

@TomAugspurger (Member)

Fixed a merge conflict and pushed. Hopefully the CI is passing now.

b = spp.PolynomialFeatures()

# pandas dataframe
res_pdf = a.fit_transform(df.compute()).values
@TomAugspurger (Member):

What's the behavior for a dask dataframe? Can you test that (if it isn't tested elsewhere)?

@TomAugspurger (Member):

Ah, I see test_daskdf_transform below. Any reason not to include that here? We could parametrize the test by daskify:

if not daskify:
    df = df.compute()

...

if daskify:
    assert dask.is_dask_collection(res_pdf)

assert_eq_ar(res_pdf, res_b)
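A hedged sketch of the full parametrized test (dpp, df, and the dask usage are borrowed from snippets elsewhere in this thread):

import dask
import pytest

@pytest.mark.parametrize("daskify", [True, False])
def test_df_transform(daskify):
    frame = df if daskify else df.compute()
    res = dpp.PolynomialFeatures().fit_transform(frame)
    if daskify:
        assert dask.is_dask_collection(res)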

@datajanko (Contributor, Author):

Originally, I didn't combine them because of the dataframe comparison in test_daskdf_transform. But I think using daskify will solve the issue.

assert_eq_ar(trans_dask_df.compute().values, res_b)
assert_eq_df(trans_dask_df.compute().reset_index(drop=True), res_pandas_df)

def test_df_dont_preserve_df(self):
@TomAugspurger (Member):

This test feels redundant. I believe that everything here is covered elsewhere.

@datajanko (Contributor, Author) commented Oct 8, 2018:

It feels redundant, but actually it's not: so far I haven't considered a df as an input with an array as an output. An earlier implementation didn't have that, as I was always returning a dataframe if the input was also a dataframe. I'll just add those tests in test_df_transform if I manage to daskify that. But maybe it's too cautious.

@datajanko (Contributor, Author) commented Oct 9, 2018

I just got a RuntimeError
RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'dask_ml.preprocessing.data.PolynomialFeatures'> with constructor (self, preserve_dataframe=False, *args, **kwargs) doesn't follow this convention.

Moreover, something like

X, y = make_classification(chunks=50)
df = X.to_dask_dataframe().rename(columns=str)
a = dpp.PolynomialFeatures(preserve_dataframe=True)
res_df = a.fit_transform(df)

seems to work, but inspecting

res_df

gives

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    398                         if cls is not object \
    399                                 and callable(cls.__dict__.get('__repr__')):
--> 400                             return _repr_pprint(obj, self, cycle)
    401 
    402             return _default_pprint(obj, self, cycle)

~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     for idx,output_line in enumerate(output.splitlines()):
    697         if idx:

~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/dask/dataframe/core.py in __repr__(self)
    392 
    393     def __repr__(self):
--> 394         data = self._repr_data.to_string(max_rows=5, show_dimensions=False)
    395         return """Dask {klass} Structure:
    396 {data}

~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/dask/dataframe/core.py in __getattr__(self, key)
   2518             return new_dd_object(merge(self.dask, dsk), name,
   2519                                  meta, self.divisions)
-> 2520         raise AttributeError("'DataFrame' object has no attribute %r" % key)
   2521 
   2522     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute '_repr_data'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/dask/dataframe/core.py in _repr_html_(self)
   3131 
   3132     def _repr_html_(self):
-> 3133         data = self._repr_data.to_html(max_rows=5,
   3134                                        show_dimensions=False, notebook=True)
   3135         return self._HTML_FMT.format(data=data, name=key_split(self._name),

~/anaconda3/envs/dask-ml-dev/lib/python3.6/site-packages/dask/dataframe/core.py in __getattr__(self, key)
   2518             return new_dd_object(merge(self.dask, dsk), name,
   2519                                  meta, self.divisions)
-> 2520         raise AttributeError("'DataFrame' object has no attribute %r" % key)
   2521
   2522     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute '_repr_data'

I thought I had resolved that error.

Edit:

I had something working in between; I don't know why it's not in anymore. It seems to be related to the meta argument. I will try to fix that issue. Not being able to inspect the result is not acceptable from my perspective. I will add an additional test concerning that issue.

@TomAugspurger (Member)

> I just got a RuntimeError
> RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'dask_ml.preprocessing.data.PolynomialFeatures'> with constructor (self, preserve_dataframe=False, *args, **kwargs) doesn't follow this convention.

Sorry I missed that. That's because you're using *args, **kwargs in PolynomialFeatures.__init__. We need to list out the keyword arguments explicitly.

@datajanko (Contributor, Author)

> Sorry I missed that. That's because you're using *args, **kwargs in PolynomialFeatures.__init__. We need to list out the keyword arguments explicitly.

Sorry for not mentioning that I'm aware of that. The next two days are a bit busy; I hope I'll find some time on Thursday or Friday to solve the last remaining issues and modify the tests accordingly.

Moreover, I'll add a test that ensures that a dataframe can be inspected.

@datajanko (Contributor, Author)

I found the "bug" in dask: dd.from_dask_array does not like duplicate column names. The test dataframe contains a column named 1, which is the same name as that of the bias column.
So a hot fix is to ensure that the columns of the dataframe don't contain 1, and maybe raise a warning. Other suggestions?

I'll check if dask already has an issue concerning that.
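A minimal sketch of that hot fix (the helper name and message are made up):

import warnings

def _warn_on_bias_collision(columns):
    # PolynomialFeatures names its bias column "1"
    if "1" in list(columns):
        warnings.warn(
            "Input column '1' collides with the bias column generated "
            "by PolynomialFeatures; consider renaming it."
        )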

@TomAugspurger (Member)

> So a hot fix is to ensure that the columns of the dataframe don't contain 1, and maybe raise a warning. Other suggestions?

That seems fine.

@mrocklin (Member) commented Oct 10, 2018 via email

@datajanko (Contributor, Author)

So I'd suggest that I try to solve the issue in dask, as @mrocklin suggested. If I'm not able to fix this quickly, I'd provide a solution with the hot fix. Would this be okay, or should we rather close this issue first with the hot fix and create a new task to remove the fix once dask has the "bug" fixed?

J42994 added 15 commits October 12, 2018 23:23
also works for numpy arrays
tests are provided for dask arrays with known shape
Additionally, a failing test for the same array with unknown chunk size and shape is provided
dask dataframes not working properly, I assume something about missing meta data,
A failing test is provided

Another example will be provided
this fixes the failing test
The shape of the output arrays/frames should be handled correctly.
added entry to docs/source/modules/api.rst
added entry to docs/source/preprocessing.rst
we prefer to raise an error if the dask array has blocks not covering the full width, as passing them to sklearn will not produce the correct output.
documentation still missing
with the newest version of dask, the representation bug is gone
condenses the tests and drops some duplicated tests
@datajanko (Contributor, Author) commented Oct 12, 2018

Any hints on the coverage bug?

@mrocklin (Member)

I think it's not actually coverage, I think that it's a flake issue. I recommend running flake8 on your codebase and seeing if anything comes up.

cc @TomAugspurger feedback on the black process ^^ It's not entirely clear to users what went wrong when things go wrong, or how to fix it.

@datajanko (Contributor, Author)

Hmm, I ran black and pushed the changes. Will check tomorrow morning again and push the update. Thanks for your comments.

@mrocklin (Member)

[black]
would reformat /root/project/dask_ml/preprocessing/__init__.py
All done! 💥 💔 💥
1 file would be reformatted, 92 files would be left unchanged.

@datajanko (Contributor, Author)

Damn, I rebased and did not run black again. Maybe it's too late today. Will push the changes tomorrow. Thanks again.

@datajanko (Contributor, Author)

Btw, flake8 was fine; the issue was only related to black. Hope everything is fine now. Coverage also seems to be OK/good enough.

@datajanko (Contributor, Author)

@TomAugspurger and @mrocklin, I think this could be merged right now. Any further comments? With my latest addition in dask, the representation bug is also gone. So assuming one has the nightly dask, this will work without issues. Any suggestions on how/if to make users aware of that?

@TomAugspurger (Member)

Looks good, thanks @datajanko.

Dask-ML tends to require fairly recent versions of dask. We'll likely bump the required version of dask shortly after the next release. But since it's just a bug in the repr I don't think we're in a rush.

@TomAugspurger TomAugspurger merged commit ac2fdb7 into dask:master Oct 14, 2018