
Use getitem_with_mask in reindex_variables #1847

Merged: 9 commits into pydata:master, Feb 14, 2018

Conversation

@shoyer (Member) commented Jan 22, 2018

This is an internal refactor of reindexing/alignment to use Variable.getitem_with_mask.

As noted back in #1751 (comment), this gives a nice improvement for alignment with dask (~100x faster), but we are 2-3x slower in several cases with NumPy.
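For context, here is a minimal sketch of the building block this refactor uses: a -1 entry in the integer indexer marks a position with no match, and Variable._getitem_with_mask fills it with NaN (promoting the dtype if needed). The example values below are illustrative, not taken from the PR.

import numpy as np
import xarray as xr

v = xr.Variable('x', np.arange(3.0))
indexer = np.array([0, 2, -1])  # -1 marks a label that was not found
masked = v._getitem_with_mask(indexer)
# expected: a Variable with values [0., 2., nan]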

ASV results (smaller ratio is better):

    before     after       ratio
  [e31cf43e] [5830f2f8]
     4.85ms     4.86ms      1.00  reindexing.Reindex.time_1d_coarse
    98.15ms    98.97ms      1.01  reindexing.Reindex.time_1d_fine_all_found
+   96.88ms   210.71ms      2.17  reindexing.Reindex.time_1d_fine_some_missing
    24.47ms    25.18ms      1.03  reindexing.Reindex.time_2d_coarse
   433.26ms   437.19ms      1.01  reindexing.Reindex.time_2d_fine_all_found
+  245.20ms   711.36ms      2.90  reindexing.Reindex.time_2d_fine_some_missing
-   23.78ms    12.79ms      0.54  reindexing.Reindex.time_reindex_coarse
-  409.89ms   230.75ms      0.56  reindexing.Reindex.time_reindex_fine_all_found
+  233.41ms   369.48ms      1.58  reindexing.Reindex.time_reindex_fine_some_missing
    14.39ms    14.20ms      0.99  reindexing.ReindexDask.time_1d_coarse
   184.07ms   182.64ms      0.99  reindexing.ReindexDask.time_1d_fine_all_found
-     1.44s   277.03ms      0.19  reindexing.ReindexDask.time_1d_fine_some_missing
    95.49ms    94.49ms      0.99  reindexing.ReindexDask.time_2d_coarse
   910.11ms   916.47ms      1.01  reindexing.ReindexDask.time_2d_fine_all_found
     failed   997.33ms       n/a  reindexing.ReindexDask.time_2d_fine_some_missing

Note that reindexing.ReindexDask.time_2d_fine_some_missing timed out previously, which I think indicates that it took longer than 60 seconds.

  • Tests passed (for all non-documentation changes)
  • Passes git diff upstream/master **/*py | flake8 --diff (remove if you did not edit any Python files)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

@@ -63,7 +63,7 @@
 "netcdf4": [""],
 "scipy": [""],
 "bottleneck": ["", null],
-"dask": ["", null],
+"dask": [""],
@shoyer (Member, Author) commented on this diff:

I don't think we need to run benchmarks separately with/without dask. This just makes the whole suite take twice as long as necessary. We already need to write separate benchmark cases, given that the syntax for using dask is different.

@shoyer mentioned this pull request Jan 23, 2018
@jhamman (Member) left a comment:

@shoyer - other than the failing test on Travis, this seems like a nice cleanup/refactor. Nothing popped out at me as something that needs to be changed.

Once the tests are passing, ping me and @fujiisoup again, and we'll give it a final review.


def time_2d_fine_some_missing(self):
    self.ds.reindex(x=np.arange(0, 100, 0.5), y=np.arange(0, 100, 0.5),
                    method='nearest', tolerance=0.1).load()
Review comment (Member):

Do we have unit test coverage for this case with dask? I see it was failing in the ASV benchmark before the changes in alignment.py, so my guess is we don't have test coverage here.

Reply from @shoyer (Member, Author):

I'll double-check, but I think the test failure was just ASV timing out. When I ran a similar test case before (see my comment on the other PR), it was extremely slow.

@fujiisoup (Member) commented Feb 9, 2018

It looks like our where does not do type-casting (inheriting this behavior from numpy's where):

In [1]: import numpy as np
   ...: import xarray as xr
   ...: a = xr.Variable('x', np.array(['a', 'b', 'c']))

In [2]: a._getitem_with_mask([1, 2, -1])
Out[2]:
<xarray.Variable (x: 3)>
array(['b', 'c', 'nan'],
      dtype='<U32')

Here nan is converted to a string.

I guess we need to cast arrays manually in _getitem_with_mask before where is applied.

In [9]: dtype, fill_value = xr.core.dtypes.maybe_promote(a.dtype)
   ...: a.astype(dtype)._getitem_with_mask([1, 2, -1])
Out[9]: 
<xarray.Variable (x: 3)>
array(['b', 'c', nan], dtype=object)

@shoyer changed the title from "WIP: use getitem_with_mask in reindex_variables" to "Use getitem_with_mask in reindex_variables" on Feb 9, 2018
@shoyer (Member, Author) commented Feb 9, 2018

@fujiisoup Indeed, that seems to be the case. I made a fixed version of duck_array_ops.where(), relying on the new dtypes.result_type() function, which is basically a fixed version of np.result_type().

I believe this is ready for review now.
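For illustration only, a rough sketch of what a dtype-promoting where could look like. This is not xarray's actual duck_array_ops.where; the promotion rules here are simplified assumptions based on the cases discussed in this thread.

import numpy as np

def promoting_where(cond, x, y):
    # Promote to a dtype that can hold both inputs before calling np.where,
    # so that e.g. string data masked with NaN becomes object dtype instead
    # of having NaN stringified, and bools are not silently cast to floats.
    x, y = np.asarray(x), np.asarray(y)
    kinds = {x.dtype.kind, y.dtype.kind}
    if kinds in ({'U', 'f'}, {'S', 'f'}, {'b', 'f'}):
        dtype = np.dtype(object)
    else:
        dtype = np.result_type(x, y)
    return np.where(cond, x.astype(dtype), y.astype(dtype))

# e.g. promoting_where([True, True, False], np.array(['a', 'b', 'c']), np.nan)
# -> array(['a', 'b', nan], dtype=object)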

@fujiisoup (Member) left a comment:

One comment, but it already looks nice :)



def result_type(*arrays_and_dtypes):
    """Like np.result_type, but number + string -> object (not string).
Review comment (Member):

I think we should also take care of the bool-float pair:

In [1]: import numpy as np
   ...: import xarray as xr
   ...: 
   ...: x = xr.DataArray([True, False])
   ...: x.where(x)
   ...: 
Out[1]: 
<xarray.DataArray (dim_0: 2)>
array([ 1., nan])
Dimensions without coordinates: dim_0

@shoyer (Member, Author) commented Feb 11, 2018

I added fixes for (bool, float) -> object and (str, bytes) -> object, and also applied these to concatenate/stack as well as where in duck_array_ops.

From a user-facing perspective, this now fixes the result dtype of xarray.concat() for mixed dtypes as well.
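As a hedged illustration of that user-facing effect (the output is what I would expect after this change, not verified against this exact revision):

import xarray as xr

a = xr.DataArray(['a', 'b'], dims='x')
b = xr.DataArray([1, 2], dims='x')
result = xr.concat([a, b], dim='x')
# expected: result.dtype is object, i.e. array(['a', 'b', 1, 2], dtype=object),
# rather than the numbers being stringified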

@fujiisoup (Member) left a comment:

LGTM

@fujiisoup merged commit 2aa5b8a into pydata:master on Feb 14, 2018
@smithsp commented May 21, 2018

@shoyer (Member, Author) commented May 23, 2018

@smithsp that link doesn't work. Did you mean to refer to something else?

@shoyer deleted the getitem_with_mask_reindex branch on May 23, 2018
@smithsp commented May 23, 2018

@shoyer Thank you for inquiring about the link. The link is to a private repository (although we have never denied anyone who requests access); it may become public some day. We encountered a problem when switching to version 10 or 11 of xarray because of this fix, as we seemed to be combining strings and unicode in a concat operation. We have since fixed our specific bug, although we have not tracked down the exact combination of variables that led to upcasting to the object dtype.

@shoyer (Member, Author) commented May 23, 2018

@smithsp OK, I'd be happy to take a look if you're able to track it down into a self-contained example.

@smithsp commented May 23, 2018

I think that the behavior is consistent with expected upcasting (so no bug on your side), but we haven't figured out where we are using unicode vs. strings.
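For reference, a small illustration of the kind of upcasting being described, assuming the (str, bytes) -> object rule from this PR (expected output, not verified against that xarray version):

import xarray as xr

s = xr.DataArray(['a'], dims='x')    # unicode / str
b = xr.DataArray([b'a'], dims='x')   # bytes
combined = xr.concat([s, b], dim='x')
# expected: combined.dtype is object, because str and bytes promote to object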
