
Added Coarsen #2612

Merged
merged 30 commits into from Jan 6, 2019

Conversation

@fujiisoup (Member) commented Dec 16, 2018

Started to implement coarsen.
The API is currently something like

    actual = ds.coarsen(time=2, x=3, side='right',
                        coordinate_func={'time': np.max}).max()

Currently, it does not work for a datetime coordinate, since mean does not work for this dtype, e.g.

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.linspace(0, 365, num=365),
                  dims='time', coords={'time': pd.date_range('15/12/1999', periods=365)})
da['time'].mean()    # -> TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')

I am not very familiar with datetime handling.
Any advice would be appreciated.

@pep8speaks commented Dec 16, 2018

Hello @fujiisoup! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 06, 2019 at 09:13 Hours UTC

@dcherian (Contributor):

(da.time - da.time[0]).mean() + da.time[0]

Or we could default to min() and add an loffset kwarg like resample (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
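
Spelling out why the one-liner works (a sketch; mean_time is an illustrative name, not from the PR):

    # datetime64 values cannot be added together, but timedelta64 values can,
    # so measure each timestamp relative to the first one, average those
    # timedeltas, and shift the result back to a datetime
    mean_time = (da.time - da.time[0]).mean() + da.time[0]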

@fujiisoup (Member, Author):

@dcherian Thanks.

    (da.time - da.time[0]).mean() + da.time[0]

I will add mean support for datetime arrays.

@max-sixty (Collaborator):

Forgive me if this has an obvious answer: to what extent is this downsampling? Could this be done with resample?

@dcherian (Contributor):

@max-sixty I think it's multi-dimensional resampling / rolling. resample & rolling operate one dimension at a time.

        self.coordinate_func = coordinate_func

    def __repr__(self):
        """provide a nice str repr of our rolling object"""
Member:

rolling -> coarsen

            shape.append(variable.shape[i])
            dims.append(d)

        return Variable(dims, variable.data.reshape(shape), self._attrs)
Member:

Is it worth making an actual xarray.Variable object here rather than just returning data? With the need to come up with new dimension names, I think it might be cleaner to avoid that.

    @classmethod
    def _reduce_method(cls, func):
        """
        Return a wrapped function for injecting numpy and bottoleneck methods.
Member:

bottoleneck -> bottleneck

entries at the right end will be removed (if trim_excess is True).
If right, coarsen windows end at the rightmost entry, while
excess entries at the left end will be removed.
trim_excess : boolean, default False
Member:

It would be nice to have an API that lets us express at least three options:

  • trim excess entries
  • pad with NaN
  • raise an error if the shape does not divide exactly (this is probably the safest default behavior)

Maybe a string valued keyword argument would work better here, e.g., boundary='trim', boundary='pad' and boundary='exact'?

I would also suggest putting this argument before side, since side is only used for a particular (non-default) value of this argument.
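
A rough sketch of what the three boundary modes would imply for a dimension of length n coarsened with window w (illustrative only, not the PR's implementation):

    def coarsened_length(n, w, boundary='exact'):
        # number of windows along one dimension under each boundary mode
        if boundary == 'exact':
            if n % w:
                raise ValueError('size %d is not a multiple of window %d' % (n, w))
            return n // w
        elif boundary == 'trim':
            return n // w        # excess entries are dropped
        elif boundary == 'pad':
            return -(-n // w)    # ceiling division; the last window is NaN-padded
        raise ValueError('invalid boundary: %r' % boundary)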

@max-sixty (Collaborator):

I think it's multi-dimensional resampling / rolling. resample & rolling operate one dimension at a time.

Thanks. If this is multi-dimensional resampling / rolling, could we implement this functionality within those methods, enabling multiple dimensions?

Potentially the implementations are different enough that the arguments don't have enough overlap?

@shoyer (Member) commented Dec 19, 2018

@max-sixty we discussed this a little bit: #2525 (comment)

The main difference is that resample is coordinate based, whereas this is integer position based which makes the implementation considerably simpler.
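
To illustrate the distinction (a sketch; both lines assume a DataArray da with a daily time coordinate):

    da.resample(time='7D').mean()   # label based: groups by 7-day spans of the coordinate
    da.coarsen(time=7).mean()       # position based: groups every 7 consecutive entries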

@max-sixty (Collaborator):

Perfect, thanks for pointing that out (I've generally been a bit out of the loop recently...)

@shoyer (Member) commented Dec 19, 2018

I do think it would be nice for resample() to work with numbers and along multiple dimensions, but that's definitely a bigger project.

@fujiisoup (Member, Author):

Updated.
The main changes are:

  1. API updated, based on the comments, to (usage sketch below):

         dataset.coarsen(self, dim=None, boundary='exact', side='left',
                         coordinate_func='mean', **dim_kwargs)

  2. nanmean for datetime arrays was implemented.
     This is somewhat backward incompatible: previously, we raised an error if mean was applied to a datetime array.
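
A usage sketch of the updated API (adapting the example from the original post; not test code from the PR):

    actual = ds.coarsen(time=2, x=3, boundary='trim', side='right',
                        coordinate_func={'time': 'max'}).max()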

@fujiisoup (Member, Author) commented Dec 20, 2018

While writing the tests, I realized that the default boundary='exact' is a little annoying.
It is rare for the length of the data to be exactly a multiple of the window size, so we would almost always need to type boundary='trim' or one of the other options.
I personally think boundary='pad' would also be a good default, as it behaves similarly to rolling.

The downside is that it can turn uniformly spaced data into inhomogeneously spaced data, which users may not expect.
(Probably boundary='exact' is safer.)

@shoyer (Member) commented Dec 21, 2018

@pydata/xarray any opinions on what the default value for the boundary argument to coarsen() should be?

Personally, I like boundary='exact', but I also work mostly with simulated data with dimensions set up to be exact powers of 2 :).

@jhamman (Member) commented Dec 21, 2018

+1 on 'exact'. I like the idea of making users be explicit about when to pad or trim edge cells.

@fujiisoup changed the title from "[WIP] Added Corsen" to "Added Coarsen" on Dec 24, 2018
Coarsen large arrays
====================

``DataArray`` objects include a :py:meth:`~xarray.DataArray.coarsen` method.
Member:

Maybe "DataArray and Dataset"?


.. ipython:: python

    da.coarsen(time=7, coordinate_func={'time': 'min'}).mean()
Member:

Should we abbreviate this as coord_func? We use that elsewhere in xarray.


.. ipython:: python

    da = xr.DataArray(np.linspace(0, 364, num=364), dims='time',
Member:

What do you think about including a 2D example in the docs, e.g., a 4x4 array to 2x2? I expect that is closer to the typical use case.

Member (Author):

Changed to a 2D example.

array = asarray(array)
if array.dtype.kind == 'M':
    offset = min(array)
    dtype = (array.ravel()[0] - offset).dtype  # dtype.kind == 'm'
Member:

This will fail for size 0 arrays. Maybe switch to another reduce method if the array is size 0, e.g., use min instead? The only important part is that it handles reducing the shape properly.
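
For concreteness, the failure mode being pointed out (a minimal standalone example, not PR code):

    import numpy as np
    min(np.array([], dtype='datetime64[ns]'))
    # ValueError: min() arg is an empty sequence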

Member (Author):

Fixed, but I am not yet sure the updated version is the best way to find the appropriate dtype. See the comment below.

if array.dtype.kind == 'M':
    offset = min(array)
    # infer the compatible timedelta dtype
    dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
Member (Author):

This is just to find the timedelta dtype corresponding to a datetime dtype. Is there a good function for finding the appropriate dtype?

Member:

I could be missing something, but since xarray always coerces all NumPy dates to datetime64[ns], will the default results of datetime_to_numeric always have units of nanoseconds? In other words will this dtype always be timedelta64[ns]?

Member (Author):

Thanks. I just realized we are always using [ns] for datetime. Updated.
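
A quick check of this (illustrative, not PR code):

    import numpy as np
    arr = np.array(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]')
    offset = arr.min()
    (np.empty((1,), dtype=arr.dtype) - offset).dtype   # dtype('<m8[ns]')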

_mean = _create_nan_agg_method('mean')


def mean(array, axis=None, skipna=None, **kwargs):
@fujiisoup (Member, Author) commented Dec 25, 2018:

I would like to make this compatible with CFTimeIndex. @spencerkclark, could you comment on this?

Member:

Thanks @fujiisoup! I think something like the following would work:

from ..coding.times import format_cftime_datetime
from .common import contains_cftime_datetimes


def mean(array, axis=None, skipna=None, **kwargs):
    array = asarray(array)
    if array.dtype.kind == 'M':
        offset = min(array)
        # infer the compatible timedelta dtype
        dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
        return _mean(utils.datetime_to_numeric(array, offset), axis=axis,
                     skipna=skipna, **kwargs).astype(dtype) + offset
    elif contains_cftime_datetimes(xr.DataArray(array)):
        import cftime
        offset = min(array)
        numeric_dates = utils.datetime_to_numeric(xr.DataArray(array), offset,
                                                  datetime_unit='s').data
        mean_dates = _mean(numeric_dates, axis=axis, skipna=skipna, **kwargs)
        units = 'seconds since {}'.format(format_cftime_datetime(offset))
        calendar = offset.calendar
        return cftime.num2date(mean_dates, units=units, calendar=calendar,
                               only_use_cftime_datetimes=True)
    else:
        return _mean(array, axis=axis, skipna=skipna, **kwargs)

Ideally we would modify datetime_to_numeric and contains_cftime_datetimes to work with both pure NumPy or dask arrays of cftime objects as well as DataArrays (currently they only work with DataArrays), but that's the way things are coded right now. I could handle that in a follow-up if you would like.

Member (Author):

Thanks, @spencerkclark

It would be nice if you could send a follow-up PR :)

Member:

Sure thing, I'd be happy to take care of making this compatible with cftime dates.

else:
    raise TypeError(
        '{} is invalid for boundary. Valid option is \'exact\', '
        '\'trim\' and \'pad\''.format(boundary[d]))
Member:

this is possibly a little easier to read using double-quotes for the outer layer of quotation, e.g.,

Suggested change:

    "{} is invalid for boundary. Valid option is 'exact', "
    "'trim' and 'pad'".format(boundary[d]))

# should be no error
ds.isel(x=slice(0, 3 * (len(ds['x']) // 3))).coarsen(x=3).mean()

# raise if exact
Member:

delete

boundary='trim')
expected = self.cls(['x'], [1.5, 3.5])
assert_identical(actual, expected)

Member:

It would be good to add a test case that checks the values of a higher dimensional array, e.g., to verify that coarsening a 4x4 array to 2x2 gives the right result.

Member (Author):

Fair comment. Thanks. Updated.
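
A test along those lines might look like this (a sketch, not the PR's actual test code):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])
    actual = da.coarsen(x=2, y=2).mean()
    expected = xr.DataArray([[2.5, 4.5], [10.5, 12.5]], dims=['x', 'y'])
    xr.testing.assert_identical(actual, expected)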


.. ipython:: python

    da.coarsen(time=7, x=2, coordinate_func={'time': 'min'}).mean()
Member:

This should be coord_func, not coordinate_func.

@shoyer (Member) left a comment:

This looks good to me, after these two doc fixes.

364. ])
>>> Coordinates:
>>> * time (time) datetime64[ns] 1999-12-15 ... 2000-12-12
>>>
Member:

nit: the results here should not be prefaced with >>>.

@@ -50,6 +50,10 @@ Enhancements
- :py:class:`CFTimeIndex` uses slicing for string indexing when possible (like
:py:class:`pandas.DatetimeIndex`), which avoids unnecessary copies.
By `Stephan Hoyer <https://github.com/shoyer>`_
- :py:meth:`~xarray.DataArray.coarsen` and
Member:

This now needs to move up to the section for 0.11.2. Also it would be nice to add a link to the new doc section "Coarsen large arrays".

@shoyer (Member) left a comment:

Looks good to me!

@shoyer (Member) commented Jan 4, 2019

I plan to merge this tomorrow unless anyone else has further suggestions.

@fmaussion (Member):

This is (one more time) an extremely useful addition to xarray - thanks so much @fujiisoup !

@shoyer shoyer merged commit ede3e01 into pydata:master Jan 6, 2019
@shoyer (Member) commented Jan 6, 2019

Indeed, thank you @fujiisoup !

dcherian pushed a commit to dcherian/xarray that referenced this pull request Jan 14, 2019
* upstream/master:
  xfail cftimeindex multiindex test (pydata#2669)
  DOC: refresh "Why xarray" and shorten top-level description (pydata#2657)
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
dcherian pushed a commit to yohai/xarray that referenced this pull request Jan 24, 2019
* master:
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
  Improve test for GH 2649 (pydata#2654)
  revise top-level package description (pydata#2430)
  Convert ref_date to UTC in encode_cf_datetime (pydata#2651)
  Change an `==` to an `is`. Fix tests so that this won't happen again. (pydata#2648)
  ENH: switch Dataset and DataArray to use explicit indexes (pydata#2639)
  Use pycodestyle for lint checks. (pydata#2642)
  Switch whats-new for 0.11.2 -> 0.11.3
  DOC: document v0.11.2 release
  Use built-in interp for interpolation with resample (pydata#2640)
  BUG: pytest-runner no required for setup.py (pydata#2643)