Added Coarsen #2612
Conversation
Hello @fujiisoup! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 06, 2019 at 09:13 UTC
Or we could default to
@dcherian
Forgive me if this has an obvious answer: to what extent is this downsampling? Could this be done with
@max-sixty I think it's multi-dimensional resampling / rolling. resample & rolling operate one dimension at a time.
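To make that distinction concrete, here is a small sketch using the coarsen API as it was eventually merged (the array and window sizes are purely illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(16.0).reshape(4, 4), dims=['x', 'y'])

# rolling slides a window along one dimension and keeps the original shape
rolled = da.rolling(x=2).mean()       # shape stays (4, 4); first row is NaN

# coarsen block-reduces several dimensions at once
coarse = da.coarsen(x=2, y=2).mean()  # shape becomes (2, 2)
```

The key point of the comment above: rolling preserves shape and works per dimension, while coarsen reduces multiple dimensions simultaneously by integer position.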
xarray/core/rolling.py (Outdated)

    self.coordinate_func = coordinate_func

def __repr__(self):
    """provide a nice str repr of our rolling object"""
rolling -> coarsen
xarray/core/variable.py (Outdated)

    shape.append(variable.shape[i])
    dims.append(d)

return Variable(dims, variable.data.reshape(shape), self._attrs)
Is it worth making an actual xarray.Variable object here rather than just returning data? With the need to come up with new dimension names, I think it might be cleaner avoiding that.
xarray/core/rolling.py (Outdated)

@classmethod
def _reduce_method(cls, func):
    """
    Return a wrapped function for injecting numpy and bottoleneck methods.
bottoleneck -> bottleneck
xarray/core/common.py (Outdated)

    entries in the most right will be removed (if trim_excess is True).
    If right, coarsen windows ends at the most right entry, while
    excessed entries in the most left will be removed.
trim_excess : boolean, default False
It would be nice to have an API that lets us express at least three options:
- trim excess entries
- pad with NaN
- raise an error if the shape does not divide exactly (this is probably the safest default behavior)

Maybe a string valued keyword argument would work better here, e.g., boundary='trim', boundary='pad' and boundary='exact'?

I would also suggest putting this argument before side, since side is only used for a particular (non-default) value of this argument.
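For illustration, the three boundary behaviors proposed here ended up looking roughly like this in the merged API (a sketch; the array values and the exception type caught are my assumptions about the final behavior):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(5.0), dims='x')

trimmed = da.coarsen(x=2, boundary='trim').mean()  # drops the excess entry
padded = da.coarsen(x=2, boundary='pad').mean()    # pads the last window with NaN
try:
    da.coarsen(x=2, boundary='exact').mean()       # 5 is not divisible by 2
except ValueError:
    pass  # 'exact' (the default) raises when the shape does not divide evenly
```

Because mean skips NaN, the padded result's last entry is the mean of the single remaining value.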
Thanks. If this is multi-dimensional resampling / rolling, could we implement this functionality within those methods, enabling multiple dimensions? Potentially the implementations are different enough that the arguments don't have enough overlap?
@max-sixty we discussed this a little bit: #2525 (comment) The main difference is that resample is coordinate based, whereas this is integer position based, which makes the implementation considerably simpler.
Perfect, thanks for pointing that out (I've generally been a bit out of the loop recently...)
I do think it would be nice for
Updated.

    dataset.coarsen(self, dim=None, boundary='exact', side='left',
                    coordinate_func='mean', **dim_kwargs)

based on comments.
While writing the tests, I realized that the default The downside of this is that it can turn uniformly spaced data into inhomogeneously spaced data.
@pydata/xarray any opinions on what the default value for the Personally, I like
+1, on
doc/computation.rst (Outdated)

Coarsen large arrays
====================

``DataArray`` objects include a :py:meth:`~xarray.DataArray.coarsen` method.
Maybe "DataArray and Dataset"?
doc/computation.rst (Outdated)

.. ipython:: python

    da.coarsen(time=7, coordinate_func={'time': 'min'}).mean()
Should we abbreviate this as coord_func? We use that elsewhere in xarray.
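With the abbreviated name, the call from the docs reads like this (a sketch using made-up data, with the coord_func spelling that was ultimately adopted):

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.arange(14.0), dims='time',
                  coords={'time': pd.date_range('2000-01-01', periods=14)})
# label each 7-day block by its first timestamp rather than the mean time
weekly = da.coarsen(time=7, coord_func={'time': 'min'}).mean()
```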
doc/computation.rst (Outdated)

.. ipython:: python

    da = xr.DataArray(np.linspace(0, 364, num=364), dims='time',
What do you think about including a 2D example in the docs, e.g., a 4x4 array to 2x2? I expect that is closer to the typical use case.
Changed to a 2D example.
xarray/core/duck_array_ops.py (Outdated)

array = asarray(array)
if array.dtype.kind == 'M':
    offset = min(array)
    dtype = (array.ravel()[0] - offset).dtype  # dtype.kind == 'm'
This will fail for size 0 arrays. Maybe switch to another reduce method if the array is size 0, e.g., use min instead? The only important part is that it handles reducing the shape properly.
Fixed, but I am not yet sure it (the updated one) is the best solution for finding the appropriate dtype. See comment
xarray/core/duck_array_ops.py (Outdated)

if array.dtype.kind == 'M':
    offset = min(array)
    # infer the compatible timedelta dtype
    dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
This is just to find the corresponding timedelta from datetime. Is there any good function to find an appropriate dtype?
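For what it's worth, the trick in the quoted snippet amounts to the following standalone sketch (the array contents are made up; only the dtypes matter):

```python
import numpy as np

array = np.array(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]')
offset = array.min()
# subtracting a datetime64 scalar from a datetime64 array yields the
# corresponding timedelta64 dtype, without depending on the array's size
dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
```

Unlike indexing `array.ravel()[0]`, this also works for size-0 arrays, since no element is ever read.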
I could be missing something, but since xarray always coerces all NumPy dates to datetime64[ns], will the default results of datetime_to_numeric always have units of nanoseconds? In other words will this dtype always be timedelta64[ns]?
Thanks. I just realized we are always using [ns] for datetime. Updated.
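Since all NumPy dates are coerced to datetime64[ns], the offset trick under discussion can be sketched in plain NumPy like this (datetime_mean is a hypothetical helper name, not the function in the PR):

```python
import numpy as np

def datetime_mean(array):
    # xarray coerces NumPy dates to datetime64[ns], so the differences
    # below always have dtype timedelta64[ns], which does support mean
    offset = array.min()
    return (array - offset).mean() + offset

dates = np.array(['2000-01-01', '2000-01-05'], dtype='datetime64[ns]')
result = datetime_mean(dates)  # -> 2000-01-03
```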
_mean = _create_nan_agg_method('mean')


def mean(array, axis=None, skipna=None, **kwargs):
I would like to make this compatible with CFTimeIndex. @spencerkclark, could you comment on this?
Thanks @fujiisoup! I think something like the following would work:
from ..coding.times import format_cftime_datetime
from .common import contains_cftime_datetimes

def mean(array, axis=None, skipna=None, **kwargs):
    array = asarray(array)
    if array.dtype.kind == 'M':
        offset = min(array)
        # infer the compatible timedelta dtype
        dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
        return _mean(utils.datetime_to_numeric(array, offset), axis=axis,
                     skipna=skipna, **kwargs).astype(dtype) + offset
    elif contains_cftime_datetimes(xr.DataArray(array)):
        import cftime
        offset = min(array)
        numeric_dates = utils.datetime_to_numeric(xr.DataArray(array), offset,
                                                  datetime_unit='s').data
        mean_dates = _mean(numeric_dates, axis=axis, skipna=skipna, **kwargs)
        units = 'seconds since {}'.format(format_cftime_datetime(offset))
        calendar = offset.calendar
        return cftime.num2date(mean_dates, units=units, calendar=calendar,
                               only_use_cftime_datetimes=True)
    else:
        return _mean(array, axis=axis, skipna=skipna, **kwargs)
Ideally we would modify datetime_to_numeric and contains_cftime_datetimes to work with both pure NumPy or dask arrays of cftime objects as well as DataArrays (currently they only work with DataArrays), but that's the way things are coded right now. I could handle that in a follow-up if you would like.
Thanks, @spencerkclark
It would be nice if you could send a follow-up PR :)
Sure thing, I'd be happy to take care of making this compatible with cftime dates.
xarray/core/variable.py (Outdated)

else:
    raise TypeError(
        '{} is invalid for boundary. Valid option is \'exact\', '
        '\'trim\' and \'pad\''.format(boundary[d]))
this is possibly a little easier to read using double-quotes for the outer layer of quotation, e.g.,

    "{} is invalid for boundary. Valid option is 'exact', "
    "'trim' and 'pad'".format(boundary[d]))
xarray/tests/test_dataset.py (Outdated)

# should be no error
ds.isel(x=slice(0, 3 * (len(ds['x']) // 3))).coarsen(x=3).mean()

# raise if exact
delete
                     boundary='trim')
expected = self.cls(['x'], [1.5, 3.5])
assert_identical(actual, expected)
It would be good to add a test case that checks the values of a higher dimensional array, e.g., to verify that coarsening a 4x4 array to 2x2 gives the right result.
Fair comment. Thanks. Updated.
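Such a test might look like the following (a sketch, not the exact test added in the PR):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(16.0).reshape(4, 4), dims=['x', 'y'])
actual = da.coarsen(x=2, y=2).mean()
# each 2x2 block is averaged: e.g. mean of [[0, 1], [4, 5]] is 2.5
expected = xr.DataArray([[2.5, 4.5], [10.5, 12.5]], dims=['x', 'y'])
xr.testing.assert_identical(actual, expected)
```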
doc/computation.rst (Outdated)

.. ipython:: python

    da.coarsen(time=7, x=2, coordinate_func={'time': 'min'}).mean()
This should be coord_func, not coordinate_func.
This looks good to me, after these two doc fixes.
       364. ])
>>> Coordinates:
>>>   * time     (time) datetime64[ns] 1999-12-15 ... 2000-12-12
>>>
nit: the results here should not be prefaced with >>>.
doc/whats-new.rst (Outdated)

@@ -50,6 +50,10 @@ Enhancements
 - :py:class:`CFTimeIndex` uses slicing for string indexing when possible (like
   :py:class:`pandas.DatetimeIndex`), which avoids unnecessary copies.
   By `Stephan Hoyer <https://github.com/shoyer>`_
+- :py:meth:`~xarray.DataArray.coarsen` and
This now needs to move up to the section for 0.11.2. Also it would be nice to add a link to the new doc section "Coarsen large arrays".
Looks good to me!
I plan to merge this tomorrow unless anyone else has further suggestions.
This is (one more time) an extremely useful addition to xarray - thanks so much @fujiisoup!
Indeed, thank you @fujiisoup!
* upstream/master:
  - xfail cftimeindex multiindex test (pydata#2669)
  - DOC: refresh "Why xarray" and shorten top-level description (pydata#2657)
  - Remove broken Travis-CI builds (pydata#2661)
  - Type checking with mypy (pydata#2655)
  - Added Coarsen (pydata#2612)
* master:
  - Remove broken Travis-CI builds (pydata#2661)
  - Type checking with mypy (pydata#2655)
  - Added Coarsen (pydata#2612)
  - Improve test for GH 2649 (pydata#2654)
  - revise top-level package description (pydata#2430)
  - Convert ref_date to UTC in encode_cf_datetime (pydata#2651)
  - Change an `==` to an `is`. Fix tests so that this won't happen again. (pydata#2648)
  - ENH: switch Dataset and DataArray to use explicit indexes (pydata#2639)
  - Use pycodestyle for lint checks. (pydata#2642)
  - Switch whats-new for 0.11.2 -> 0.11.3
  - DOC: document v0.11.2 release
  - Use built-in interp for interpolation with resample (pydata#2640)
  - BUG: pytest-runner no required for setup.py (pydata#2643)
- whats-new.rst for all changes and api.rst for new API

Started to implement coarsen.

The API is currently something like

Currently, it is not working for a datetime coordinate, since mean does not work for this dtype, e.g.

I am not familiar with datetime things. Any advice will be appreciated.