
Added Coarsen #2612

Merged
merged 30 commits into from Jan 6, 2019

Conversation

@fujiisoup (Member) commented Dec 16, 2018

Started to implement coarsen.
The API is currently something like

    actual = ds.coarsen(time=2, x=3, side='right',
                        coordinate_func={'time': np.max}).max()

Currently, it does not work for a datetime coordinate, since mean does not work for this dtype, e.g.

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.linspace(0, 365, num=365),
                  dims='time', coords={'time': pd.date_range('15/12/1999', periods=365)})
da['time'].mean()    # -> TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')

I am not very familiar with datetime handling.
Any advice would be appreciated.

@pep8speaks commented Dec 16, 2018

Hello @fujiisoup! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 06, 2019 at 09:13 Hours UTC

@dcherian (Contributor):

(da.time - da.time[0]).mean() + da.time[0]

Or we could default to min() and add an loffset kwarg like resample (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
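
Spelling out why the one-liner works (a sketch; mean_time is an illustrative name, not from the PR):

    # datetime64 values cannot be added together, but timedelta64 values can,
    # so measure each timestamp relative to the first one, average those
    # timedeltas, and shift the result back to a datetime
    mean_time = (da.time - da.time[0]).mean() + da.time[0]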

@fujiisoup (Member, Author):

@dcherian Thanks.

    (da.time - da.time[0]).mean() + da.time[0]

I will add mean support for datetime arrays.

@max-sixty (Collaborator):

Forgive me if this has an obvious answer: to what extent is this downsampling? Could this be done with resample?

@dcherian (Contributor):

@max-sixty I think it's multi-dimensional resampling / rolling. resample & rolling operate one dimension at a time.

        self.coordinate_func = coordinate_func

    def __repr__(self):
        """provide a nice str repr of our rolling object"""
Member:

rolling -> coarsen

            shape.append(variable.shape[i])
            dims.append(d)

        return Variable(dims, variable.data.reshape(shape), self._attrs)
Member:

Is it worth making an actual xarray.Variable object here rather than just returning data? With the need to come up with new dimension names, I think it might be cleaner to avoid that.

    @classmethod
    def _reduce_method(cls, func):
        """
        Return a wrapped function for injecting numpy and bottoleneck methods.
Member:

bottoleneck -> bottleneck

entries at the right end will be removed (if trim_excess is True).
If right, coarsen windows end at the rightmost entry, while
excess entries at the left end will be removed.
trim_excess : boolean, default False
Member:

It would be nice to have an API that lets us express at least three options:

  • trim excess entries
  • pad with NaN
  • raise an error if the shape does not divide exactly (this is probably the safest default behavior)

Maybe a string valued keyword argument would work better here, e.g., boundary='trim', boundary='pad' and boundary='exact'?

I would also suggest putting this argument before side, since side is only used for a particular (non-default) value of this argument.
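
A rough sketch of what the three boundary modes would imply for a dimension of length n coarsened with window w (illustrative only, not the PR's implementation):

    def coarsened_length(n, w, boundary='exact'):
        # number of windows along one dimension under each boundary mode
        if boundary == 'exact':
            if n % w:
                raise ValueError('size %d is not a multiple of window %d' % (n, w))
            return n // w
        elif boundary == 'trim':
            return n // w        # excess entries are dropped
        elif boundary == 'pad':
            return -(-n // w)    # ceiling division; the last window is NaN-padded
        raise ValueError('invalid boundary: %r' % boundary)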

@max-sixty (Collaborator):

I think it's multi-dimensional resampling / rolling. resample & rolling operate one dimension at a time.

Thanks. If this is multi-dimensional resampling / rolling, could we implement this functionality within those methods, enabling multiple dimensions?

Potentially the implementations are different enough that the arguments don't have enough overlap?

@shoyer (Member) commented Dec 19, 2018

@max-sixty we discussed this a little bit: #2525 (comment)

The main difference is that resample is coordinate based, whereas this is integer position based which makes the implementation considerably simpler.
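
To illustrate the distinction (a sketch; both lines assume a DataArray da with a daily time coordinate):

    da.resample(time='7D').mean()   # label based: groups by 7-day spans of the coordinate
    da.coarsen(time=7).mean()       # position based: groups every 7 consecutive entries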

@max-sixty (Collaborator):

Perfect, thanks for pointing that out (I've generally been a bit out of the loop recently...)

@shoyer (Member) commented Dec 19, 2018

I do think it would be nice for resample() to work with numbers and along multiple dimensions, but that's definitely a bigger project.

@fujiisoup (Member, Author):

Updated.
The main changes are:

  1. API updated, based on the comments, to (usage sketch below):

         dataset.coarsen(self, dim=None, boundary='exact', side='left',
                         coordinate_func='mean', **dim_kwargs)

  2. nanmean for datetime arrays was implemented.
     This is somewhat backward incompatible: previously, we raised an error if mean was applied to a datetime array.
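
A usage sketch of the updated API (adapting the example from the original post; not test code from the PR):

    actual = ds.coarsen(time=2, x=3, boundary='trim', side='right',
                        coordinate_func={'time': 'max'}).max()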

@fujiisoup (Member, Author) commented Dec 20, 2018

While writing the tests, I realized that the default boundary='exact' is a little annoying.
It is rare for the length of the data to be exactly a multiple of the window size, so we would almost always need to type boundary='trim' or one of the other options.
I personally think boundary='pad' would also be a good default, as it behaves similarly to rolling.

The downside is that it can turn uniformly spaced data into inhomogeneously spaced data, which users may not expect.
(Probably boundary='exact' is safer.)

@shoyer (Member) commented Dec 21, 2018

@pydata/xarray any opinions on what the default value for the boundary argument to coarsen() should be?

Personally, I like boundary='exact', but I also work mostly with simulated data with dimensions set up to be exact powers of 2 :).

@jhamman (Member) commented Dec 21, 2018

+1 on 'exact'. I like the idea of making users be explicit about when to pad or trim edge cells.

@fujiisoup changed the title from "[WIP] Added Corsen" to "Added Coarsen" on Dec 24, 2018
Coarsen large arrays
====================

``DataArray`` objects include a :py:meth:`~xarray.DataArray.coarsen` method.
Member:

Maybe "DataArray and Dataset"?


.. ipython:: python

    da.coarsen(time=7, coordinate_func={'time': 'min'}).mean()
Member:

Should we abbreviate this as coord_func? We use that elsewhere in xarray.


.. ipython:: python

    da = xr.DataArray(np.linspace(0, 364, num=364), dims='time',
Member:

What do you think about including a 2D example in the docs, e.g., a 4x4 array to 2x2? I expect that is closer to the typical use case.

Member (Author):

Changed to a 2D example.

array = asarray(array)
if array.dtype.kind == 'M':
    offset = min(array)
    dtype = (array.ravel()[0] - offset).dtype  # dtype.kind == 'm'
Member:

This will fail for size 0 arrays. Maybe switch to another reduce method if the array is size 0, e.g., use min instead? The only important part is that it handles reducing the shape properly.
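
For concreteness, the failure mode being pointed out (a minimal standalone example, not PR code):

    import numpy as np
    min(np.array([], dtype='datetime64[ns]'))
    # ValueError: min() arg is an empty sequence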

Member (Author):

Fixed, but I am not yet sure the updated version is the best way to find the appropriate dtype. See the comment below.

if array.dtype.kind == 'M':
    offset = min(array)
    # infer the compatible timedelta dtype
    dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
Member (Author):

This is just to find the timedelta dtype corresponding to a datetime dtype. Is there a good function for finding the appropriate dtype?

Member:

I could be missing something, but since xarray always coerces all NumPy dates to datetime64[ns], will the default results of datetime_to_numeric always have units of nanoseconds? In other words will this dtype always be timedelta64[ns]?

Member (Author):

Thanks. I just realized we are always using [ns] for datetime. Updated.
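
A quick check of this (illustrative, not PR code):

    import numpy as np
    arr = np.array(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]')
    offset = arr.min()
    (np.empty((1,), dtype=arr.dtype) - offset).dtype   # dtype('<m8[ns]')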

_mean = _create_nan_agg_method('mean')


def mean(array, axis=None, skipna=None, **kwargs):
@fujiisoup (Member, Author) commented Dec 25, 2018:

I would like to make this compatible with CFTimeIndex. @spencerkclark, could you comment on this?

Member:

Thanks @fujiisoup! I think something like the following would work:

from ..coding.times import format_cftime_datetime
from .common import contains_cftime_datetimes


def mean(array, axis=None, skipna=None, **kwargs):
    array = asarray(array)
    if array.dtype.kind == 'M':
        offset = min(array)
        # infer the compatible timedelta dtype
        dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
        return _mean(utils.datetime_to_numeric(array, offset), axis=axis,
                     skipna=skipna, **kwargs).astype(dtype) + offset
    elif contains_cftime_datetimes(xr.DataArray(array)):
        import cftime
        offset = min(array)
        numeric_dates = utils.datetime_to_numeric(xr.DataArray(array), offset,
                                                  datetime_unit='s').data
        mean_dates = _mean(numeric_dates, axis=axis, skipna=skipna, **kwargs)
        units = 'seconds since {}'.format(format_cftime_datetime(offset))
        calendar = offset.calendar
        return cftime.num2date(mean_dates, units=units, calendar=calendar,
                               only_use_cftime_datetimes=True)
    else:
        return _mean(array, axis=axis, skipna=skipna, **kwargs)

Ideally we would modify datetime_to_numeric and contains_cftime_datetimes to work with both pure NumPy or dask arrays of cftime objects as well as DataArrays (currently they only work with DataArrays), but that's the way things are coded right now. I could handle that in a follow-up if you would like.

Member (Author):

Thanks, @spencerkclark

It would be nice if you could send a follow-up PR :)

Member:

Sure thing, I'd be happy to take care of making this compatible with cftime dates.

else:
    raise TypeError(
        '{} is invalid for boundary. Valid option is \'exact\', '
        '\'trim\' and \'pad\''.format(boundary[d]))
Member:

this is possibly a little easier to read using double-quotes for the outer layer of quotation, e.g.,

Suggested change:

    "{} is invalid for boundary. Valid option is 'exact', "
    "'trim' and 'pad'".format(boundary[d]))

# should be no error
ds.isel(x=slice(0, 3 * (len(ds['x']) // 3))).coarsen(x=3).mean()

# raise if exact
Member:

delete

boundary='trim')
expected = self.cls(['x'], [1.5, 3.5])
assert_identical(actual, expected)

Member:

It would be good to add a test case that checks the values of a higher dimensional array, e.g., to verify that coarsening a 4x4 array to 2x2 gives the right result.

Member (Author):

Fair comment. Thanks. Updated.
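
A test along those lines might look like this (a sketch, not the PR's actual test code):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])
    actual = da.coarsen(x=2, y=2).mean()
    expected = xr.DataArray([[2.5, 4.5], [10.5, 12.5]], dims=['x', 'y'])
    xr.testing.assert_identical(actual, expected)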


.. ipython:: python

    da.coarsen(time=7, x=2, coordinate_func={'time': 'min'}).mean()
Member:

This should be coord_func, not coordinate_func.

@shoyer (Member) left a comment:

This looks good to me, after these two doc fixes.

364. ])
>>> Coordinates:
>>> * time (time) datetime64[ns] 1999-12-15 ... 2000-12-12
>>>
Member:

nit: the results here should not be prefaced with >>>.

@@ -50,6 +50,10 @@ Enhancements
- :py:class:`CFTimeIndex` uses slicing for string indexing when possible (like
:py:class:`pandas.DatetimeIndex`), which avoids unnecessary copies.
By `Stephan Hoyer <https://github.com/shoyer>`_
- :py:meth:`~xarray.DataArray.coarsen` and
Member:

This now needs to move up to the section for 0.11.2. Also it would be nice to add a link to the new doc section "Coarsen large arrays".

@shoyer (Member) left a comment:

Looks good to me!

@shoyer (Member) commented Jan 4, 2019

I plan to merge this tomorrow unless anyone else has further suggestions.

@fmaussion (Member):

This is (one more time) an extremely useful addition to xarray - thanks so much @fujiisoup !

@shoyer shoyer merged commit ede3e01 into pydata:master Jan 6, 2019
@shoyer (Member) commented Jan 6, 2019

Indeed, thank you @fujiisoup !

dcherian pushed a commit to dcherian/xarray that referenced this pull request Jan 14, 2019
* upstream/master:
  xfail cftimeindex multiindex test (pydata#2669)
  DOC: refresh "Why xarray" and shorten top-level description (pydata#2657)
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
dcherian pushed a commit to yohai/xarray that referenced this pull request Jan 24, 2019
* master:
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
  Improve test for GH 2649 (pydata#2654)
  revise top-level package description (pydata#2430)
  Convert ref_date to UTC in encode_cf_datetime (pydata#2651)
  Change an `==` to an `is`. Fix tests so that this won't happen again. (pydata#2648)
  ENH: switch Dataset and DataArray to use explicit indexes (pydata#2639)
  Use pycodestyle for lint checks. (pydata#2642)
  Switch whats-new for 0.11.2 -> 0.11.3
  DOC: document v0.11.2 release
  Use built-in interp for interpolation with resample (pydata#2640)
  BUG: pytest-runner no required for setup.py (pydata#2643)