Added Coarsen #2612

Merged: 30 commits, Jan 6, 2019
Changes shown from 21 commits.

Commits (30)
3525b9c
Added variable.coarsen
fujiisoup Dec 15, 2018
5ff3102
Added DataArray.coarsen and Dataset.coarsen
fujiisoup Dec 16, 2018
6f3cf0c
pep8
fujiisoup Dec 16, 2018
f1f4804
a bugfix for mpa3
fujiisoup Dec 17, 2018
ab5d2f6
Support mean for datatime dtype
fujiisoup Dec 17, 2018
9123fd4
nanmean for DateTime
fujiisoup Dec 17, 2018
c85d18a
API updatedd via comments
fujiisoup Dec 20, 2018
0aa7a37
bug fix in tests
fujiisoup Dec 20, 2018
b656d62
updated docs
fujiisoup Dec 20, 2018
2ffcb23
Merge branch 'master' into corsen
fujiisoup Dec 20, 2018
04773eb
use pd.isnull rather than isnat
fujiisoup Dec 21, 2018
b33020b
support Variable in datetime_to_numeric
fujiisoup Dec 21, 2018
b13af18
use pd.isnull instead of numpy.isnat in test
fujiisoup Dec 21, 2018
24f3061
Merge branch 'master' into corsen
fujiisoup Dec 24, 2018
d806c96
Added an example to doc.
fujiisoup Dec 24, 2018
96bf29b
coordinate_func -> coord_func. Support 0d-array mean with datetime
fujiisoup Dec 24, 2018
b70996a
Added an two dimensional example
fujiisoup Dec 24, 2018
827794e
flake8
fujiisoup Dec 24, 2018
a354005
Merge branch 'master' into corsen
fujiisoup Dec 25, 2018
82c08af
flake8
fujiisoup Dec 25, 2018
d73d1d5
a potential bug fix
fujiisoup Dec 25, 2018
a92c431
Update via comments
fujiisoup Dec 30, 2018
0e53c7b
Always use datetime64[ns] in mean
fujiisoup Dec 30, 2018
07b8060
Added tests for 2d coarsen with value check
fujiisoup Dec 31, 2018
aa41f39
update via comment
fujiisoup Jan 3, 2019
4c347af
Merge branch 'master' into corsen
fujiisoup Jan 3, 2019
2a06b05
whats new
fujiisoup Jan 3, 2019
50fa6aa
Merge branch 'master' into corsen
fujiisoup Jan 3, 2019
1d04bdd
typo fix
fujiisoup Jan 4, 2019
1523292
Merge branch 'master' into corsen
shoyer Jan 6, 2019
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -147,6 +147,7 @@ Computation
Dataset.groupby
Dataset.groupby_bins
Dataset.rolling
Dataset.coarsen
Dataset.resample
Dataset.diff
Dataset.quantile
@@ -312,6 +313,7 @@ Computation
DataArray.groupby
DataArray.groupby_bins
DataArray.rolling
DataArray.coarsen
DataArray.dt
DataArray.resample
DataArray.get_axis_num
39 changes: 39 additions & 0 deletions doc/computation.rst
@@ -199,6 +199,45 @@ You can also use ``construct`` to compute a weighted rolling sum:
To avoid this, use ``skipna=False`` as in the above example.


Coarsen large arrays
====================

``DataArray`` and ``Dataset`` objects include
:py:meth:`~xarray.DataArray.coarsen` and :py:meth:`~xarray.Dataset.coarsen`
methods, which support block aggregation along multiple dimensions,

.. ipython:: python

x = np.linspace(0, 10, 300)
t = pd.date_range('15/12/1999', periods=364)
da = xr.DataArray(np.sin(x) * np.cos(np.linspace(0, 1, 364)[:, np.newaxis]),
dims=['time', 'x'], coords={'time': t, 'x': x})
da

To take a block mean over every 7 days along the ``time`` dimension and
every 2 points along the ``x`` dimension,

.. ipython:: python

da.coarsen(time=7, x=2).mean()

:py:meth:`~xarray.DataArray.coarsen` raises a ``ValueError`` if the data
length is not a multiple of the corresponding window size.
Use ``boundary='trim'`` to drop the excess entries, or ``boundary='pad'``
to pad the insufficient entries with ``nan``,

.. ipython:: python

da.coarsen(time=30, x=2, boundary='trim').mean()

To apply a specific function to a coordinate, pass the function or
function name to the ``coordinate_func`` option,

.. ipython:: python

da.coarsen(time=7, x=2, coordinate_func={'time': 'min'}).mean()
Review comment (Member): This should be coord_func, not coordinate_func.



Computation using Coordinates
=============================

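As a rough sketch of what the block aggregation in the documentation above computes, the same result can be obtained in plain NumPy by splitting each axis into (n_blocks, window) pairs and averaging over the window axes. The `block_mean` helper and the array shapes here are illustrative stand-ins, not xarray API:

```python
import numpy as np

def block_mean(a, windows):
    # Mean over non-overlapping blocks; each axis length must be an exact
    # multiple of its window, mirroring coarsen's boundary='exact' behaviour.
    new_shape = []
    for size, win in zip(a.shape, windows):
        if size % win != 0:
            raise ValueError("dimension size is not a multiple of window size")
        new_shape.extend([size // win, win])
    reshaped = a.reshape(new_shape)
    # after reshaping, the window axes sit at the odd positions (1, 3, ...)
    return reshaped.mean(axis=tuple(range(1, reshaped.ndim, 2)))

data = np.arange(56.0).reshape(14, 4)   # 14 "time" steps by 4 "x" points
coarse = block_mean(data, (7, 2))       # shape (2, 2)
```

Changing a window so it no longer divides the axis length evenly raises the same kind of ``ValueError`` the docs describe for ``boundary='exact'``.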
4 changes: 4 additions & 0 deletions doc/whats-new.rst
@@ -50,6 +50,10 @@ Enhancements
- :py:class:`CFTimeIndex` uses slicing for string indexing when possible (like
:py:class:`pandas.DatetimeIndex`), which avoids unnecessary copies.
By `Stephan Hoyer <https://github.com/shoyer>`_
- :py:meth:`~xarray.DataArray.coarsen` and
:py:meth:`~xarray.Dataset.coarsen` are newly added.
(:issue:`2525`)
By `Keisuke Fujii <https://github.com/fujiisoup>`_.

Review comment (Member): This now needs to move up to the section for 0.11.2. Also it would be nice to add a link to the new doc section "Coarsen large arrays".
- Enable passing ``rasterio.io.DatasetReader`` or ``rasterio.vrt.WarpedVRT`` to
``open_rasterio`` instead of file path string. Allows for in-memory
reprojection, see (:issue:`2588`).
59 changes: 59 additions & 0 deletions xarray/core/common.py
@@ -590,6 +590,65 @@ def rolling(self, dim=None, min_periods=None, center=False, **dim_kwargs):
return self._rolling_cls(self, dim, min_periods=min_periods,
center=center)

def coarsen(self, dim=None, boundary='exact', side='left',
coord_func='mean', **dim_kwargs):
"""
Coarsen object.

Parameters
----------
dim : dict, optional
Mapping from the dimension name to the window size
(e.g. ``{'time': 7}``).
boundary : 'exact' | 'trim' | 'pad'
If 'exact', a ValueError will be raised if dimension size is not a
multiple of the window size. If 'trim', the excess entries are
dropped. If 'pad', NA will be padded.
side : 'left' or 'right' or mapping from dimension to 'left' or 'right'
coord_func : function (name) that is applied to the coordinates,
or a mapping from coordinate name to function (name).

Returns
-------
Coarsen object (core.rolling.DataArrayCoarsen for DataArray,
core.rolling.DatasetCoarsen for Dataset.)

Examples
--------
Coarsen the long time series by averaging over every four days.

>>> da = xr.DataArray(np.linspace(0, 364, num=364),
... dims='time',
... coords={'time': pd.date_range(
... '15/12/1999', periods=364)})
>>> da
>>> <xarray.DataArray (time: 364)>
>>> array([ 0. , 1.002755, 2.00551 , ..., 362.997245,
364. ])
>>> Coordinates:
>>> * time (time) datetime64[ns] 1999-12-15 ... 2000-12-12
>>>
Review comment (Member): nit: the results here should not be prefaced with >>>.

>>> da.coarsen(time=4).mean()
>>> <xarray.DataArray (time: 91)>
>>> array([ 1.504132, 5.515152, 9.526171, 13.53719 , ...,
>>> 362.495868])
>>> Coordinates:
>>> * time (time) datetime64[ns] 1999-12-16T12:00:00 ...

See Also
--------
core.rolling.DataArrayCoarsen
core.rolling.DatasetCoarsen
"""
dim = either_dict_or_kwargs(dim, dim_kwargs, 'coarsen')
return self._coarsen_cls(
self, dim, boundary=boundary, side=side,
coord_func=coord_func)

def resample(self, indexer=None, skipna=None, closed=None, label=None,
base=0, keep_attrs=None, loffset=None, **indexer_kwargs):
"""Returns a Resample object for performing resampling operations.
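The ``coarsen`` method above forwards its arguments through ``either_dict_or_kwargs``. A hypothetical re-implementation of that helper (the exact signature and error message in xarray may differ) shows how ``da.coarsen({'time': 7})`` and ``da.coarsen(time=7)`` become equivalent:

```python
def either_dict_or_kwargs(pos_kwargs, kw_kwargs, func_name):
    # Accept the dimension mapping either positionally or as keyword
    # arguments, but not both at once.
    if pos_kwargs is None:
        return kw_kwargs
    if kw_kwargs:
        raise ValueError("cannot specify both keyword and positional "
                         "arguments to .%s" % func_name)
    return pos_kwargs

# both call styles reduce to the same mapping
a = either_dict_or_kwargs({'time': 7, 'x': 2}, {}, 'coarsen')
b = either_dict_or_kwargs(None, {'time': 7, 'x': 2}, 'coarsen')
```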
1 change: 1 addition & 0 deletions xarray/core/dataarray.py
@@ -159,6 +159,7 @@ class DataArray(AbstractArray, DataWithCoords):
"""
_groupby_cls = groupby.DataArrayGroupBy
_rolling_cls = rolling.DataArrayRolling
_coarsen_cls = rolling.DataArrayCoarsen
_resample_cls = resample.DataArrayResample

dt = property(DatetimeAccessor)
1 change: 1 addition & 0 deletions xarray/core/dataset.py
@@ -316,6 +316,7 @@ class Dataset(Mapping, ImplementsDatasetReduce, DataWithCoords,
"""
_groupby_cls = groupby.DatasetGroupBy
_rolling_cls = rolling.DatasetRolling
_coarsen_cls = rolling.DatasetCoarsen
_resample_cls = resample.DatasetResample

def __init__(self, data_vars=None, coords=None, attrs=None,
23 changes: 20 additions & 3 deletions xarray/core/duck_array_ops.py
@@ -13,7 +13,7 @@
import numpy as np
import pandas as pd

from . import dask_array_ops, dtypes, npcompat, nputils
from . import dask_array_ops, dtypes, npcompat, nputils, utils
from .nputils import nanfirst, nanlast
from .pycompat import dask_array_type

@@ -261,8 +261,6 @@ def f(values, axis=None, skipna=None, **kwargs):
sum = _create_nan_agg_method('sum')
sum.numeric_only = True
sum.available_min_count = True
mean = _create_nan_agg_method('mean')
mean.numeric_only = True
std = _create_nan_agg_method('std')
std.numeric_only = True
var = _create_nan_agg_method('var')
@@ -278,6 +276,25 @@ def f(values, axis=None, skipna=None, **kwargs):
cumsum_1d.numeric_only = True


_mean = _create_nan_agg_method('mean')


def mean(array, axis=None, skipna=None, **kwargs):
Review comment (fujiisoup, Member, Author, Dec 25, 2018): I would like to make this compatible with the CFTime index. @spencerkclark, could you comment on this?

Review comment (spencerkclark, Member): Thanks @fujiisoup! I think something like the following would work:

from ..coding.times import format_cftime_datetime
from .common import contains_cftime_datetimes


def mean(array, axis=None, skipna=None, **kwargs):
    array = asarray(array)
    if array.dtype.kind == 'M':
        offset = min(array)
        # infer the compatible timedelta dtype
        dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
        return _mean(utils.datetime_to_numeric(array, offset), axis=axis,
                     skipna=skipna, **kwargs).astype(dtype) + offset
    elif contains_cftime_datetimes(xr.DataArray(array)):
        import cftime
        offset = min(array)
        numeric_dates = utils.datetime_to_numeric(xr.DataArray(array), offset,
                                                  datetime_unit='s').data
        mean_dates = _mean(numeric_dates, axis=axis, skipna=skipna, **kwargs)
        units = 'seconds since {}'.format(format_cftime_datetime(offset))
        calendar = offset.calendar
        return cftime.num2date(mean_dates, units=units, calendar=calendar,
                               only_use_cftime_datetimes=True)
    else:
        return _mean(array, axis=axis, skipna=skipna, **kwargs)

Ideally we would modify datetime_to_numeric and contains_cftime_datetimes to work with both pure NumPy or dask arrays of cftime objects as well as DataArrays (currently they only work with DataArrays), but that's the way things are coded right now. I could handle that in a follow-up if you would like.

Review comment (fujiisoup, Member, Author): Thanks, @spencerkclark. It would be nice if you could send a follow-up PR :)

Review comment (spencerkclark, Member): Sure thing, I'd be happy to take care of making this compatible with cftime dates.
""" inhouse mean that can handle datatime dtype """
array = asarray(array)
if array.dtype.kind == 'M':
offset = min(array)
# infer the compatible timedelta dtype
dtype = (np.empty((1,), dtype=array.dtype) - offset).dtype
Review comment (fujiisoup, Member, Author): This is just to find the corresponding timedelta from datetime. Is there any good function to find an appropriate dtype?

Review comment (spencerkclark, Member): I could be missing something, but since xarray always coerces all NumPy dates to datetime64[ns], will the default results of datetime_to_numeric always have units of nanoseconds? In other words, will this dtype always be timedelta64[ns]?

Review comment (fujiisoup, Member, Author): Thanks. I just realized we are always using [ns] for datetime. Updated.

return _mean(utils.datetime_to_numeric(array, offset), axis=axis,
skipna=skipna, **kwargs).astype(dtype) + offset
else:
return _mean(array, axis=axis, skipna=skipna, **kwargs)


mean.numeric_only = True


def _nd_cum_func(cum_func, array, axis, **kwargs):
array = asarray(array)
if axis is None:
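A self-contained sketch of the offset trick the new ``mean`` uses for ``datetime64`` input: subtract the minimum, average the resulting timedeltas as plain floats, then add the offset back. This is a plain-NumPy stand-in, not the xarray helper itself:

```python
import numpy as np

dates = np.array(['2000-01-01', '2000-01-03', '2000-01-05'],
                 dtype='datetime64[ns]')
offset = dates.min()
# datetime64 - datetime64 yields timedelta64[ns]; view it as float so the
# ordinary floating-point mean applies
numeric = (dates - offset).astype(np.float64)
mean_date = offset + numeric.mean().astype('timedelta64[ns]')
```

Because xarray coerces all NumPy dates to ``datetime64[ns]`` (as noted in the review thread above), the round trip through ``timedelta64[ns]`` is lossless here.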
4 changes: 2 additions & 2 deletions xarray/core/missing.py
@@ -7,7 +7,7 @@
import numpy as np
import pandas as pd

from . import rolling
from . import utils
from .common import _contains_datetime_like_objects
from .computation import apply_ufunc
from .duck_array_ops import dask_array_type
@@ -370,7 +370,7 @@ def _get_valid_fill_mask(arr, dim, limit):
None'''
kw = {dim: limit + 1}
# we explicitly use construct method to avoid copy.
new_dim = rolling._get_new_dimname(arr.dims, '_window')
new_dim = utils.get_temp_dimname(arr.dims, '_window')
return (arr.isnull().rolling(min_periods=1, **kw)
.construct(new_dim, fill_value=False)
.sum(new_dim, skipna=False)) <= limit
26 changes: 26 additions & 0 deletions xarray/core/ops.py
@@ -122,6 +122,20 @@
New {da_or_ds} object with `{name}` applied along its rolling dimension.
"""

_COARSEN_REDUCE_DOCSTRING_TEMPLATE = """\
Coarsen this object by applying `{name}` along its dimensions.

Parameters
----------
**kwargs : dict
Additional keyword arguments passed on to `{name}`.

Returns
-------
reduced : DataArray or Dataset
New object with `{name}` applied along its coarsened dimensions.
"""


def fillna(data, other, join="left", dataset_join="left"):
"""Fill missing values in this object with data from the other object.
@@ -378,3 +392,15 @@ def inject_datasetrolling_methods(cls):
func.__doc__ = _ROLLING_REDUCE_DOCSTRING_TEMPLATE.format(
name=func.__name__, da_or_ds='Dataset')
setattr(cls, 'count', func)


def inject_coarsen_methods(cls):
# standard numpy reduce methods
methods = [(name, getattr(duck_array_ops, name))
for name in NAN_REDUCE_METHODS]
for name, f in methods:
func = cls._reduce_method(f)
func.__name__ = name
func.__doc__ = _COARSEN_REDUCE_DOCSTRING_TEMPLATE.format(
name=func.__name__)
setattr(cls, name, func)
1 change: 0 additions & 1 deletion xarray/core/pdcompat.py
@@ -39,7 +39,6 @@


import numpy as np
import pandas as pd


# for pandas 0.19