combine_by_coordinates to handle unnamed data arrays. #4696

Merged: 37 commits, merged on Jul 2, 2021

Commits
b35de8e  Added test for combine_by_coords changes. (aijams, Dec 10, 2020)
f966e76  Modified test case to expect a dataset instead of a DataArray. Added … (aijams, Dec 11, 2020)
68b7b49  Added tests to check combine_by_coords for exception with mixed DataA… (aijams, Dec 12, 2020)
540961f  Formatting changes after running black (aijams, Dec 15, 2020)
1c9b4c2  Added underscore to helper function to label as private. (aijams, Dec 15, 2020)
cb5ed5e  Black formatting changes for whats-new doc file. (aijams, Dec 15, 2020)
77020c0  Removed imports in docstring that were automatically added by code st… (aijams, Dec 17, 2020)
6af896b  Merge branch 'master' into aijams/combine-by-coords (aijams, Dec 17, 2020)
f06371a  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, Dec 19, 2020)
7cdeabb  Merge branch 'aijams/combine-by-coords' of https://github.com/aijams/… (aijams, Dec 19, 2020)
6190839  Removed duplicate new item line in whats-new. (aijams, Dec 19, 2020)
3055000  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, Jan 30, 2021)
cbc002f  combine methods now accept unnamed DataArrays as input. (aijams, Apr 15, 2021)
11a868b  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, Apr 15, 2021)
89ac962  combine nested test checks nested lists of unnamed DataArrays. (aijams, Apr 15, 2021)
5f3afa5  Made combine_by_coords more readable. (aijams, Apr 15, 2021)
feb90ce  Cosmetic changes to code style. (aijams, Apr 15, 2021)
db5b906  Merging changes from first PR. (aijams, Apr 18, 2021)
e884f52  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, Apr 18, 2021)
0044bb9  Removed extra test from merge with previous PR. (aijams, Apr 18, 2021)
44548ee  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, May 4, 2021)
5fe8323  Updated test to use pytest.raises instead of raises_regex. (aijams, May 4, 2021)
55f53b9  Merged latests changes from upstream. (aijams, May 10, 2021)
805145c  Added breaking-change entry to whats new page. (aijams, May 11, 2021)
3eed47a  Merged new changes from master branch. (aijams, May 11, 2021)
05faa88  Added deprecation warning to combine_coords (aijams, May 11, 2021)
6c75525  Removed index monotonicity checking temporarily. (aijams, May 11, 2021)
2c43030  Removed duplicate entries from whats new page. (aijams, May 12, 2021)
f6fae25  Removed TODO message (aijams, May 12, 2021)
81ec1ff  Added test for combine_nested. (aijams, May 16, 2021)
caaee74  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, May 16, 2021)
637d4cc  Added check to combine methods to clarify parameter requirements. (aijams, May 16, 2021)
b5940a1  Reassigned description of changes to bug fixes category. (aijams, May 19, 2021)
d02da23  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, May 19, 2021)
04cd5f8  Minor style changes. (aijams, May 19, 2021)
e58a9e2  Added blank line for style purposes. (aijams, May 19, 2021)
c0fc4f1  Merge remote-tracking branch 'upstream/master' into aijams/combine-by… (aijams, Jun 10, 2021)

16 changes: 14 additions & 2 deletions doc/whats-new.rst
@@ -106,7 +106,6 @@ Thomas Nicholas, Tom Nicholas, Zachary Moon.

New Features
~~~~~~~~~~~~

- Implement :py:meth:`DataArray.drop_duplicates`
to remove duplicate dimension values (:pull:`5239`).
By `Andrew Huang <https://github.com/ahuang11>`_.
@@ -119,9 +118,22 @@ New Features
- Raise more informative error when decoding time variables with invalid reference dates.
(:issue:`5199`, :pull:`5288`). By `Giacomo Caria <https://github.com/gcaria>`_.

Breaking changes
~~~~~~~~~~~~~~~~
- The main parameter to :py:func:`combine_by_coords` is renamed to `data_objects` instead
of `datasets` so anyone calling this method using a named parameter will need to update
the name accordingly (:issue:`3248`, :pull:`4696`).
By `Augustus Ijams <https://github.com/aijams>`_.

Deprecations
~~~~~~~~~~~~


Bug fixes
~~~~~~~~~

- :py:func:`combine_by_coords` can now handle combining a list of unnamed
``DataArray`` as input (:issue:`3248`, :pull:`4696`).
By `Augustus Ijams <https://github.com/aijams>`_.
- Opening netCDF files from a path that doesn't end in ``.nc`` without supplying
an explicit ``engine`` works again (:issue:`5295`), fixing a bug introduced in
0.18.0.
164 changes: 124 additions & 40 deletions xarray/core/combine.py
@@ -1,4 +1,5 @@
import itertools
import warnings
from collections import Counter

import pandas as pd
@@ -8,6 +9,7 @@
from .dataarray import DataArray
from .dataset import Dataset
from .merge import merge
from .utils import iterate_nested


def _infer_concat_order_from_positions(datasets):
@@ -544,6 +546,15 @@ def combine_nested(
concat
merge
"""
mixed_datasets_and_arrays = any(
isinstance(obj, Dataset) for obj in iterate_nested(datasets)
) and any(
isinstance(obj, DataArray) and obj.name is None
for obj in iterate_nested(datasets)
)
if mixed_datasets_and_arrays:
raise ValueError("Can't combine datasets with unnamed arrays.")

if isinstance(concat_dim, (str, DataArray)) or concat_dim is None:
concat_dim = [concat_dim]

@@ -565,18 +576,79 @@ def vars_as_keys(ds):
return tuple(sorted(ds))


def combine_by_coords(
def _combine_single_variable_hypercube(
Member: This is a nice way to refactor; it makes it easier to reason about the code.

datasets,
fill_value=dtypes.NA,
data_vars="all",
coords="different",
compat="no_conflicts",
join="outer",
combine_attrs="no_conflicts",
):
"""
Attempt to combine a list of Datasets into a hypercube using their
coordinates.

All provided Datasets must belong to a single variable, i.e. must be
assigned the same variable name. This precondition is not checked by this
function, so the caller is assumed to know what it's doing.

This function is NOT part of the public API.
"""
if len(datasets) == 0:
raise ValueError(
"At least one Dataset is required to resolve variable names "
"for combined hypercube."
)
Comment on lines +598 to +602

Member: Can this error ever be reached? Wouldn't the

    if not data_objects:
        return Dataset()

stop this from being triggered?

Contributor (author): I intended this check to be part of the internal contract for _combine_single_variable_hypercube, since it doesn't make sense to build an empty hypercube with this method. There would be no way to determine the variable involved, so a default sentinel name would have to be introduced.

combined_ids, concat_dims = _infer_concat_order_from_coords(list(datasets))

if fill_value is None:
# check that datasets form complete hypercube
_check_shape_tile_ids(combined_ids)
else:
# check only that all datasets have same dimension depth for these
# vars
_check_dimension_depth_tile_ids(combined_ids)

# Concatenate along all of concat_dims one by one to create single ds
concatenated = _combine_nd(
combined_ids,
concat_dims=concat_dims,
data_vars=data_vars,
coords=coords,
compat=compat,
fill_value=fill_value,
join=join,
combine_attrs=combine_attrs,
)

# Check the overall coordinates are monotonically increasing
for dim in concat_dims:
indexes = concatenated.indexes.get(dim)
if not (indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing):
raise ValueError(
"Resulting object does not have monotonic"
" global indexes along dimension {}".format(dim)
)

return concatenated


# TODO remove empty list default param after version 0.19, see PR4696
def combine_by_coords(
Contributor: Can we also modify combine_nested so the two are consistent?

Contributor (author): I wrote a test, test_combine_nested_unnamed_data_arrays, that passes a list of unnamed DataArrays into combine_nested and it produces the expected output. Can you clarify what about combine_nested you want to be consistent?

Member: I think I know what @dcherian meant - at first glance it looks like the _combine_single_variable_hypercube refactoring would also simplify the code in combine_nested. But looking more closely I don't think it actually makes sense to do that, does it? It seems about as neat as it can be as is.

data_objects=[],
Member: Suggested change:

    -    data_objects=[],
    +    data_objects,

(It's considered bad practice to have mutable default arguments to functions in Python.)

Contributor (author): I put this in because if someone calls this method with datasets as a named parameter, the data_objects argument would be unspecified and their code would break with a missing-argument error. This is part of the deprecation warning below.

Member: You make a good point, but that means the default argument should be None, not an empty list, as None is immutable.
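As an aside on the point above, here is a minimal, generic Python sketch (not code from this PR; the function names are made up) of why a mutable default argument is risky and how a None sentinel avoids it:

    # A shared default list persists across calls:
    def append_bad(item, bucket=[]):
        bucket.append(item)
        return bucket

    append_bad(1)  # [1]
    append_bad(2)  # [1, 2]  <- state leaked from the first call

    # Using None as a sentinel creates a fresh list on every call:
    def append_good(item, bucket=None):
        if bucket is None:
            bucket = []
        bucket.append(item)
        return bucket

    append_good(1)  # [1]
    append_good(2)  # [2]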

compat="no_conflicts",
data_vars="all",
coords="different",
fill_value=dtypes.NA,
join="outer",
combine_attrs="no_conflicts",
datasets=None,
):
"""
Attempt to auto-magically combine the given datasets into one by using
dimension coordinates.
Attempt to auto-magically combine the given datasets (or data arrays)
into one by using dimension coordinates.

This method attempts to combine a group of datasets along any number of
dimensions into a single entity by inspecting coords and metadata and using
@@ -600,8 +672,9 @@ def combine_by_coords(

Parameters
----------
datasets : sequence of xarray.Dataset
Dataset objects to combine.
data_objects : sequence of xarray.Dataset or sequence of xarray.DataArray
Collaborator: Is renaming a breaking change?

Member: It technically is a breaking change - if someone was previously passing combine_by_coords(datasets=...) then this change will break their code. But renaming the argument does make sense with this PR. Not sure whether that means this justifies a deprecation cycle?

Collaborator: I think it probably does. But it's only a few extra lines and easy to copy from elsewhere.

Contributor (author): I have no idea what justifies a deprecation cycle in your project, or how one is performed. Can someone give me some guidance on this, seeing as this change will probably need one according to @max-sixty? I also agree that renaming the argument makes sense here, as data arrays and datasets are distinguished as two different things.

Member: No worries! (It's very briefly mentioned in our contributing guide, but maybe we should expand that...)

xarray has loads of regular users, and we don't want them to find that downloading a new version of xarray breaks their perfectly good code, even in some minor way. Therefore we normally hold their hand through any changes by warning them in the version before, or by making sure that their old way of using the functions still works temporarily, to give them time to switch to the new way.

In this case the only people who could be affected are people who are currently passing datasets as a named argument to combine_by_coords, so I think we want to catch that specific possibility and tell them to change the named argument to data_objects in future. So perhaps by adding the argument back in with a temporary check like:

    import warnings

    def combine_by_coords(data_objects, ..., datasets=None):

        # TODO remove after version 0.19, see PR4696
        if datasets is not None:
            warnings.warn("The datasets argument has been renamed to `data_objects`. In future passing a value for datasets will raise an error.")
            data_objects = datasets

Does that make sense?

Contributor (author): That makes sense. I'm somewhat concerned about users who ignore the warning (many software projects I've seen generate a lot of warnings), but I don't think there's much that can be done, since the method signature will have to change anyway.
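For illustration, a rough caller-side sketch of the rename under the shim above (the example datasets are made up, and it assumes the temporary datasets= keyword shim shown earlier is in place):

    import xarray as xr

    ds1 = xr.Dataset({"a": ("x", [1, 2])}, coords={"x": [0, 1]})
    ds2 = xr.Dataset({"a": ("x", [3, 4])}, coords={"x": [2, 3]})

    # Old keyword still works during the deprecation window, but emits a warning:
    combined = xr.combine_by_coords(datasets=[ds1, ds2])

    # Preferred going forward: pass the objects positionally (or as data_objects=...):
    combined = xr.combine_by_coords([ds1, ds2])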

Data objects to combine.

compat : {"identical", "equals", "broadcast_equals", "no_conflicts", "override"}, optional
String indicating how to compare variables of the same name for
potential conflicts:
@@ -776,51 +849,62 @@ def combine_by_coords(
precipitation (y, x) float64 0.4376 0.8918 0.9637 ... 0.5684 0.01879 0.6176
"""

# Group by data vars
sorted_datasets = sorted(datasets, key=vars_as_keys)
grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)

# Perform the multidimensional combine on each group of data variables
# before merging back together
concatenated_grouped_by_data_vars = []
for vars, datasets_with_same_vars in grouped_by_vars:
combined_ids, concat_dims = _infer_concat_order_from_coords(
list(datasets_with_same_vars)
# TODO remove after version 0.19, see PR4696
if datasets is not None:
warnings.warn(
"The datasets argument has been renamed to `data_objects`."
" In future passing a value for datasets will raise an error."
)
data_objects = datasets

if fill_value is None:
# check that datasets form complete hypercube
_check_shape_tile_ids(combined_ids)
else:
# check only that all datasets have same dimension depth for these
# vars
_check_dimension_depth_tile_ids(combined_ids)
if not data_objects:
return Dataset()

# Concatenate along all of concat_dims one by one to create single ds
concatenated = _combine_nd(
combined_ids,
concat_dims=concat_dims,
mixed_arrays_and_datasets = any(
isinstance(data_object, DataArray) and data_object.name is None
for data_object in data_objects
) and any(isinstance(data_object, Dataset) for data_object in data_objects)
if mixed_arrays_and_datasets:
raise ValueError("Can't automatically combine datasets with unnamed arrays.")

all_unnamed_data_arrays = all(
isinstance(data_object, DataArray) and data_object.name is None
for data_object in data_objects
)
if all_unnamed_data_arrays:
unnamed_arrays = data_objects
temp_datasets = [data_array._to_temp_dataset() for data_array in unnamed_arrays]

combined_temp_dataset = _combine_single_variable_hypercube(
temp_datasets,
fill_value=fill_value,
data_vars=data_vars,
coords=coords,
compat=compat,
fill_value=fill_value,
join=join,
combine_attrs=combine_attrs,
)
return DataArray()._from_temp_dataset(combined_temp_dataset)

# Check the overall coordinates are monotonically increasing
# TODO (benbovy - flexible indexes): only with pandas.Index?
for dim in concat_dims:
indexes = concatenated.xindexes.get(dim)
if not (
indexes.array.is_monotonic_increasing
or indexes.array.is_monotonic_decreasing
):
raise ValueError(
"Resulting object does not have monotonic"
" global indexes along dimension {}".format(dim)
)
concatenated_grouped_by_data_vars.append(concatenated)
else:
# Group by data vars
sorted_datasets = sorted(data_objects, key=vars_as_keys)
grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)

# Perform the multidimensional combine on each group of data variables
# before merging back together
concatenated_grouped_by_data_vars = []
for vars, datasets_with_same_vars in grouped_by_vars:
concatenated = _combine_single_variable_hypercube(
list(datasets_with_same_vars),
fill_value=fill_value,
data_vars=data_vars,
coords=coords,
compat=compat,
join=join,
combine_attrs=combine_attrs,
)
concatenated_grouped_by_data_vars.append(concatenated)

return merge(
concatenated_grouped_by_data_vars,
8 changes: 8 additions & 0 deletions xarray/core/utils.py
@@ -900,3 +900,11 @@ class Default(Enum):


_default = Default.token


def iterate_nested(nested_list):
for item in nested_list:
if isinstance(item, list):
yield from iterate_nested(item)
else:
yield item
68 changes: 68 additions & 0 deletions xarray/tests/test_combine.py
@@ -646,6 +646,47 @@ def test_combine_nested_fill_value(self, fill_value):
actual = combine_nested(datasets, concat_dim="t", fill_value=fill_value)
assert_identical(expected, actual)

def test_combine_nested_unnamed_data_arrays(self):
unnamed_array = DataArray(data=[1.0, 2.0], coords={"x": [0, 1]}, dims="x")

actual = combine_nested([unnamed_array], concat_dim="x")
expected = unnamed_array
assert_identical(expected, actual)

unnamed_array1 = DataArray(data=[1.0, 2.0], coords={"x": [0, 1]}, dims="x")
unnamed_array2 = DataArray(data=[3.0, 4.0], coords={"x": [2, 3]}, dims="x")

actual = combine_nested([unnamed_array1, unnamed_array2], concat_dim="x")
expected = DataArray(
data=[1.0, 2.0, 3.0, 4.0], coords={"x": [0, 1, 2, 3]}, dims="x"
)
assert_identical(expected, actual)

da1 = DataArray(data=[[0.0]], coords={"x": [0], "y": [0]}, dims=["x", "y"])
da2 = DataArray(data=[[1.0]], coords={"x": [0], "y": [1]}, dims=["x", "y"])
da3 = DataArray(data=[[2.0]], coords={"x": [1], "y": [0]}, dims=["x", "y"])
da4 = DataArray(data=[[3.0]], coords={"x": [1], "y": [1]}, dims=["x", "y"])
objs = [[da1, da2], [da3, da4]]

expected = DataArray(
data=[[0.0, 1.0], [2.0, 3.0]],
coords={"x": [0, 1], "y": [0, 1]},
dims=["x", "y"],
)
actual = combine_nested(objs, concat_dim=["x", "y"])
assert_identical(expected, actual)

# TODO aijams - Determine if this test is appropriate.
def test_nested_combine_mixed_datasets_arrays(self):
objs = [
DataArray([0, 1], dims=("x"), coords=({"x": [0, 1]})),
Dataset({"x": [2, 3]}),
]
with pytest.raises(
ValueError, match=r"Can't combine datasets with unnamed arrays."
Member: Suggested change:

    -            ValueError, match=r"Can't combine datasets with unnamed arrays."
    +            ValueError, match=r"Can't combine datasets with unnamed dataarrays."

Tiny clarification that this means datasets combined with other xarray.DataArrays, not something about the numpy arrays inside the xarray.Dataset objects.

Contributor (author): Good clarification.

):
combine_nested(objs, "x")


class TestCombineAuto:
def test_combine_by_coords(self):
@@ -689,6 +730,17 @@ def test_combine_by_coords(self):
def test_empty_input(self):
assert_identical(Dataset(), combine_by_coords([]))

def test_combine_coords_mixed_datasets_arrays(self):
objs = [
DataArray([0, 1], dims=("x"), coords=({"x": [0, 1]})),
Dataset({"x": [2, 3]}),
]
with pytest.raises(
ValueError,
match=r"Can't automatically combine datasets with unnamed arrays.",
):
combine_by_coords(objs)

@pytest.mark.parametrize(
"join, expected",
[
@@ -992,6 +1044,22 @@ def test_combine_by_coords_incomplete_hypercube(self):
with pytest.raises(ValueError):
combine_by_coords([x1, x2, x3], fill_value=None)

def test_combine_by_coords_unnamed_arrays(self):
unnamed_array = DataArray(data=[1.0, 2.0], coords={"x": [0, 1]}, dims="x")

actual = combine_by_coords([unnamed_array])
expected = unnamed_array
assert_identical(expected, actual)

unnamed_array1 = DataArray(data=[1.0, 2.0], coords={"x": [0, 1]}, dims="x")
unnamed_array2 = DataArray(data=[3.0, 4.0], coords={"x": [2, 3]}, dims="x")

actual = combine_by_coords([unnamed_array1, unnamed_array2])
expected = DataArray(
data=[1.0, 2.0, 3.0, 4.0], coords={"x": [0, 1, 2, 3]}, dims="x"
)
assert_identical(expected, actual)


@requires_cftime
def test_combine_by_coords_distant_cftime_dates():
17 changes: 16 additions & 1 deletion xarray/tests/test_utils.py
@@ -8,7 +8,7 @@
from xarray.coding.cftimeindex import CFTimeIndex
from xarray.core import duck_array_ops, utils
from xarray.core.indexes import PandasIndex
from xarray.core.utils import either_dict_or_kwargs
from xarray.core.utils import either_dict_or_kwargs, iterate_nested

from . import assert_array_equal, requires_cftime, requires_dask
from .test_coding_times import _all_cftime_date_types
@@ -318,3 +318,18 @@ def test_infix_dims(supplied, all_, expected):
def test_infix_dims_errors(supplied, all_):
with pytest.raises(ValueError):
list(utils.infix_dims(supplied, all_))


@pytest.mark.parametrize(
"nested_list, expected",
[
([], []),
([1], [1]),
([1, 2, 3], [1, 2, 3]),
([[1]], [1]),
([[1, 2], [3, 4]], [1, 2, 3, 4]),
([[[1, 2, 3], [4]], [5, 6]], [1, 2, 3, 4, 5, 6]),
],
)
def test_iterate_nested(nested_list, expected):
assert list(iterate_nested(nested_list)) == expected