-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
combine_by_coordinates to handle unnamed data arrays. #4696
Conversation
…converter to combine_by_coords to check for all DataArray case and convert to datasets.
…rrays and dataset input and with empty list.
…yling tools to match the other docstrings.
…xarray into aijams/combine-by-coords
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience @aijams
Sorry for the delay. I've added a few requests.
xarray/tests/test_combine.py
Outdated
|
||
def test_combine_coords_empty_list(self): | ||
expected = Dataset() | ||
actual = combine_by_coords([]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should raise an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw a function (test_empty_input
) above test_combine_coords_empty_list
that does exactly the same thing. It looks like this function isn't called anywhere, so I think it should be removed, especially if combining an empty list of datasets should raise an error as you suggest.
xarray/tests/test_combine.py
Outdated
DataArray([0, 1], dims=("x"), coords=({"x": [0, 1]})), | ||
DataArray([2, 3], dims=("x"), coords=({"x": [2, 3]})), | ||
] | ||
expected = Dataset({"_": ("x", [0, 1, 2, 3])}, coords={"x": [0, 1, 2, 3]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return a DataArray in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea is to start with a nested list of DataArrays and end up with a DataArray. This will change the return type from Dataset to (DataArray or Dataset).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if combine_by_coords
should accept a list of both DataArrays and Datasets? In such a case, I believe it should consider the DataArrays as a separate variable to those in the Datasets. The dummy variable would then be merged with the other variables in the resulting Dataset. Otherwise, this function could take either a list of DataArrays or a list of Datasets.
As for combine_nested
, it doesn't make sense in my mind to have it take both DataArrays and Datasets at the same time, since it aligns its input into a hypercube according to the nested list structure.
I will have combine_by_coords
take the "either or" approach for now. Let me know if you think it should take both types of input simultaneously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aijams the case of a list of both DataArrays and Datasets is the same as@dcherian's case (3) in this comment. We have to pass the objects down to merge so that it can raise the correct error.
I think it would be nice if we covered the case {DataArrays}->DataArray, but I'm not sure what the easiest way to do that is. If the dataarrays are unnamed, then the combined result will have some sort of default name, which we could check for and then demote to DataArray before returning. But if the DataArrays are named then at the return step I don't think we will know whether the input was originally a DA or DS before it got promoted?
return False | ||
return True | ||
|
||
|
||
def combine_by_coords( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also modify combine_nested
so the two are consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a test test_combine_nested_unnamed_data_arrays
that passes a list of unnamed DataArray
s into combine_nested
and it produces the expected output. Can you clarify what about combine_nested
you want to be consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I know what @dcherian meant - at first glance it looks like the _combine_single_variable_hypercube
refactoring is a change that would also simplify the code in combine_nested
. But looking more closely I don't think it actually makes sense to do that, does it? It seems about as neat as it can be as is.
xarray/core/combine.py
Outdated
# If a set of unnamed data arrays is provided, these arrays are assumed to belong | ||
# to the same variable and should be combined. | ||
if _all_unnamed_data_arrays(data_objects): | ||
datasets = [Dataset({"_": data_array}) for data_array in data_objects] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be using ._to_temp_dataset
here? See @dcherian 's comment here.
xarray/tests/test_combine.py
Outdated
DataArray([0, 1], dims=("x"), coords=({"x": [0, 1]})), | ||
DataArray([2, 3], dims=("x"), coords=({"x": [2, 3]})), | ||
] | ||
expected = Dataset({"_": ("x", [0, 1, 2, 3])}, coords={"x": [0, 1, 2, 3]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aijams the case of a list of both DataArrays and Datasets is the same as@dcherian's case (3) in this comment. We have to pass the objects down to merge so that it can raise the correct error.
I think it would be nice if we covered the case {DataArrays}->DataArray, but I'm not sure what the easiest way to do that is. If the dataarrays are unnamed, then the combined result will have some sort of default name, which we could check for and then demote to DataArray before returning. But if the DataArrays are named then at the return step I don't think we will know whether the input was originally a DA or DS before it got promoted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -593,8 +652,9 @@ def combine_by_coords( | |||
|
|||
Parameters | |||
---------- | |||
datasets : sequence of xarray.Dataset | |||
Dataset objects to combine. | |||
data_objects : sequence of xarray.Dataset or sequence of xarray.DataArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is renaming a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It technically is a breaking change - if someone was previously passing combine_by_coords(datasets=...)
then this change will break their code. But renaming the argument does make sense with this PR. Not sure whether that means this justifies a deprecation cycle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it probably does. But only a few extra lines and easy to copy from elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what justifies a deprecation cycle in your project, or how one is performed. Can someone give me some guidance on this, seeing as this change will probably need one according to @max-sixty?
I also agree that renaming the argument makes sense here as data arrays and data sets are distinguished as two different things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what justifies a deprecation cycle in your project, or how one is performed.
No worries! (It's very briefly mentioned in our contributing guide, but maybe we should expand that...)
xarray has loads of regular users, and we don't want them to find that downloading a new version of xarray breaks their perfectly good code, even in some minor way. Therefore we normally hold their hand through any changes by warning them in the version before, or possibly by anticipating the way in which we might need to make sure that their old way of using the functions still works temporarily, to give them time to switch to our new way.
In this case the only people who could be affected are people who are currently passing datasets
as a named argument to combine_by_coords
, so I think we want to catch that specific possibility and tell them to change the named argument to data_objects
in future. So perhaps by adding the argument back in with a temporary check like
import warnings
def combine_by_coords(data_objects, ..., datasets=None):
# TODO remove after version 0.19, see PR4696
if datasets is not None:
warnings.warn("The datasets argument has been renamed to `data_objects`. In future passing a value for datasets will raise an error.")
data_objects = datasets
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I'm somewhat concerned about users who ignore the warning (many software projects I've seen generate a lot of warnings), however I don't think there's much that can be done since the method signature will have to change anyway.
I defer to @TomNicholas =) |
def combine_by_coords( | ||
data_objects, | ||
data_objects=[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_objects=[], | |
data_objects, |
(It's considered bad practice to have mutable default arguments to functions in python.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this in because if someone calls this method with datasets as a named parameter the data_objects argument would be unspecified and their code would break with an unspecified argument error. This is part of the deprecation warning below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You make a good point, but that means the default argument should be None
, not an empty list, as None
is immutable.
ValueError, | ||
match=r"Can't combine datasets with unnamed arrays." | ||
): | ||
ValueError, match=r"Can't combine datasets with unnamed arrays." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueError, match=r"Can't combine datasets with unnamed arrays." | |
ValueError, match=r"Can't combine datasets with unnamed dataarrays." |
Tiny clarification that this means datasets with other xarray.datarrays, not something about the numpy arrays inside the xarray.dataset objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that (aside from two miniscule suggestions) this is ready to merge!
Sorry about the closing/reopening. I thought I was supposed to do something to get the code merged in to the official master. I accidentally pressed the wrong button thinking it did something else. |
It looks like this PR has been neglected for a while. I haven't heard from a maintainer whether they will be merging in this PR any time soon. Edit: I just noticed that @TomNicholas plans on adding type hints to the two functions |
Sorry @aijams ! I was following our broad rule of "two maintainers approve before merging", but this has sat here for absolutely ages and had feedback from others previously, so maybe I will just merge it now. We really need a better system for catching PRs that just sit around :/ |
Sorry @aijams ! Thank you for the reminder and forgive the delay. We appreciate the contribution (more than our responsiveness suggests!) |
isort . && black . && mypy . && flake8
whats-new.rst
api.rst