Document Xarray zarr encoding conventions #4047

Merged on May 20, 2020 (3 commits)

@rabernat rabernat commented May 8, 2020

When we implemented the Zarr backend, we made some ad hoc choices about how to encode NetCDF data in Zarr. At this stage, it would be useful to explicitly document this encoding. I decided to put it on the "Xarray Internals" page, but I'm open to moving it if folks feel it fits better elsewhere.

cc @jeffdlb, @WardF, @DennisHeimbigner

@shoyer shoyer left a comment

Looks great, thank you!

@dcherian

Thanks @rabernat

@dcherian dcherian merged commit 261df2e into pydata:master May 20, 2020
@DennisHeimbigner

I have a couple of questions about _ARRAY_DIMENSIONS.
Let me make sure I understand how it is used.
Suppose I am given an array X with shape (10,20,30) and
an _ARRAY_DIMENSIONS attribute on X with the contents
_ARRAY_DIMENSIONS=["time", "lon", "lat"]
Then this is equivalent to the following partial netcdf CDL:
netcdf ... {
dims: time=10; lon=20; lat=30;
...}
Correct?
I assume that if there are conflicts where two variables end
up assigning different sizes to the same named dimension, then
that generates an error.
Finally, it is unclear where xarray puts these dimensions.
In the closest enclosing Group? Or in the root group?
=Dennis Heimbigner
Unidata

@rabernat

rabernat commented May 22, 2020

Thanks for the useful questions @DennisHeimbigner

> Suppose I am given an array X with shape (10,20,30) and
> an _ARRAY_DIMENSIONS attribute on X with the contents
> _ARRAY_DIMENSIONS=["time", "lon", "lat"]
> Then this is equivalent to the following partial netcdf CDL:
> netcdf ... {
> dims: time=10; lon=20; lat=30;
> ...}
> Correct?

Yes, correct

> I assume that if there are conflicts where two variables end
> up assigning different sizes to the same named dimension, then
> that generates an error.

Yes, correct as well. Understanding how this works requires me to describe some xarray internals. When decoding a Dataset, each array is decoded as an xarray.Variable. According to those docs, "a single Variable object is not fully described outside the context of its parent Dataset". The Zarr decoding process returns a Variable, which is basically a tuple of dims, data, attributes, and encoding, where dims is the list we got from _ARRAY_DIMENSIONS.

Once the variables have all been decoded, we put them together into a Dataset object. At that point, if different variables give inconsistent sizes for the same dimension name, an error will be raised. So far we haven't encountered this situation, because all the Zarr data we read tends to have also been written by Xarray, so it is consistent. But you could definitely manually hack a Zarr store to break this consistency, rendering it un-decodable by Xarray.
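
To make the consistency check concrete, here is a rough sketch in plain Python. It is not xarray's actual implementation: `collect_dimension_sizes` is a hypothetical helper, and the `(shape, dims)` pairs stand in for decoded variables, with `dims` playing the role of the `_ARRAY_DIMENSIONS` list. It only illustrates the idea that dimension sizes are inferred per array and must agree across arrays.

```python
def collect_dimension_sizes(arrays):
    """Build a {dimension name: size} mapping from several arrays.

    `arrays` maps an array name to a (shape, dims) pair, where `dims`
    plays the role of the _ARRAY_DIMENSIONS list. This is a hypothetical
    helper for illustration, not part of xarray or zarr.
    """
    sizes = {}
    for name, (shape, dims) in arrays.items():
        if len(shape) != len(dims):
            raise ValueError(
                f"array {name!r} has {len(shape)} axes but {len(dims)} dimension names"
            )
        for dim, size in zip(dims, shape):
            # The first array to mention a dimension fixes its size;
            # any later array must agree with it.
            if sizes.setdefault(dim, size) != size:
                raise ValueError(
                    f"conflicting sizes for dimension {dim!r}: "
                    f"{sizes[dim]} and {size} (from array {name!r})"
                )
    return sizes


consistent = {
    "X": ((10, 20, 30), ["time", "lon", "lat"]),
    "time": ((10,), ["time"]),
}
print(collect_dimension_sizes(consistent))
# {'time': 10, 'lon': 20, 'lat': 30}

inconsistent = {
    "X": ((10, 20, 30), ["time", "lon", "lat"]),
    "Y": ((5,), ["time"]),  # disagrees with X about the size of "time"
}
try:
    collect_dimension_sizes(inconsistent)
except ValueError as err:
    print("error:", err)
```

The first call corresponds to the CDL fragment above (time=10; lon=20; lat=30); the second mimics a manually edited store whose arrays disagree about a dimension.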

> Finally, it is unclear where xarray puts these dimensions.
> In the closest enclosing Group? Or in the root group?

I hoped this was clear in the documentation I wrote, which is now live here: http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification. What I said was:

> To accomplish this, Xarray developers decided to define a special Zarr array attribute: _ARRAY_DIMENSIONS. The value of this attribute is a list of dimension names (strings), for example ["time", "lon", "lat"]. When writing data to Zarr, Xarray sets this attribute on all variables based on the variable dimensions. When reading a Zarr group, Xarray looks for this attribute on all arrays, raising an error if it can’t be found. The attribute is used to define the variable dimension names and then removed from the attributes dictionary returned to the user.

An "array attribute" has a specific meaning in Zarr: it is the user metadata associated with an individual array. So the _ARRAY_DIMENSIONS attribute lives in the .zattrs file of each Zarr array. It is not a group-level attribute.
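
For illustration, the sketch below writes the metadata files of a minimal Zarr v2 directory store by hand, using only the standard library, and then reads the dimension names back from the array's own `.zattrs`. The helper names (`write_minimal_array`, `read_dims_and_attrs`) and the `units` attribute are made up for this example, and no chunk data is written. Xarray itself goes through the zarr library rather than touching these files directly, so treat this as a picture of the layout, not of the real code path.

```python
import json
import tempfile
from pathlib import Path

DIMS_KEY = "_ARRAY_DIMENSIONS"


def write_minimal_array(group_dir, name, shape, dims, user_attrs=None):
    """Write only the Zarr v2 metadata files for one array (no chunks)."""
    array_dir = Path(group_dir) / name
    array_dir.mkdir(parents=True)
    zarray = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(shape),
        "dtype": "<f8",
        "compressor": None,
        "fill_value": None,
        "filters": None,
        "order": "C",
    }
    (array_dir / ".zarray").write_text(json.dumps(zarray))
    # The dimension names sit next to ordinary user attributes in .zattrs.
    attrs = dict(user_attrs or {})
    attrs[DIMS_KEY] = list(dims)
    (array_dir / ".zattrs").write_text(json.dumps(attrs))


def read_dims_and_attrs(group_dir, name):
    """Return (dims, remaining attrs), raising if the attribute is missing."""
    zattrs_path = Path(group_dir) / name / ".zattrs"
    attrs = json.loads(zattrs_path.read_text()) if zattrs_path.exists() else {}
    if DIMS_KEY not in attrs:
        raise KeyError(f"array {name!r} has no {DIMS_KEY} attribute")
    # Remove the special attribute so it is hidden from the user-facing attrs.
    dims = tuple(attrs.pop(DIMS_KEY))
    return dims, attrs


with tempfile.TemporaryDirectory() as store:
    # The group itself only gets a .zgroup marker in this sketch.
    (Path(store) / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    write_minimal_array(
        store, "X", (10, 20, 30), ["time", "lon", "lat"], {"units": "K"}
    )
    dims, attrs = read_dims_and_attrs(store, "X")
    group_has_zattrs = (Path(store) / ".zattrs").exists()

print(dims)              # ('time', 'lon', 'lat')
print(attrs)             # {'units': 'K'}
print(group_has_zattrs)  # False
```

Nothing about dimensions is recorded at the group level here: the names live only in `X/.zattrs`, alongside the user's own attributes.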

As you pointed out on the last call, there are clearly some downsides to having chosen to store this important property with the rest of user metadata (.zattrs). However, it allowed us to move forward without any changes to the zarr spec.
