Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set _FillValue #181

Merged
merged 2 commits into from
Jun 22, 2022
Merged

Set _FillValue #181

merged 2 commits into from
Jun 22, 2022

Conversation

martindurant
Copy link
Member

Fixes #177

@rsignell-usgs
Copy link
Collaborator

@martindurant , I thought we wanted to set fill_value to be the value of _FillValue, not vice versa:
#177 (comment)

@martindurant
Copy link
Member Author

The code linked in the issue is what I was going on

@martindurant
Copy link
Member Author

@rabernat to choose: when creating the zarr output metadata, fill_value overwrites _FillValue, or the reverse?

@keewis
Copy link
Contributor

keewis commented Jun 21, 2022

as far as I understand #177 (comment), we should always ignore the HDF5 dataset fill value (fill_value) because it is static and not part of the netcdf4 spec.

@martindurant
Copy link
Member Author

@keewis , so we should only use it when _FillValue is not available? What about if the dataset is not netCDF, but regular HDF?

@keewis
Copy link
Contributor

keewis commented Jun 21, 2022

according to https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html:

It is not necessary to define your own _FillValue attribute for a variable if the default fill value for the type of the variable is adequate. However, use of the default fill value for data type byte is not recommended.

So that means you'd use the default fill value defined by the netcdf4 library if _FillValue is not set?

For HDF5 datasets it seems the correct thing to do is what kerchunk currently does: translate the HDF5 fill_value property to zarr's fill_value (and you will get a default if that is not set).


Re-reading the comments in #177, I get the impression that @ajelenak and @rabernat actually disagree on what to do (I might very well be misunderstanding something, though): @ajelenak would prefer to expose both fill_value and _FillValue from xarray's zarr backend and have decode_cf decide which to use, while @rabernat would like to have kerchunk set zarr's fill_value to _FillValue for netcdf4 datasets. Maybe they can clarify?

@rabernat
Copy link
Contributor

It's not a question of what I would like or not like. My goal was to communicate how Xarray handles fill_value with Zarr.

  1. When opening a Zarr store for reading, the Zarr array .fill_value is copied into the CF-style attribute _FillValue. So when the data are decoded whatever the the fill_value is presented to the user as NaN. In the code, you can see that any existing _FillValue attribute is overwritten. That is potentially a problem. However, that situation would never arise if you are using Xarray to write the data because...
  2. When writing an array, the _FillValue attribute is removed from attributes and placed in fill_value of the zarr array (code).

Creating a kerchunk Zarr is analogous to Xarray writing a Zarr store. So I recommend you follow the same pattern here. If _FillValue is present in the attributes, move it .fill_value of the array. Do not set the _FillValue attribute. Xarray will set it when decoding the data.

@rsignell-usgs
Copy link
Collaborator

@martindurant, as I understand it, the new plan is to modify the PR so that in the Kerchunk JSON we:

  • assign the value of _FillValue to fill_value.
  • remove the _FillValue attribute (to avoid confusion)

Correct?

@keewis
Copy link
Contributor

keewis commented Jun 22, 2022

to follow @rabernat's recommendation kerchunk and still keep supporting HDF datasets, I think we need to do a slight modification of that:

  • detect the dataset type (netcdf or HDF)
  • get the appropriate fill value (_FillValue for netcdf or the dataset fill value for HDF)
  • assign that to zarr's fill_value
  • clear any _FillValue attributes

In any case, @martindurant: I think this PR would fix #105 as well

@martindurant
Copy link
Member Author

detect the dataset type (netcdf or HDF)

I'm not sure this is a thing: all netCDF4 files are valid HDF5, but not vice versa; plus you can reasonably store netCDF datasets as parts of an HDF5 at some subpath.

@keewis
Copy link
Contributor

keewis commented Jun 22, 2022

well, the point is that netcdf4 does not use or set the HDF5 fill value, as far as I can tell, so we need to figure out which fill value to use with a particular dataset. That "detection" does not have to happen in code, though, it could just as well be a parameter to SingleHdfToZarr.

@rsignell-usgs
Copy link
Collaborator

rsignell-usgs commented Jun 22, 2022

@keewis I see your point -- if the HDF5 file is generated with NetCDF4, it will have _FillValue defined and we should use that for the Kerchunk fill_value, but if not we should use the HDF5 fill_value for the Kerchunk fill_value.

Is there a fast/definitive way to tell if an HDF5 file was generated by NetCDF4?

@martindurant
Copy link
Member Author

How about this version, then? Would someone like to test?

@rsignell-usgs
Copy link
Collaborator

@keewis
Copy link
Contributor

keewis commented Jun 22, 2022

Is there a fast/definitive way to tell if an HDF5 file was generated by NetCDF4?

It is probably not definitive, but https://github.com/pedro-vicente/netcdf-detect uses some internal attributes that the netcdf library sets to detect that. It is written in C++, but it should be possible to do the same using h5py (see also this post and the follow-ups on the netcdf mailing list).

@rsignell-usgs
Copy link
Collaborator

rsignell-usgs commented Jun 22, 2022

@keewis , so basically, check for:

  1. an attribute named "_Netcdf4Dimid" (in some cases)
  2. an attribute named "NAME", (always), saved by the HDF5 Dimension Scales API, that contains the string "This is a netCDF > dimension but not a netCDF variable."

This utility tries to detect both attributes by traversing the HDF5 file, if either case is found, it returns a value of 1

@martindurant martindurant marked this pull request as ready for review June 22, 2022 19:49
@martindurant martindurant merged commit 5df294f into fsspec:main Jun 22, 2022
@martindurant martindurant deleted the fill branch June 22, 2022 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FillValue issues with kerchunk 1e37=> 9.99999993e+36
4 participants