FillValue issues with kerchunk: 1e37 => 9.99999993e+36 #177
My hunch was that this is due to writing the value as text in the zarr attributes. But that raises the question of how this is normally handled outside of kerchunk, and why `decode_cf` matters.

---
@rsignell-usgs, do you mind using xarray's `to_zarr` on that dataset to see what `.zarray`/`.zattrs` it creates? You can set the output chunk store (as opposed to the metadata store) to be a dict, or even a custom dict that doesn't store anything (see below), to avoid writing the actual data.

```python
class NoOpDict(dict):
    def __setitem__(self, *_):
        pass

metadata_store = {}
chunk_store = NoOpDict()
```

---
this may be #105: if you look at the NetCDF file with …

---
@keewis, what is "fillvalue"? I've never heard of that attribute being used for fill values. That must be an HDF5-specific thing?

@martindurant, when I write Zarr, it's even more confusing. The … See cells [9] and [11] here: …

---
yes, exactly. You can examine it with `h5py`:

```python
In [1]: import fsspec
   ...: import h5py
   ...:
   ...: fs = fsspec.filesystem(
   ...:     "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
   ...: )
   ...: url = "s3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc"
   ...:
   ...: f = h5py.File(fs.open(url))
   ...: f["temp"].fillvalue
Out[1]: 0.0
```

For …, you might have to synchronise …:

```python
In [1]: import fsspec
   ...:
   ...: fs = fsspec.filesystem(
   ...:     "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
   ...: )
   ...: url = "s3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc"
   ...:
   ...: import kerchunk.hdf
   ...:
   ...: hdf = kerchunk.hdf.SingleHdf5ToZarr(fs.open(url), url=url)
   ...: references = hdf.translate()

In [2]: import xarray as xr
   ...:
   ...: r_opts = dict(anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"})

In [3]: fs = fsspec.filesystem(
   ...:     "reference", fo=references, remote_protocol="s3", remote_options=r_opts
   ...: )
   ...: m = fs.get_mapper("")
   ...:
   ...: ds = xr.open_dataset(
   ...:     m,
   ...:     engine="zarr",
   ...:     chunks={},
   ...:     backend_kwargs=dict(consolidated=False),
   ...:     decode_cf=False,
   ...: )

In [4]: ds.temp.attrs
Out[4]:
{'_FillValue': 0.0,
 'coordinates': 'lon_rho lat_rho s_rho ocean_time',
 'field': 'temperature, scalar, series',
 'grid': 'grid',
 'location': 'face',
 'long_name': 'potential temperature',
 'time': 'ocean_time',
 'units': 'Celsius'}

In [5]: ds.temp[0, 0, 0, 0].values
Out[5]: array(1.e+37, dtype=float32)

In [6]: decoded = xr.decode_cf(ds)
   ...: decoded.temp[0, 0, 0, 0].values
Out[6]: array(1.e+37, dtype=float32)

In [7]: import ujson
   ...: import itertools
   ...:
   ...:
   ...: def correct_fill_values(data):
   ...:     def fix_variable(values):
   ...:         zattrs = values[".zattrs"]
   ...:
   ...:         if "_FillValue" not in zattrs:
   ...:             return values
   ...:
   ...:         _FillValue = zattrs["_FillValue"]
   ...:         if values[".zarray"]["fill_value"] != _FillValue:
   ...:             values[".zarray"]["fill_value"] = _FillValue
   ...:
   ...:         return values
   ...:
   ...:     refs = data["refs"]
   ...:     prepared = (
   ...:         (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
   ...:     )
   ...:     filtered = (
   ...:         (key, ujson.loads(value))
   ...:         for key, value in prepared
   ...:         if key[1] in (".zattrs", ".zarray")
   ...:     )
   ...:     key = lambda i: i[0][0]
   ...:     grouped = (
   ...:         (name, {n[1]: v for n, v in group})
   ...:         for name, group in itertools.groupby(sorted(filtered, key=key), key=key)
   ...:     )
   ...:     fixed = ((name, fix_variable(var)) for name, var in grouped)
   ...:     flattened = {
   ...:         f"{name}/{item}": ujson.dumps(data, indent=4)
   ...:         for name, var in fixed
   ...:         for item, data in var.items()
   ...:     }
   ...:     data["refs"] = dict(sorted((refs | flattened).items()))
   ...:     return data

In [8]: corrected_refs = correct_fill_values(references)

In [9]: fs = fsspec.filesystem(
   ...:     "reference", fo=corrected_refs, remote_protocol="s3", remote_options=r_opts
   ...: )
   ...: m = fs.get_mapper("")
   ...:
   ...: ds = xr.open_dataset(
   ...:     m,
   ...:     engine="zarr",
   ...:     chunks={},
   ...:     backend_kwargs=dict(consolidated=False),
   ...:     decode_cf=False,
   ...: )
   ...: display(ds.temp.attrs, ds.temp[0, 0, 0, 0].values)
{'_FillValue': 1e+37,
 'coordinates': 'lon_rho lat_rho s_rho ocean_time',
 'field': 'temperature, scalar, series',
 'grid': 'grid',
 'location': 'face',
 'long_name': 'potential temperature',
 'time': 'ocean_time',
 'units': 'Celsius'}
array(1.e+37, dtype=float32)

In [10]: decoded = xr.decode_cf(ds)
    ...: decoded.temp[0, 0, 0, 0].values
Out[10]: array(nan, dtype=float32)
```

---
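The generator pipeline above can be boiled down to a flat, standard-library-only sketch of the same fix — copy `_FillValue` from each variable's `.zattrs` into its `.zarray` `fill_value` (the reference keys below are illustrative, not taken from the real file):

```python
import json

def sync_fill_values(refs):
    """Return a copy of a kerchunk refs dict where each variable's .zarray
    fill_value matches the _FillValue attribute in its .zattrs."""
    fixed = dict(refs)
    for key, value in refs.items():
        if not key.endswith("/.zattrs"):
            continue
        name = key.rsplit("/", 1)[0]
        zattrs = json.loads(value)
        if "_FillValue" not in zattrs:
            continue
        zarray_key = f"{name}/.zarray"
        zarray = json.loads(refs[zarray_key])
        zarray["fill_value"] = zattrs["_FillValue"]
        fixed[zarray_key] = json.dumps(zarray)
    return fixed

# toy refs dict with the mismatch from this issue
refs = {
    "temp/.zattrs": json.dumps({"_FillValue": 1e37}),
    "temp/.zarray": json.dumps({"fill_value": 0.0, "dtype": "<f4"}),
}
out = sync_fill_values(refs)
assert json.loads(out["temp/.zarray"])["fill_value"] == 1e37
```

Unlike the groupby version, this walks the keys directly, which makes the intent (one `.zattrs` → one `.zarray`) easier to see.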
That might be a precision / formatting issue:

```python
In [33]: np.float32(9.999999933815813e36)
Out[33]: 1e+37

In [34]: np.float64(9.999999933815813e36)
Out[34]: 9.999999933815813e+36
```

---
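A quick way to confirm that both spellings denote the same 32-bit value (numpy assumed):

```python
import numpy as np

f32 = np.float32(1e37)  # the nearest float32 to 1e37; displayed as 1e+37
f64 = float(f32)        # the exact same value, widened to double precision
print(f64)              # 9.999999933815813e+36 — same bits, more digits shown
assert np.float32(f64) == f32  # narrowing back is lossless, so no data changed
```

So `1e+37` and `9.999999933815813e+36` are two textual renderings of one float32; the trouble starts only when the longer rendering is parsed back as a float64 and compared against float32 data.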
I have set up a reproducible notebook here, manipulating some fill values: https://gist.github.com/peterm790/6e6ddb2df0f095056837d21c531bbe57. It seems that only … The …

---
just for reference, the code at the end of #177 (comment) successfully changes the references to work as expected (i.e. you get …)

---
@keewis, you mean the …?

---
yes, although the implementation feels a bit clunky

---
@rabernat, hate to drag you into everything, but I bet you will immediately know what to do here, as you've been involved with xarray, the CF conventions, and the xarray zarr writer. The whole issue is explained in this Reproducible Notebook (no credentials needed, as it uses OSN!)

---
Rather than using string formatting to assess equality, I would be looking at raw bytes, e.g.

```python
actual_value = ds.temp[0, 0, 0, 0].data
actual_value_bytes = actual_value.tobytes()
assert actual_value_bytes == b'\xc2\xbd\xf0|'
```

I dug into this and noticed that the kerchunk dataset is converting the fill value to a float64 type, while the original (working) zarr on disk is converting it to a float32 (the same as the original netCDF).

Digging into this reveals some pretty weird things about how netCDF, Zarr, and Xarray handle fill values. Note, for example, that the on-disk zarr does not actually keep the fill value in the attributes at all. Compare this to both the Kerchunk Zarr and the original NetCDF. Instead, the fill value lives in the zarr array metadata.

Thus a key problem here is that there are redundant / overlapping mechanisms for specifying fill values. When we implemented the Zarr backend, we decided to lift the Zarr array `fill_value` into the `_FillValue` attribute. I am no longer sure that this was the right choice. The zarr spec defines `fill_value` as

> A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

It is only ever used by zarr if you request data from a chunk that has not been initialized. It is NOT a mask. This is not quite the same as the NUG `_FillValue`, which CF-aware readers also use to mask missing data.

The kerchunk Zarr now has the HDF5 default `fill_value` (0.0) in `.zarray` alongside the netCDF `_FillValue` (≈1e37) in `.zattrs`.
Okay, so why does any of this matter? Because whether we use the zarr array internal `fill_value` or the `_FillValue` attribute, we end up with different types:

```python
z_array = zarr.open("foo.zarr")["temp"]
on_disk_zarr_fill_value = z_array.fill_value
assert z_array[0, 0, 0, 0] == on_disk_zarr_fill_value
on_disk_zarr_fill_value_bytes = on_disk_zarr_fill_value.tobytes()
print(type(on_disk_zarr_fill_value), on_disk_zarr_fill_value_bytes)
# -> <class 'numpy.float32'> b'\xc2\xbd\xf0|'
```

but

```python
kerchunk_z_array = zarr.open(m)['temp']
kerchunk_attrs_fill_value = kerchunk_z_array.attrs["_FillValue"]
print(type(kerchunk_attrs_fill_value))  # -> float
```

However, I am mystified by why this still works:

```python
assert kerchunk_z_array[0, 0, 0, 0] == kerchunk_attrs_fill_value
```

... but cf-decoding does not. I am currently looking into the xarray coding code to understand what is going wrong. In the meantime, one thing you might try is to modify the kerchunk dataset to use the zarr Array `fill_value`.

---
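The byte-level check suggested above can be reproduced standalone — on a little-endian machine, `b'\xc2\xbd\xf0|'` is exactly the float32 encoding of 1e37 (numpy assumed):

```python
import numpy as np

# float32(1e37) and float64(1e37) are entirely different bit patterns
assert np.float32(1e37).tobytes() == b"\xc2\xbd\xf0|"            # 4 bytes (little-endian)
assert np.float64(1e37).tobytes() != np.float32(1e37).tobytes()  # 8 bytes vs 4

# decoding those 4 bytes recovers the float32 value
assert np.frombuffer(b"\xc2\xbd\xf0|", dtype="<f4")[0] == np.float32(1e37)
```

Comparing bytes like this sidesteps any ambiguity introduced by text formatting or dtype promotion.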
Ok, something even more obvious:

```python
dsk = xr.open_dataset(m, engine="zarr", chunks={},
                      backend_kwargs=dict(consolidated=False), decode_cf=False)
dsk.temp._FillValue  # -> 0.0
```

So Xarray is actually overwriting the fill value already in the attributes with the zarr array `fill_value`! Manually overriding does the trick:

```python
dsk.temp.attrs['_FillValue'] = kerchunk_attrs_fill_value
xr.decode_cf(dsk).temp[0, 0, 0, 0].values  # -> NaN
```

This may be an xarray bug. The problem is ultimately the duplicate and inconsistent fill values.

---
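The manual override works because `xr.decode_cf` masks exactly those values that compare equal to the `_FillValue` attribute — a tiny synthetic example (the variable and data are invented):

```python
import numpy as np
import xarray as xr

raw = xr.Dataset(
    {"temp": ("x", np.array([1e37, 21.5], dtype="f4"), {"_FillValue": np.float32(1e37)})}
)
decoded = xr.decode_cf(raw)
assert np.isnan(decoded.temp.values[0])            # fill value became NaN
assert decoded.temp.values[1] == np.float32(21.5)  # real data untouched
```

With the wrong `_FillValue` in the attributes (0.0 here), nothing in the data matches, so nothing gets masked — which is exactly the symptom in this issue.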
This is the specific place where the overwriting happens: https://github.com/pydata/xarray/blob/d7931f9014a26e712ff5f30c4082cf0261f045d3/xarray/backends/zarr.py#L455-L456

So there are two simultaneous paths forward:

---
@rabernat, you are awesome! Thanks for showing us the path(s)!

---
To be fair, the original file was already inconsistent (the HDF5 dataset's `fillvalue` of 0.0 does not match its `_FillValue` attribute of 1e37).

That said, I agree that xarray should do something when it finds inconsistent fill values, in any file format.

---
@keewis, you said the original NetCDF4 file had …

and searching the resulting file for … The …

---
You saw the …

---
yes, exactly. I'm not sure where that gets the value from, though.

---
So it sounds like h5py also handles the fill values inconsistently! If the HDF5 model uses …

@ajelenak, do you have insight here?

---
Fill value is a property of an HDF5 dataset (same as a netCDF variable or Zarr array) which is assigned at its creation. If it is not set explicitly, the HDF5 library assigns the default value, which for all numerical datatypes is zero, I think. That should be shown with …

---
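This default can be observed with h5py alone — an in-memory sketch (dataset names invented); `Dataset.fillvalue` reports 0.0 unless a fill value was set explicitly at creation:

```python
import io

import h5py
import numpy as np

# h5py accepts a file-like object, so no file on disk is needed
with h5py.File(io.BytesIO(), "w") as f:
    default = f.create_dataset("no_fill", shape=(4,), dtype="f4")
    explicit = f.create_dataset("with_fill", shape=(4,), dtype="f4",
                                fillvalue=np.float32(1e37))
    default_fv = default.fillvalue   # library default for numeric types: 0.0
    explicit_fv = explicit.fillvalue # user-defined fill value

assert default_fv == 0.0
assert explicit_fv == np.float32(1e37)
```

This matches what was seen earlier in the thread: `f["temp"].fillvalue` returned 0.0 even though the `_FillValue` attribute said 1e37.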
That part of the kerchunk code is still the original that I wrote, and it does the right thing in my opinion: it translates the HDF5 dataset fill value to the Zarr array fill value. How to interpret the Zarr array fill value vs. its …

---
@ajelenak, thanks for this explanation! So if …

```python
with xarray.netcdf_and_cf_context as ds:
    answer = yes
```

---
This is explicitly how we coded it when we implemented the zarr backend. The array internal `fill_value` …

I think the problem is that Kerchunk does not use the same code path as Xarray when creating a new Zarr store, so it does not have this logic of setting ….

Another problem is that Xarray just overwrites any existing `_FillValue` attribute with the Zarr array's `fill_value`.

---
OK, so kerchunk is happy to fill in the …

---
@martindurant, I believe what we want is to make the Kerchunk …

---
Can anyone suggest a simple test case for #181?

---
This may or may not be directly related to this issue, but it was what I found when I googled my problem, so I'll add a comment here. I found that if a scalar value being kerchunked from a file is 0.0 and the fill value is 0.0, it comes into the dataset as NaN. As in:

…

returns

…

and also I know from the original model output that the value of …

This could also be fixed in preprocessing, but it seemed like it would be faster to do postprocessing once per variable. cc @martindurant, this is the issue I mentioned earlier today.

---
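The 0.0-as-fill pitfall described above is easy to reproduce with plain xarray (synthetic data; `zeta` is an invented name): once `_FillValue` is 0.0, a genuine 0.0 in the data is indistinguishable from the fill value.

```python
import numpy as np
import xarray as xr

raw = xr.Dataset({"zeta": ("t", np.array([0.0, 1.5], dtype="f4"), {"_FillValue": 0.0})})
decoded = xr.decode_cf(raw)
assert np.isnan(decoded.zeta.values[0])  # the legitimate 0.0 was masked to NaN
assert decoded.zeta.values[1] == np.float32(1.5)
```

CF decoding cannot distinguish "missing" from "really zero" here; only a different fill value in the source file (or dropping the attribute) avoids the collision.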
That pre/post-processing function could be included as an example in kerchunk itself, alongside the current …

This was netCDF/HDF data? I wonder if there's a flag to say "no missing values" for a whole array.

---
netCDF, yes.

---
You could put a `breakpoint()` in `SingleHdf5ToZarr._translator()` to see if there's anything about the particular …

---
Yes, debugging can help. Alternatively, the output of this command: …

---
---

In my original NetCDF4 file (HDF5 under the hood), the fill value `_FillValue` is set to `1.e+37` for 32-bit floats, and when loaded into xarray with `decode_cf=True`, the data values with 1e37 are correctly interpreted as fill values and set to NaN.

In the kerchunk JSON, the `_FillValue` is set to 9.99999993e+36, and when loaded into xarray with `decode_cf=True`, the data values are not interpreted as fill values, and show up as 9.99999993e+36 (not 1e37 and not NaN).

Pretty confused about what is going on here!

Here is a Reproducible Notebook.

cc @peterm790