Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_FillValue from a classic netCDF file does not end up in the kerchunk json #385

Closed
noaaroland opened this issue Nov 1, 2023 · 4 comments · Fixed by #389
Closed

_FillValue from a classic netCDF file does not end up in the kerchunk json #385

noaaroland opened this issue Nov 1, 2023 · 4 comments · Fixed by #389

Comments

@noaaroland
Copy link
Contributor

So I spent some time reading through all of the discussion of #177 and it seems like I'm having the exact same problem with a classic netCDF file (not an HDF file under the hood). I am using these versions:

kerchunk version =  0.2.2
xarray version =  2023.10.1

I using this file: wget https://data.pmel.noaa.gov/aclim/thredds/fileServer/B10K-K20_CORECFS/Level1/1995-1999/B10K-K20_CORECFS_1995-1999_average_Jel.nc

When I open the netCDF using xarray, the "fill values" are set to nan:

ds1 = xr.open_dataset('./B10K-K20_CORECFS_1995-1999_average_Jel.nc')
ds1.Jel[100,0,250,175].values

array(nan, dtype=float32)

And I naively kerchunk it with this code, read the result with xarray, the "file value" is set to 1.e37:

fs1 = fsspec.filesystem('file')
ncchunks = NetCDF3ToZarr('./B10K-K20_CORECFS_1995-1999_average_Jel.nc')
with fs1.open('./B10K-K20_CORECFS_1995-1999_average_Jel.json', 'wb') as f:
        f.write(ujson.dumps(ncchunks.translate()).encode())
ds2 = xr.open_dataset('./B10K-K20_CORECFS_1995-1999_average_Jel.json')
ds2.Jel[100,0,250,175].values
reference://
array(1.e+37, dtype=float32)

When I write from the netCDF xarray data set to zarr, the "fill value" is preserved when reading the resulting zarr.

ds1.to_zarr('./B10K-K20_CORECFS_1995-1999_average_Jel.zarr')
ds3 = xr.open_dataset('./B10K-K20_CORECFS_1995-1999_average_Jel.zarr', engine='zarr')
ds3.Jel[100,0,250,175].values

array(nan, dtype=float32)

The use of decode_cf=True didn't change anything regarding the value of the "fill value".

@noaaroland
Copy link
Contributor Author

For what it's worth, if I start with same netCDF file above and first convert it to netCDF4, then kerchunk it using SingleHdf5ToZarr the "fill value" is converted to a nan as expected when reading the resulting json file with xarray.

ncks --fl_fmt=netcdf4_classic B10K-K20_CORECFS_1995-1999_average_Jel.nc B10K-K20_CORECFS_1995-
1999_average_Jel.nc4
u = 'B10K-K20_CORECFS_1995-1999_average_Jel.nc4'
so = dict(mode='rb', default_fill_cache=False, default_cache_type='first')
with fs1.open(u, **so) as infile:
    h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
    with fs1.open('B10K-K20_CORECFS_1995-1999_average_Jel_4.json', 'wb') as f:
        f.write(ujson.dumps(h5chunks.translate()).encode())
ds4 = xr.open_dataset('B10K-K20_CORECFS_1995-1999_average_Jel_4.json', decode_cf=True)
ds4.Jel[100,0,250,175].values
reference://
B10K-K20_CORECFS_1995-1999_average_Jel_4.json
B10K-K20_CORECFS_1995-1999_average_Jel_4.json
B10K-K20_CORECFS_1995-1999_average_Jel_4.json
array(nan, dtype=float32)

@martindurant
Copy link
Member

In netCDF3,oy, we have

                fill = var._attributes.get("missing_value", None)
...
            arr.attrs.update(
                {
                    k: v.decode() if isinstance(v, bytes) else str(v)
                    for k, v in var._attributes.items()
                    if k not in ["_FillValue", "missing_value"]
                }
            )

so perhaps the first line should consider both possible versions of fill as done in the second block?

-                fill = var._attributes.get("missing_value", None)
+                fill = var._attributes.get("missing_value", None) or var._attributes.get("_FillValue", None)

@noaaroland
Copy link
Contributor Author

I tested this change yesterday and it worked as I expected, so considering the _FillValue as well as the missing in whatever way makes sense to you would likely solve my problem.

diff --git a/kerchunk/netCDF3.py b/kerchunk/netCDF3.py
index 01f573c..8a521d5 100644
--- a/kerchunk/netCDF3.py
+++ b/kerchunk/netCDF3.py
@@ -172,6 +172,8 @@ class NetCDF3ToZarr(netcdf_file):
                 # simple array block
                 # TODO: chance to sub-chunk
                 fill = var._attributes.get("missing_value", None)
+                if fill is None:
+                    fill = var._attributes.get("_FillValue", None)
                 if fill is not None and var.data.dtype.kind == "f":
                     fill = float(fill)
                 if fill is not None and var.data.dtype.kind == "i":
@@ -221,6 +223,8 @@ class NetCDF3ToZarr(netcdf_file):

                 # TODO: avoid this code repeat
                 fill = var._attributes.get("missing_value", None)
+                if fill is None:
+                    fill = var._attributes.get("_FillValue", None)
                 if fill is not None and base.kind == "f":
                     fill = float(fill)
                 if fill is not None and base.kind == "i":

@martindurant
Copy link
Member

Let's make it a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants