inconsistent fill value on hdf5 files #105
Comments
At the combine stage, we allow a preprocessor argument to modify datasets. We could have something similar for the individual files; having said that, loading the produced JSON and removing the offending tags should not be too difficult.
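For illustration, a rough sketch of what hooking into the combine stage could look like, assuming `MultiZarrToZarr`'s `preprocess` callable receives and returns a single file's references mapping (the exact contract may differ between kerchunk versions); the `sync_fill_values` helper, the file names, and the concat dim are made up:

```python
import ujson
from kerchunk.combine import MultiZarrToZarr


def sync_fill_values(refs):
    # Hypothetical preprocess hook: for every array, make the zarr-level
    # fill_value agree with the _FillValue attribute, if one is present.
    for key, value in list(refs.items()):
        if not key.endswith("/.zarray"):
            continue
        zattrs_key = key[: -len(".zarray")] + ".zattrs"
        if zattrs_key not in refs:
            continue
        zattrs = ujson.loads(refs[zattrs_key])
        if "_FillValue" not in zattrs:
            continue
        zarray = ujson.loads(value)
        zarray["fill_value"] = zattrs["_FillValue"]
        refs[key] = ujson.dumps(zarray)
    return refs


mzz = MultiZarrToZarr(
    ["file1.json", "file2.json"],  # per-file kerchunk references
    concat_dims=["time"],
    preprocess=sync_fill_values,
)
combined = mzz.translate()
```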
sure, postprocessing the individual files is not too difficult, and I'll do that for now. However, since the files I investigated are from multiple sources (I think?), it might be good to also catch this here. Am I correct in assuming that these values should not be out of sync?
@cgentemann this seems similar to the problem you were having a while back with the fill values; did you ever figure that out?
I have a feeling that, in general, there can be many things that are wrong or inconsistent in original data files! I don't know that we can cover them all, but perhaps we can auto-correct common issues. The ability to add custom processing is more powerful, however, and I suspect that many datasets will require some form of custom processing. In fact, I think that kerchunk workflows (in pangeo-forge or not) are an opportunity to apply those things so that users don't have to.
right, that makes sense. I was thinking that rather than postprocessing the output of … For reference, here's my hacky postprocessing function:

```python
import itertools

import ujson


def correct_fill_values(data):
    def fix_variable(values):
        # Make the zarr fill_value match the _FillValue attribute.
        if ".zattrs" not in values or ".zarray" not in values:
            return values
        zattrs = values[".zattrs"]
        if "_FillValue" not in zattrs:
            return values
        _FillValue = zattrs["_FillValue"]
        if values[".zarray"]["fill_value"] != _FillValue:
            values[".zarray"]["fill_value"] = _FillValue
        return values

    refs = data["refs"]
    # Keep only the per-variable metadata keys, e.g. "var/.zarray".
    prepared = (
        (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
    )
    filtered = (
        (key, ujson.loads(value))
        for key, value in prepared
        if key[1] in (".zattrs", ".zarray")
    )
    # Group the metadata documents by variable name.
    key = lambda i: i[0][0]
    grouped = (
        (name, {n[1]: v for n, v in group})
        for name, group in itertools.groupby(sorted(filtered, key=key), key=key)
    )
    fixed = ((name, fix_variable(var)) for name, var in grouped)
    # Serialize the corrected metadata and merge it back into the references.
    flattened = {
        f"{name}/{item}": ujson.dumps(data, indent=4)
        for name, var in fixed
        for item, data in var.items()
    }
    data["refs"] = dict(sorted((refs | flattened).items()))
    return data
```
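Applied to a single-file reference set, it looks something like this (a sketch; the file names are placeholders):

```python
import ujson

# Load the references produced for one file, fix the fill values,
# and write the corrected references back out.
with open("refs.json") as f:
    refs = ujson.load(f)

with open("refs_fixed.json", "w") as f:
    f.write(ujson.dumps(correct_fill_values(refs)))
```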
Yes, totally agree with this too. We could have a bunch of optional processing functions. As you say, it's tricky, because (for instance) processing the zarr object might not be quite the same as processing the xarray view of the same thing.
should this have been closed by #181? |
Hope so :) |
For some reason I have files (model output) where the hdf5 `fillvalue` property is different from the `_FillValue` attribute:

While `xarray`'s `netcdf4` engine will ignore `fillvalue` and use only `_FillValue` (?), `kerchunk` will copy both over, translating `fillvalue` to `zarr`'s `fill_value`:

kerchunk/kerchunk/hdf.py, lines 186 to 192 in 1c6add1

However, if I open the resulting file, the `zarr` engine will either complain about multiple fill values or the opened dataset won't have missing values (i.e. `xr.testing.assert_identical(ds_nc, ds_zarr)` fails but `xr.testing.assert_identical(ds_nc, ds_zarr.where(ds_nc.notnull()))` doesn't). Not sure why it doesn't always raise, though.

Am I missing something? If not, is there a way to tell `kerchunk` to use only `_FillValue`?