
inconsistent fill value on hdf5 files #105

Closed
keewis opened this issue Nov 30, 2021 · 8 comments
@keewis
Contributor

keewis commented Nov 30, 2021

For some reason I have files (model output) where the hdf5 fillvalue property is different from the _FillValue attribute:

```python
import h5py

f = h5py.File("<file>", mode="r")
# the dataset-level fill value and the _FillValue attribute disagree
f["latitude"].fillvalue != f["latitude"].attrs["_FillValue"]  # significantly different
```

While xarray's netcdf4 engine will ignore fillvalue and use only _FillValue (?), kerchunk will copy both over, translating fillvalue to zarr's fill_value:

kerchunk/kerchunk/hdf.py

Lines 186 to 192 in 1c6add1

```python
za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
                                dtype=h5obj.dtype,
                                chunks=h5obj.chunks or False,
                                fill_value=h5obj.fillvalue,
                                compression=compression,
                                filters=filters,
                                overwrite=True)
```

However, if I open the resulting file, the zarr engine will either complain about multiple fill values, or the opened dataset won't have missing values (i.e. xr.testing.assert_identical(ds_nc, ds_zarr) fails but xr.testing.assert_identical(ds_nc, ds_zarr.where(ds_nc.notnull())) doesn't). Not sure why it doesn't always raise, though.

Am I missing something? If not, is there a way to tell kerchunk to use only _FillValue?

@martindurant
Member

At the combine stage, we allow a preprocessor argument to modify datasets. We could have something similar for the individual files; having said that, loading the produced JSON and removing the offending tags should not be too difficult.
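For anyone hitting this before a built-in fix lands, here is a minimal sketch of that "load the produced JSON and fix the offending tags" approach. It assumes the standard kerchunk reference layout (a "refs" dict mapping keys like "latitude/.zarray" to JSON strings); the function name and the toy input are made up for illustration:

```python
import json

def fix_fill_values(data):
    """Overwrite each variable's zarr fill_value with its _FillValue attribute,
    when the two disagree. `data` is a kerchunk-style reference dict."""
    refs = data["refs"]
    for key in list(refs):
        if not key.endswith("/.zarray"):
            continue
        prefix = key[: -len(".zarray")]  # e.g. "latitude/"
        zarray = json.loads(refs[key])
        zattrs = json.loads(refs.get(prefix + ".zattrs", "{}"))
        fill = zattrs.get("_FillValue")
        if fill is not None and zarray.get("fill_value") != fill:
            zarray["fill_value"] = fill
            refs[key] = json.dumps(zarray)
    return data

# tiny inline example mimicking the layout of a kerchunk output
data = {
    "refs": {
        "latitude/.zarray": json.dumps({"fill_value": 0.0}),
        "latitude/.zattrs": json.dumps({"_FillValue": -999.0}),
    }
}
fixed = fix_fill_values(data)
# "latitude/.zarray" now carries fill_value == -999.0
```

In practice you'd read the JSON produced by translate() from disk, run it through a function like this, and write it back out.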

@keewis
Contributor Author

keewis commented Nov 30, 2021

sure, postprocessing the individual files is not too difficult, and I'll do that for now.

However, since the files I investigated are from multiple sources (I think?), it might be good to also catch this here. Am I correct in assuming that these values should not be out of sync?

@lsterzinger
Collaborator

@cgentemann this seems similar to the problem you were having a while back with the fill values, did you ever figure that out?

@martindurant
Member

I have a feeling that, in general, there can be many things that are wrong/inconsistent in original data files! I don't know that we can cover them all, but perhaps we can auto-correct common issues. The ability to add custom processing is more powerful, however, and I suspect that many datasets will require some form of custom processing. In fact, I think that kerchunk workflows (in pangeo-forge or not) are an opportunity to apply those things so that users don't have to.

@keewis
Contributor Author

keewis commented Nov 30, 2021

right, that makes sense. I was thinking that rather than postprocessing the output of .translate(), it would be much easier to postprocess with the variable's zarr object, but of course the details of hooks like that are tricky to get right.

for reference, here's my hacky postprocessing function:

```python
import itertools

import ujson


def correct_fill_values(data):
    def fix_variable(values):
        zattrs = values[".zattrs"]

        if "_FillValue" not in zattrs:
            return values

        _FillValue = zattrs["_FillValue"]
        if values[".zarray"]["fill_value"] != _FillValue:
            values[".zarray"]["fill_value"] = _FillValue

        return values

    refs = data["refs"]
    prepared = (
        (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
    )
    filtered = (
        (key, ujson.loads(value))
        for key, value in prepared
        if key[1] in (".zattrs", ".zarray")
    )
    key = lambda i: i[0][0]
    grouped = (
        (name, {n[1]: v for n, v in group})
        for name, group in itertools.groupby(sorted(filtered, key=key), key=key)
    )
    fixed = ((name, fix_variable(var)) for name, var in grouped)
    flattened = {
        f"{name}/{item}": ujson.dumps(data, indent=4)
        for name, var in fixed
        for item, data in var.items()
    }
    data["refs"] = dict(sorted((refs | flattened).items()))
    return data
```

@martindurant
Member

> it would be much easier to postprocess with the variable's zarr object

Yes, totally agree with this too. We could have a bunch of optional processing functions. As you say, it's tricky, because (for instance) processing the zarr object might not be quite the same as processing the xarray view of the same thing.

@keewis
Contributor Author

keewis commented Aug 22, 2023

should this have been closed by #181?

@martindurant
Member

Hope so :)
