
inconsistent fill value on hdf5 files #105

Closed
keewis opened this issue Nov 30, 2021 · 8 comments
@keewis
Contributor

keewis commented Nov 30, 2021

For some reason I have files (model output) where the hdf5 fillvalue property is different from the _FillValue attribute:

```python
import h5py

f = h5py.File("<file>", mode="r")
# the dataset-level fill value and the _FillValue attribute disagree
f["latitude"].fillvalue != f["latitude"].attrs["_FillValue"]  # significantly different
```

While xarray's netcdf4 engine will ignore fillvalue and use only _FillValue (?), kerchunk will copy both over, translating fillvalue to zarr's fill_value:

kerchunk/kerchunk/hdf.py

Lines 186 to 192 in 1c6add1

```python
za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
                                dtype=h5obj.dtype,
                                chunks=h5obj.chunks or False,
                                fill_value=h5obj.fillvalue,
                                compression=compression,
                                filters=filters,
                                overwrite=True)
```

However, if I open the resulting file, the zarr engine will either complain about multiple fill values, or the opened dataset won't have missing values (i.e. xr.testing.assert_identical(ds_nc, ds_zarr) fails but xr.testing.assert_identical(ds_nc, ds_zarr.where(ds_nc.notnull())) doesn't). Not sure why it doesn't always raise, though.

Am I missing something? If not, is there a way to tell kerchunk to use only _FillValue?

@martindurant
Member

At the combine stage, we allow a preprocessor argument to modify datasets. We could have something similar for the individual files; having said that, loading the produced JSON and removing the offending tags should not be too difficult.
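For anyone hitting this before a built-in fix lands, here is a minimal sketch of that "load the produced JSON and fix the offending tags" approach. It assumes the standard kerchunk reference layout (a "refs" dict mapping keys like "latitude/.zarray" to JSON strings); the function name and the toy input are made up for illustration:

```python
import json

def fix_fill_values(data):
    """Overwrite each variable's zarr fill_value with its _FillValue attribute,
    when the two disagree. `data` is a kerchunk-style reference dict."""
    refs = data["refs"]
    for key in list(refs):
        if not key.endswith("/.zarray"):
            continue
        prefix = key[: -len(".zarray")]  # e.g. "latitude/"
        zarray = json.loads(refs[key])
        zattrs = json.loads(refs.get(prefix + ".zattrs", "{}"))
        fill = zattrs.get("_FillValue")
        if fill is not None and zarray.get("fill_value") != fill:
            zarray["fill_value"] = fill
            refs[key] = json.dumps(zarray)
    return data

# tiny inline example mimicking the layout of a kerchunk output
data = {
    "refs": {
        "latitude/.zarray": json.dumps({"fill_value": 0.0}),
        "latitude/.zattrs": json.dumps({"_FillValue": -999.0}),
    }
}
fixed = fix_fill_values(data)
# "latitude/.zarray" now carries fill_value == -999.0
```

In practice you'd read the JSON produced by translate() from disk, run it through a function like this, and write it back out.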

@keewis
Contributor Author

keewis commented Nov 30, 2021

sure, postprocessing the individual files is not too difficult, and I'll do that for now.

However, since the files I investigated are from multiple sources (I think?), it might be good to also catch this here. Am I correct in assuming that these values should not be out of sync?

@lsterzinger
Collaborator

@cgentemann this seems similar to the problem you were having a while back with the fill values, did you ever figure that out?

@martindurant
Member

I have a feeling that, in general, there can be many things that are wrong/inconsistent in original data files! I don't know that we can cover them all, but perhaps we can auto-correct common issues. The ability to add custom processing is more powerful, however, and I suspect that many datasets will require some form of custom processing. In fact, I think that kerchunk workflows (in pangeo-forge or not) are an opportunity to apply those things so that users don't have to.

@keewis
Contributor Author

keewis commented Nov 30, 2021

right, that makes sense. I was thinking that rather than postprocessing the output of .translate(), it would be much easier to postprocess with the variable's zarr object, but of course the details of hooks like that are tricky to get right.

for reference, here's my hacky postprocessing function:

```python
import itertools

import ujson


def correct_fill_values(data):
    def fix_variable(values):
        zattrs = values[".zattrs"]

        if "_FillValue" not in zattrs:
            return values

        _FillValue = zattrs["_FillValue"]
        if values[".zarray"]["fill_value"] != _FillValue:
            values[".zarray"]["fill_value"] = _FillValue

        return values

    refs = data["refs"]
    prepared = (
        (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
    )
    filtered = (
        (key, ujson.loads(value))
        for key, value in prepared
        if key[1] in (".zattrs", ".zarray")
    )
    key = lambda i: i[0][0]
    grouped = (
        (name, {n[1]: v for n, v in group})
        for name, group in itertools.groupby(sorted(filtered, key=key), key=key)
    )
    fixed = ((name, fix_variable(var)) for name, var in grouped)
    flattened = {
        f"{name}/{item}": ujson.dumps(data, indent=4)
        for name, var in fixed
        for item, data in var.items()
    }
    data["refs"] = dict(sorted((refs | flattened).items()))
    return data
```

@martindurant
Member

> it would be much easier to postprocess with the variable's zarr object

Yes, totally agree with this too. We could have a bunch of optional processing functions. As you say, it's tricky, because (for instance) processing the zarr object might not be quite the same as processing the xarray view of the same thing.

@keewis
Contributor Author

keewis commented Aug 22, 2023

should this have been closed by #181?

@martindurant
Member

Hope so :)
