Extension to the HDF5 chunks API #309

davidhassell · 2024-08-06T14:00:49Z

Currently (v1.11.1.0), the treatment of HDF5 chunking is inadequate in several ways:

Chunking can only be set on a per-Data object basis
Chunking can only be defined by explicitly setting the chunk shape on each axis
Chunking is ignored in an output file unless native compression is on
Chunks from an input file are not stored

A more comprehensive and flexible API is needed:

cfdm.write should chunk by default, and have a keyword argument (hdf5_chunks) to configure the default chunking.
cfdm.read should, by default, store HDF5 chunking on the returned data, so that it will be used when writing out to a new netCDF4 file.
Setting an HDF5 chunking strategy should be more intuitive. E.g. it should be easy to "chunk the time axis by 12 elements, leaving all other axes unchunked": f.nc_set_hdf_chunksizes({'T': 12})
Setting HDF5 chunksizes should follow the Dask API for defining its computational chunk sizes. E.g. f.nc_set_hdf_chunksizes("8 MiB")
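
To make the two proposed forms concrete, here is a minimal sketch of how a per-axis dictionary such as {'T': 12} and a byte-size string such as "8 MiB" might each be resolved to an explicit chunk shape. All helper names (chunks_from_axis_sizes, parse_byte_size, chunks_from_byte_limit) are hypothetical and are not part of cfdm. The halving strategy is only a simple stand-in chosen for illustration; it is not the algorithm that Dask or cfdm uses.

```python
import math

# Hypothetical helpers for illustration only; none of these are cfdm functions.

_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def chunks_from_axis_sizes(shape, axes, sizes):
    """Resolve a per-axis dictionary to a full chunk shape.

    Axes that are not named in `sizes` are left unchunked, i.e. their
    chunk spans the whole axis.
    """
    unknown = set(sizes) - set(axes)
    if unknown:
        raise ValueError(f"Unknown axes: {sorted(unknown)}")
    # A chunk can never be longer than the axis it belongs to
    return tuple(min(sizes.get(name, n), n) for name, n in zip(axes, shape))


def parse_byte_size(text):
    """Convert a string such as '8 MiB' to a number of bytes."""
    number, unit = text.split()
    return int(float(number) * _UNITS[unit])


def chunks_from_byte_limit(shape, itemsize, limit):
    """Shrink a chunk shape until one chunk fits within `limit` bytes.

    Assumed strategy: repeatedly halve the longest chunk dimension.
    """
    chunks = list(shape)
    while math.prod(chunks) * itemsize > limit and max(chunks) > 1:
        i = chunks.index(max(chunks))
        chunks[i] = math.ceil(chunks[i] / 2)
    return tuple(chunks)


# "Chunk the time axis by 12 elements, leaving all other axes unchunked"
print(chunks_from_axis_sizes((120, 73, 96), ("T", "Y", "X"), {"T": 12}))
# (12, 73, 96)

# Limit each chunk of a float64 array to at most 8 MiB
print(chunks_from_byte_limit((1200, 73, 96), 8, parse_byte_size("8 MiB")))
# (75, 73, 96)
```

Either form ends up as the same kind of explicit per-axis chunk shape that currently has to be set by hand, which is why both can sit behind a single method.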

PR to follow.


davidhassell added the enhancement, performance, netCDF write, and netCDF read labels Aug 6, 2024

davidhassell mentioned this issue Aug 6, 2024

Extension to the HDF5 chunks API #310

Merged

davidhassell closed this as completed in #310 Aug 31, 2024
