Empty unlimited chunked variables cause crash #67

Status: Open. Wants to merge 5 commits into base branch `master`.
4 changes: 4 additions & 0 deletions pyfive/dataobjects.py
@@ -480,6 +480,10 @@ def _get_contiguous_data(self, property_offset):

def _get_chunked_data(self, offset):
""" Return data which is chunked. """

@bmaranville (Collaborator) commented on Jan 7, 2025:
I think this will work - it is also possible to test if the chunk address is UNDEFINED_ADDRESS, which will happen when no data has been written to the Dataset yet (see change in usnistgov/jsfive@f228420 , which I should have backported to pyfive)

EDIT: I think the test for UNDEFINED_ADDRESS is important here because you sometimes encounter datasets with non-zero shapes to which no data has been written yet (initializing a dataset and writing data to it are two separate steps).
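That two-step behaviour is easy to reproduce; a minimal sketch (assuming h5py is available; the filename is hypothetical):

```python
# Sketch: a chunked dataset can have a non-zero shape before any data is
# written to it; until the first write, no chunk B-tree exists on disk.
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "unwritten.hdf5")
with h5py.File(path, "w") as f:
    # Step 1: initialize only. This allocates metadata, not chunk data.
    f.create_dataset("foo", shape=(4, 1), maxshape=(4, None),
                     dtype="<f8", chunks=(4, 1))
    # Step 2 (writing, e.g. f["foo"][...] = data) never happens here.

with h5py.File(path, "r") as f:
    print(f["foo"].shape)   # non-zero shape, yet no chunks were written
    print(f["foo"][:])      # reads return the fill value (zeros by default)
```

A reader that assumes every non-empty chunked dataset has a B-tree will crash on files like this.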

Author (Collaborator) replied:
That seems sensible. At this point I'm not minded to fix this here, as the place where it gets done in the new H5D.py will be slightly different. What I have there is:

```python
# look out for an empty dataset, which will have no btree
if np.prod(self.shape) == 0 or dataobject._chunk_address == UNDEFINED_ADDRESS:
    self._index = {}
    return
```

(This is in the context of caching the b-tree when we instantiate a DatasetID, which we do when we create a variable instance with e.g. `x = myfile['variable']`. We do that at this point so that all threads in a thread pool have their b-tree before they get going on their bit of work.)
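The eager-caching pattern described in that comment can be sketched with a toy stand-in (class, attribute, and sentinel names here are hypothetical, loosely modelled on the DatasetID being discussed):

```python
from concurrent.futures import ThreadPoolExecutor

UNDEFINED_ADDRESS = -1  # stand-in for the real sentinel value


class DatasetID:
    """Toy stand-in: build the chunk index eagerly, at instantiation."""

    def __init__(self, shape, chunk_address):
        # Empty or never-written datasets have no B-tree to walk,
        # so their index is simply empty.
        if 0 in shape or chunk_address == UNDEFINED_ADDRESS:
            self._index = {}
        else:
            self._index = {(0,) * len(shape): chunk_address}


# The index exists before any worker thread starts, so the threads
# only ever read shared state; none of them races to build it.
ds = DatasetID(shape=(4, 1), chunk_address=UNDEFINED_ADDRESS)
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(lambda _: len(ds._index), range(4)))
print(sizes)  # [0, 0, 0, 0]
```

The real code tests `np.prod(self.shape) == 0` rather than `0 in shape`; the toy keeps the sketch dependency-free.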

if np.prod(self.shape) == 0:
    return np.empty(self.shape, dtype=self.dtype)

self._get_chunk_params()
chunk_btree = BTreeV1RawDataChunks(
    self.fh, self._chunk_address, self._chunk_dims)
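The emptiness guard in the hunk above hinges on `np.prod` returning 0 for any shape that contains a zero extent; a quick illustration:

```python
import numpy as np

# An unlimited dimension with no records written yields a zero extent.
shape = (4, 0)
print(np.prod(shape))                      # 0: the dataset holds no elements
print(np.empty(shape, dtype="<f8").shape)  # (4, 0): a valid, element-free array

# Note: np.prod(()) == 1.0, so a scalar dataset (shape ()) is not
# treated as empty, which is the desired behaviour here.
```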
Binary file added tests/h5netcdf_test.hdf5
Binary file not shown.
18 changes: 18 additions & 0 deletions tests/make_netcdf_unlimited.py
@@ -0,0 +1,18 @@
#! /usr/bin/env python
""" Create a netcdf file with an unlimited dimension, but no data """

import netCDF4
import numpy as np

f = netCDF4.Dataset('netcdf4_empty_unlimited.nc', 'w')
f.createDimension('x', 4)
f.createDimension('unlimited', None) # Unlimited dimension
v = f.createVariable("foo_unlimited", float, ("x", "unlimited"))
f.close()

f = netCDF4.Dataset('netcdf4_unlimited.nc', 'w')
f.createDimension('x', 4)
f.createDimension('unlimited', None) # Unlimited dimension
v = f.createVariable("foo_unlimited", float, ("x", "unlimited"))
v[:] = np.ones((4,1))
f.close()
Binary file added tests/netcdf4_empty_unlimited.nc
Binary file not shown.
Binary file added tests/netcdf4_unlimited.nc
Binary file not shown.
43 changes: 43 additions & 0 deletions tests/test_netcdf_unlimited.py
@@ -0,0 +1,43 @@
""" Unit tests for pyfive's ability to read a NetCDF4 Classic file with an unlimited dimension"""
import os
import warnings

import numpy as np
from numpy.testing import assert_array_equal

import pyfive

DIRNAME = os.path.dirname(__file__)
NETCDF4_UNLIMITED_FILE = os.path.join(DIRNAME, 'netcdf4_unlimited.nc')
NETCDF4_EMPTY_UNLIMITED_FILE = os.path.join(DIRNAME, 'netcdf4_empty_unlimited.nc')
H5NETCDF_FILE = os.path.join(DIRNAME, 'h5netcdf_test.hdf5')

def test_read_netcdf4_unlimited():
    """ This works. """

    with pyfive.File(NETCDF4_UNLIMITED_FILE) as hfile:

        # dataset
        var1 = hfile['foo_unlimited']
        assert var1.dtype == np.dtype('<f8')
        assert_array_equal(var1[:], np.ones((4, 1)))

def test_read_netcdf4_empty_unlimited():
    """ This does not work currently. Why not? """
    # This is one example of the sort of problem we see with the h5netcdf file.
    with pyfive.File(NETCDF4_EMPTY_UNLIMITED_FILE) as hfile:

        # dataset
        var1 = hfile['foo_unlimited']
        assert var1.dtype == np.dtype('<f8')
        print(var1[:])

def test_h5netcdf_file():
    """ This doesn't work either. Why not? """

    with pyfive.File(H5NETCDF_FILE) as hfile:

        # dataset
        var1 = hfile['empty']
        print(var1.shape)
        print(var1[:])