Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty unlimited chunked variables cause crash #67

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

bnlawrence
Copy link

If ones has created a variable which is intended to be chunked, but it is currently empty, when the file is read, we get a stack dump that ends with this:

    def _read_node_header(self, offset, node_level):
        """ Return a single node header in the b-tree located at a give offset. """
>       self.fh.seek(offset)
E       ValueError: cannot fit 'int' into an offset-sized integer

This pull request includes code to create a file which manifests the problem, a test to expose it, and a fix.

@@ -480,6 +480,10 @@ def _get_contiguous_data(self, property_offset):

def _get_chunked_data(self, offset):
""" Return data which is chunked. """

Copy link
Collaborator

@bmaranville bmaranville Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will work - it is also possible to test if the chunk address is UNDEFINED_ADDRESS, which will happen when no data has been written to the Dataset yet (see change in usnistgov/jsfive@f228420 , which I should have backported to pyfive)

EDIT - I think maybe the test for UNDEFINED_ADDRESS is important here because sometimes you encounter datasets with non-zero shapes but which have not been written yet (initializing a dataset and writing data to it are two separate steps).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems sensible. At this point I'm not minded to follow through fixing this here, as where it gets done in the new H5D.py will be slightly different. What I have there is:

# look out for an empty dataset, which will have no btree
if np.prod(self.shape) == 0 or dataobject._chunk_address == UNDEFINED_ADDRESS:
    self._index = {}
    return

(This is in the context of caching the b-tree when we instantiate a DatasetID, which we do when we create a variable instance with eg. `x=myfile['variable']. We do that at this point so that all threads in a thread pool have their b-tree before they get going on their bit of work.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants