Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDF5 stride has different semantics than netcdf stride semantics wrt unlimited. #1380

Open
DennisHeimbigner opened this issue Apr 2, 2019 · 24 comments
Assignees
Milestone

Comments

@DennisHeimbigner
Copy link
Collaborator

In PR #1001
we converted the netcdf-4 code to use the HDF5 hyperslab
operators to speed up strided access (stride > 1). This replaces
an older element-at-a-time code implementing strided access.

It appears that the HDF5 stride semantics did not exactly match the older
stride semantics in (at least) the following case:

  1. A variable has an unlimited dimension
  2. The space allocated in the HDF5 dimension for this variable
    is smaller than the current (maximum) UNLIMITED value.
  3. A vars (stride > 1) read is performed that accessed values past
    the allocated space for the variable, but inside the space size implied
    by unlimited.
  4. For some reason, this causes the last legal value to be misread even though
    it has a defined value. Instead, the fillvalue is returned.

The attached set of files shows this. If you use h5dump on the .nc file,
it shows that the temperature variable has 50 defined values. However,
ncdump on that file shows the temperature variable has the unlimited size
of 55 values. So the netcdf program ncRepro_WeirdStrideIssue.c attempts to
read values past the 50 defined values and somehow the netcdf-c library screws up.
The other program is a pure HDF5 program that attempts to do the same thing,
but if the count is set to 19 (as in the netcdf program) the HDF5 program fails
with an index error.

The fix for this could be ugly.

files.zip

@edhartnett
Copy link
Contributor

I would like to fix this.

@edhartnett
Copy link
Contributor

To elaborate further, the performance speedup is important, and this is all part of libhdf5.

@DennisHeimbigner
Copy link
Collaborator Author

Ok, if you have the time. My thinking was to look at the vara code to see
what it does in a similar situation, and assuming vara works correctlry, there should be
some code fragments that handle the unlimited case that we can transfer to
the vars code.

@WardF WardF added this to the 4.7.0 milestone Apr 2, 2019
@edhartnett
Copy link
Contributor

I know what the vara code does. It finds all other datasets that share that unlimited dimension, and finds the maximum extent of the dimension. Then it fakes fill data as needed to support the unlimited dim.

I will take a look at this within the next week and hopefully have a PR with the fix.

@DennisHeimbigner
Copy link
Collaborator Author

Sounds right. Thanks.

@DennisHeimbigner
Copy link
Collaborator Author

Ellen-
There may be a temporary workaround.
In the file libhdf5/hdf5dispatch.c, at about line 58
change the two lines:
NC4_get_vars,
NC4_put_vars,
to
NCDEFAULT_get_vars,
NCDEFAULT_put_vars,

This will revert the code to the previous vars code. However, be warned that
there will be a significant performance penalty.

@edhartnett
Copy link
Contributor

OK, I have managed to duplicate this error in a simple test...

@WardF WardF modified the milestones: 4.7.0, 4.7.1 Apr 30, 2019
@ellenjohnson
Copy link

Hi guys -- is there any update on this? We discovered this bug during an upgrade to netcdf and are blocked until this is resolved, as this bug results in loss of data in this particular workflow. Thanks!

@edhartnett
Copy link
Contributor

edhartnett commented Jul 9, 2019 via email

@ellenjohnson
Copy link

Hi Ed -- I thought you were able to repro the error in a test (your comment from April 6). If not, then see the files.zip attachment in the initial report. Dennis kindly created this issue after I emailed Dennis and Ward to report the bug and I included a standalone C program to reproduce it. The zip archive contains my ncRepro_WeirdStrideIssue.c program plus test file nctest_netcdf4_classic.nc. It also contains tst_h_stride.c which Dennis created to repro it using pure hdf5 code.

@ellenjohnson
Copy link

Any updates on this? Ed, did you see my reply about the repro code?
Question: In Dennis' comments, he mentioned the refactoring that seems to have caused this bug (converting the netcdf-4 code to use HDF5 hyperslab for performance). Was this introduced in 4.6.2 or 4.6.3?

@DennisHeimbigner
Copy link
Collaborator Author

According to the release notes, it was added in 4.6.2-rc1.

@WardF WardF modified the milestones: 4.7.1, 4.7.2 Oct 15, 2019
@WardF WardF modified the milestones: 4.7.2, 4.7.3 Oct 28, 2019
@edhartnett
Copy link
Contributor

I am looking at this again now...

@edhartnett
Copy link
Contributor

OK, @DennisHeimbigner I think you attached the wrong zip file, the one attached is the one related to the endianness issue. So I could still use some help with C code that reproduces this problem...

@DennisHeimbigner
Copy link
Collaborator Author

Sorry, let me attach the correct file.
stridetest.zip

@WardF WardF modified the milestones: 4.7.3, 4.7.4 Feb 21, 2020
@WardF
Copy link
Member

WardF commented Feb 21, 2020

I've been pinged by the Mathworks team on this; before I dive into it, I want to see if anybody has any updates that aren't reflected in the issue?

@edwardhartnett
Copy link
Contributor

Yes, I started to look at this. Seeing as you have identified this as important for the next release, I will take a look first thing Monday morning and see if I can come up with an answer...

@WardF WardF removed this from the 4.7.4 milestone Mar 27, 2020
@WardF WardF added this to the 4.8.0 milestone Mar 27, 2020
@DennisHeimbigner
Copy link
Collaborator Author

This issue has been raised again: #1877

@ellenjohnson
Copy link

Hello everyone! Is there any update on this?
Dennis, i see you mentioned #1877. That's a performance issue. This issue is with a loss of data -- the last value of the strided read is not being retrieved.

@DennisHeimbigner
Copy link
Collaborator Author

Ellen- did you ever verify that the suggestion in this comment above fixed the issue?

#1380 (comment)

@WardF WardF modified the milestones: 4.8.0, 4.8.2 Aug 30, 2021
@ellenjohnson
Copy link

ellenjohnson commented Sep 30, 2021

Hi Dennis and all -- we're revisiting this issue again and I realized I never responded to your question of whether I tried the workaround you suggested:

There may be a temporary workaround.   In the file libhdf5/hdf5dispatch.c, at about line 58  change the two lines:
NC4_get_vars,
NC4_put_vars,
to
NCDEFAULT_get_vars,
NCDEFAULT_put_vars,
This will revert the code to the previous vars code. However, be warned that there will be a significant performance penalty.

I did try this when I was upgrading to 4.7.1, and it didn't fix the issue.

We can try this again with our recently upgraded 4.8.1, but if this temp fix works, I'm concerned about the performance penalty. While we try this, do you have any updates about this? Thanks!

@DennisHeimbigner
Copy link
Collaborator Author

I am surprised that that work-around did not work. That code was in place for a long time.
It means I do not understand the problem at all. Also, that temporary fix does indeed have
a significant performance penalty.

@ellenjohnson
Copy link

Thanks Dennis. We're testing your temp workaround now with 4.8.1 just to verify again it doesn't fix the problem. Stay tuned.

@agnishd
Copy link

agnishd commented Mar 2, 2022

Hi Dennis,

Thanks for telling us the workaround to this issue.

I tried out your suggestion in the 4.8.1 version of the NetCDF library and it works. The data loss doesn't occur.

You did mention there is a significant performance penalty when the patch is applied. We'd like to know if you have plans for addressing the performance issue in a coming release.

@WardF WardF modified the milestones: 4.8.2, 4.9.1 Jun 15, 2022
@WardF WardF modified the milestones: 4.9.1, 4.9.2 Feb 13, 2023
@WardF WardF modified the milestones: 4.9.2, 4.9.3 May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants