
Some changes to HDF5 code needed to support H5Pset_all_coll_metadata_ops() call for parallel I/O #781

Closed
edhartnett opened this issue Jan 18, 2018 · 17 comments

@edhartnett
Contributor

In the 4.5.0 release @gsjaardema contributed a performance enhancement PR which included a call to H5Pset_all_coll_metadata_ops() to make HDF5 metadata read operations collective.

I see in the documentation for this function (https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllCollMetadataOps):

Warning:
As noted above, corruption will be introduced into the metadata cache and HDF5 Library behavior will be undefined when both of the following conditions exist:
A file is created or opened with a file access property list in which the collective metadata I/O property is set to TRUE.
Any function is called that triggers an independent metadata read while the file remains open with that file access property list.
An approach that avoids this corruption risk is described above.

My concern is that the default access mode is independent. (An unfortunate choice on my part. Collective would have been a better default.) And, as also noted in the documentation:

Users must be aware that several HDF5 operations can potentially issue metadata reads. These include opening a dataset, datatype, or group; reading an attribute; or issuing a get info call such as getting information for a group with H5Gget_info. Collective I/O requirements must be kept in mind when issuing such calls in the context of parallel I/O.

So don't we have to ensure we are never trying any independent metadata read operations or risk file corruption?
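Concretely, the risky pattern the warning describes might look something like the following sketch (this is illustrative, not code from netCDF; the file name "data.h5" and dataset "/var" are hypothetical; build against parallel HDF5, e.g. with h5pcc, and run under mpirun):

```c
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* The property from the warning: all metadata reads on this file
       must now be issued collectively by every rank. */
    H5Pset_all_coll_metadata_ops(fapl, 1);

    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);

    /* UNDEFINED BEHAVIOR per the warning: only rank 0 calls H5Dopen2,
       which may trigger a metadata read that the other ranks never
       participate in. */
    if (rank == 0) {
        hid_t dset = H5Dopen2(file, "/var", H5P_DEFAULT);
        H5Dclose(dset);
    }

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```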

That's not very "groovy", as the kids would say these days.

All this is under investigation because turning on the collective metadata write makes the PIO Fortran tests break. I have not yet been able to recreate the problem in C. The PIO issue is here:
NCAR/ParallelIO#1070.

@gsjaardema do you have any thoughts on this?

It should also be noted that I can't seem to get any failures or problems from this with a bunch of new C tests. But when the HDF5 team warns of possible corruption, that does not mean it occurs 100% of the time. Maybe my test cases are just not activating the bug yet.

@edhartnett
Contributor Author

@gsjaardema did you see this issue?

@gsjaardema
Contributor

I don't have any good thoughts on this. We haven't seen the corruption, but we only use the NetCDF C API (although we do use Fortran, our Fortran API to Exodus calls the NetCDF C API).

I do know that at large processor counts, the H5Pset_all_coll_metadata_ops() makes HDF5 usable. For an example of the optimization, see https://www.hdfgroup.org/2015/08/parallel-and-large-scale-simulation-enhancements-to-cgns/. When scaling out to 500,000 processors, we were seeing times of "days" vs "minutes" with the optimization enabled.

@edhartnett
Contributor Author

edhartnett commented Feb 2, 2018

OK, so does that imply that we should give the user control over whether H5Pset_all_coll_metadata_ops() is set?

Perhaps a mode flag setting?

Because right now it is used for all parallel file access.

@edhartnett
Contributor Author

@gsjaardema another thought is that perhaps it's time to switch to more advanced I/O features. The new PIO/NetCDF integration (now available as HPC NetCDF, v4.6.0.3 at https://github.com/HPC-NetCDF/netcdf-c) contains the PIO library.

This allows you to specify that N processors will do all I/O. Other processors will (transparently to you) send the data to those N processors, who will do all the I/O under the covers. This will allow you to use parallel I/O on a reasonable number of processors.
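As a rough illustration of designating N I/O tasks (hedged: this sketches the PIO C API as documented at the time, and all parameter values are made up for the example):

```c
#include <mpi.h>
#include <pio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int iosysid;
    /* Every 16th rank, starting at rank 0, becomes one of 64 I/O
       tasks; the remaining ranks send their data to those tasks
       transparently. */
    PIOc_Init_Intracomm(MPI_COMM_WORLD, 64 /* num_iotasks */,
                        16 /* stride */, 0 /* base */,
                        PIO_REARR_BOX, &iosysid);

    /* ... create/open files with PIOc_createfile()/PIOc_openfile() ... */

    PIOc_finalize(iosysid);
    MPI_Finalize();
    return 0;
}
```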

@gsjaardema
Contributor

We are definitely planning to look into PIO/NetCDF.

@gsjaardema
Contributor

I am fine with having H5Pset_all_coll_metadata_ops() set via a flag, although it would be good to know why it is failing on only that one test. That may be an indication of some other, deeper issue since, as I understand it, it isn't expected to fail.

@edhartnett
Contributor Author

I will continue to investigate before introducing a flag. All my netCDF/PIO C tests work; the problem is in one Fortran test of PIO that is doing something weird. (But it nonetheless should still work.)

The PIO/NetCDF merger is available as HPC NetCDF:
https://github.com/HPC-NetCDF/netcdf-c

v 4.6.0.3 contains all of your and wkliao's outstanding PRs, plus all of mine, plus the PIO merged code. Better documentation coming in future releases.

@DennisHeimbigner
Collaborator

If we can figure out the conditions under which collective is failing, then it is possible we may be able to detect it at runtime and not use collective.
If we had a complete list of the HDF5 operations that "can potentially issue metadata reads", then we might also be able to protect those operations.

@edhartnett
Contributor Author

A complete list is provided in the HDF5 docs. Protecting from them will require some effort, but I will probably undertake it. Essentially, instead of using collective metadata operations in the file level property list, you use it in lots of the other property lists. This more fine-grained control allows us to avoid the case they are warning about.
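For illustration, the fine-grained alternative might look something like this sketch (not the eventual netCDF patch; it assumes an already-open parallel file handle `file`, and "/var" is a hypothetical dataset name):

```c
/* Request collective metadata ops per object rather than file-wide,
   by setting the property on the dataset access property list. */
hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
H5Pset_all_coll_metadata_ops(dapl, 1);

/* The open is now collective for this object, so every rank in the
   file's communicator must make this call. */
hid_t dset = H5Dopen2(file, "/var", dapl);

/* ... use the dataset ... */
H5Dclose(dset);
H5Pclose(dapl);
```

Operations on objects opened with default property lists remain independent, which avoids the corruption case the HDF5 documentation warns about.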

I am actively investigating and may have reproduced this in a PIO test. (Or maybe just have a test bug. Still checking.)

I will update as I know more.

@DennisHeimbigner
Collaborator

After thinking about it some, I think that protecting the offending HDF5 calls is dangerous: it assumes the list is both complete and will be updated promptly.
Perhaps we need to revisit the question of why non-collective is so slow.
Greg, do you have any insight into this? You indicated the difference was '"days" vs "minutes"' with the optimization enabled.

@edhartnett
Contributor Author

edhartnett commented Feb 2, 2018

I don't believe the list of affected functions will change - it consists of the functions that don't accept a property list. That includes some older functions, but the HDF5 programmers have since included property lists throughout the HDF5 API. So I don't think that will be a problem, and I believe that converting to what they recommend as safe is going to be better than doing what they say we should not do and hoping it's not causing a problem. ;-)

Doing independent access from large numbers of cores is going to be slow, because each core generates its own disk access instead of the accesses being grouped into larger page-sized reads. Or, if all cores are reading the same bytes, independent access causes the read to happen repeatedly instead of once. @gsjaardema please chime in here if I've got it wrong.
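The "read once instead of repeatedly" idea can be sketched with a broadcast (illustrative only; read_header() is a hypothetical stand-in for one independent disk read of shared metadata):

```c
#include <mpi.h>
#include <stdio.h>

static long read_header(void)
{
    /* Stand-in for a single disk read of metadata all ranks need. */
    return 42L;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long header = 0;
    if (rank == 0)
        header = read_header();   /* one disk access... */
    /* ...shared with everyone, instead of N identical reads. */
    MPI_Bcast(&header, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    printf("rank %d sees header %ld\n", rank, header);
    MPI_Finalize();
    return 0;
}
```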

NetCDF/PIO will provide a good solution here by allowing the user to use parallel I/O from 1000 cores, but not from all 500K.

Also want to stress that I'm still not sure that what I am working on is actually a netCDF bug. I just know it is activated because of this collective metadata call. This may turn out to be a bug in PIO or a bug in the PIO test. I will get it fully figured out before I make any changes to netCDF code.

@DennisHeimbigner
Collaborator

Ok, that makes sense.

@edhartnett
Contributor Author

I have not forgotten this but would like to do it after #856, which will separate the HDF5 code from the libsrc4 code. Then I will make the changes needed for H5Pset_all_coll_metadata_ops() to work within the HDF5 team guidelines.

@edhartnett edhartnett changed the title Concern about use of H5Pset_all_coll_metadata_ops() function... Some changes to HDF5 code needed to support H5Pset_all_coll_metadata_ops() call for parallel I/O Mar 13, 2018
@edhartnett
Contributor Author

An update: the original bug that triggered this issue seems to be in HDF5. There is a fix underway.

However, we must still make the changes described in this ticket so that we are using HDF5 in accordance with the instructions of the HDF5 team when parallel I/O is involved. Just because we have gotten away with it until now is no reason to close this issue. ;-)

@WardF WardF self-assigned this Aug 30, 2018
@WardF WardF added this to the future milestone Aug 30, 2018
@WardF
Member

WardF commented Aug 30, 2018

Tagged this 'future' since we need to push out a release in short order (so that we can also get the C++ and Fortran interfaces out), and there's no need to get this into that release. That said, it's not meant to mean "sometime in the future", which is how the 'future' milestone is often used. Thanks!

@edhartnett
Contributor Author

No worries, I am always in favor of a release!

@edhartnett
Contributor Author

I have examined the code again and I think we are already acting in accordance with the HDF5 documentation with respect to collective metadata access. And, as we have learned, the bug that triggered this ticket was an HDF5 bug, fixed in the 1.10.4 release. So I will close this ticket.
