2022 01 31 webex j ftwg
# 01/31/22 Webex notes for joint FT/Sessions WGs meeting
Attending: Howard Pritchard, Thomas Hines, Trupeshkumar Patel, Aurelien Bouteiller, Isais Urena, Martin Schulz, Ignacio Laguna, Grace Nansamba
- https://github.com/mpi-forum/mpi-standard/pull/644
- General discussions about fault tolerance in sessions
Aurelien rewrote part of the terms to remove "the associated operation has completed". Dan wasn't present today, so he could not clarify his points.
How are error handlers handled in Sessions? Does Sessions obey the initial error handler? Yes. This is unchanged from the World model and its initial error handler. An error handler has to be supplied as part of session init. This error handler gets invoked via a degree of indirection for the group-from-session-pset function (MPI_Group_from_session_pset). Note this requires wording in 644.
Discussed how errors are handled before a communicator is created. All failures are local until a communicator is created to connect the processes. Howard gave the example of using the World model before MPI_Init is called. Should double check verbiage around initial error handler and how that works.
How do we look at communicator objects when there are failures? In ULFM they have the shrink capability. Want to move to a more generic approach that also allows for growing. Need to have an operation on a session handle to recover. These operations would need to be collective in nature (Aurelien). Howard talked about the model of throwing everything away and starting over. If one did need to revoke ULFM-style, that implies that something will go wrong with group management based on groups created from group_from_session_pset.
Discuss ULFM PR https://github.com/mpi-forum/mpi-standard/pull/13 in this context.
- Should double check verbiage around initial error handler and how that works.