
2022 03 14 webex j ftwg

Howard Pritchard edited this page Mar 21, 2022 · 1 revision

# 03/14/22 meeting notes for joint FT/Sessions WGs meeting

Attending: Howard Pritchard, Thomas Hines, Trupeshkumar Patel, Aurelien Bouteiller, Martin Schulz, Grace Nansamba

Agenda items

  • Follow-up to discussions at MPI Forum wrt agreement/consensus

Text added from Miro document

Notes from 3/14/22 zoom meeting

When a user queries a process set, they need to know which version of that process set they are getting from the runtime. There then needs to be a mechanism for processes to agree on a given epoch (or epochs). This is needed even outside of FT handling. ULFM agreement: surviving processes come to agreement on who is still alive. Growing: not so sure whether this covers all cases. Which processes would take part in the fence-like agreement operation? Should this include any new processes in addition to the original processes? Maybe a subset of the processes. Locality of some pset operations is desirable; want to retain that property.
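A minimal sketch of the epoch agreement idea, in plain C with no MPI calls. It assumes each participant reports the epoch (version) of the process set it last saw, and that the participants settle on the newest epoch all of them have observed, i.e. the minimum of the reported values. Both the rule and the names `pset_view_t` and `agree_on_epoch` are hypothetical illustrations, not part of MPI or ULFM; a real implementation would replace the loop with a collective over the participants.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one participant's local view of a process set. */
typedef struct {
    int rank;             /* participant id */
    unsigned long epoch;  /* version of the pset this participant last saw */
} pset_view_t;

/* Assumed rule: agree on the newest epoch that every participant has
 * observed, which is the minimum of the reported epochs. */
unsigned long agree_on_epoch(const pset_view_t *views, size_t n)
{
    assert(views != NULL && n > 0);
    unsigned long agreed = views[0].epoch;
    for (size_t i = 1; i < n; i++) {
        if (views[i].epoch < agreed) {
            agreed = views[i].epoch;
        }
    }
    return agreed;
}
```

With three participants reporting epochs 4, 3 and 5, the agreed epoch under this rule is 3. Which participants contribute (original processes only, new ones too, or a subset) is exactly the open question in the notes.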

Discussion of how the ULFM approach implements its consensus method. Agreement doesn't need to be blocking. It is hidden inside MPI_Comm_shrink.
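As a reminder of the semantics: in the ULFM proposal, the agreement operation (`MPIX_Comm_agree`) combines the integer flags contributed by the surviving processes with a bitwise AND. The sketch below simulates only that combination rule in plain C, without MPI and without the fault-tolerant consensus protocol itself. The helper name `simulate_agree` and the `alive` array are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: bitwise AND of the flags contributed by the
 * processes marked alive. Sets *saw_failure if any process is dead. */
int simulate_agree(const int *flags, const bool *alive, size_t n,
                   bool *saw_failure)
{
    assert(flags != NULL && alive != NULL && saw_failure != NULL);
    int agreed = ~0;  /* all bits set: identity element for bitwise AND */
    *saw_failure = false;
    for (size_t i = 0; i < n; i++) {
        if (alive[i]) {
            agreed &= flags[i];
        } else {
            *saw_failure = true;  /* dead process contributes nothing */
        }
    }
    return agreed;
}
```

For flags 7, 5, 0 and 13, with the third process failed, the survivors agree on 5 and the failure is reported separately.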

Perhaps make the fencing mechanism optional, or offer both a fast version ("I'm feeling lucky") and one that is more robust in the face of errors.

Discussed options on the mpiexec command line to control the FT model. These could be used to influence the behavior of Sessions-related consensus functions, failure modes, etc.

Discussion of distributed use of process sets vs the master process model employed in D.'s thesis.

Epochs in process set names: no, we discussed this before and it leads to problems.

A fence operation to make sure everyone's group_from_pset_name call really returns the same group. Need to be able to do this fence possibly over a subset of the processes included in the process set. An example where this might be needed is if a subset of processes wants to spawn additional processes without involvement of all processes in the process set.
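The check such a fence would perform can be sketched in plain C, without MPI. Each participant's resolved group is modeled as an ordered array of member ids, and the fence succeeds only if every participant in the chosen subset holds the same array. The names `group_view_t` and `fence_groups_match` are hypothetical; in a real implementation the comparison would be a collective among the fencing processes rather than a loop over locally available views.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: the group one process obtained for a pset name. */
typedef struct {
    const int *members;  /* ordered member ids */
    size_t count;
} group_view_t;

/* Returns true if all listed participants resolved the same group.
 * participants[] holds indices into views[]; it may name only a subset. */
bool fence_groups_match(const group_view_t *views,
                        const size_t *participants, size_t nparticipants)
{
    assert(views != NULL && participants != NULL && nparticipants > 0);
    const group_view_t *ref = &views[participants[0]];
    for (size_t i = 1; i < nparticipants; i++) {
        const group_view_t *cur = &views[participants[i]];
        if (cur->count != ref->count) {
            return false;
        }
        if (cur->count > 0 &&
            memcmp(cur->members, ref->members,
                   cur->count * sizeof(int)) != 0) {
            return false;
        }
    }
    return true;
}
```

Note that the participant list is an input: as the notes point out, the set of fencing processes has to be known ahead of time for this approach to work.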

Have to know ahead of time who will be fencing to use this approach for consensus.

Back to versions - moves the problem down a level.
