2024 01 22 webex
Continue from where the group left off, with discussion driven by Martin Schreiber's slide deck.
Discuss how psets behave in the presence of a fault. Should we allow MPI_GROUP_FROM_SESSION_PSET? Dan thinks one option is to return the group, but use it in the context of the failed group set in ULFM, so we would be okay with the notion of a process set that contains failed processes. Roberto thinks we need a local function to obtain the group of failed processes. One could also call MPI_COMM_SHRINK before trying to use a communicator returned from MPI_COMM_CREATE_FROM_GROUP. Dan sees the implementation returning a communicator that contains fail-stopped processes, which the application will need to shrink if it intends to use it for collectives (e.g. MPI_BARRIER).
MPI_GROUP_FROM_SESSION_PSET
If the PSET contains failed processes, the function will still return a group that includes the failed members. The user can manipulate the group using the failed set obtained from the ULFM call that returns that information, and then use the result to create a new communicator with MPI_COMM_CREATE_FROM_GROUP.
MPI_COMM_CREATE_FROM_GROUP
May return a communicator that contains failed processes. The app will need to call MPI_COMM_SHRINK if it wants to use the resulting communicator with collectives or non-smart pt2pt.
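The group manipulation described above can be sketched as a toy model. This is not MPI code: groups are plain Python lists, the process ids are made up, and `group_from_pset` and `group_difference` are hypothetical helpers that only mimic the ordering behaviour of MPI_GROUP_DIFFERENCE (members of the first group, in their original order, that are not in the second).

```python
def group_from_pset(pset_members):
    """Toy stand-in: the group contains every pset member, failed or not."""
    return list(pset_members)


def group_difference(group, other):
    """Members of `group` not in `other`, keeping the order of `group`."""
    excluded = set(other)
    return [p for p in group if p not in excluded]


# Hypothetical pset with four processes; process 12 has failed.
pset = (10, 11, 12, 13)
group = group_from_pset(pset)      # still lists the failed process
failed = [12]                      # assumed to come from a ULFM-style query
usable = group_difference(group, failed)

print("group from pset:", group)
print("group used to create the communicator:", usable)
```

In this model the pset itself is untouched; only the group handed to communicator creation is reduced.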
Martin thinks we do need a way to go from a group to a process set.
Aurelian joins. Should we shrink communicators or groups? He doesn't think a purely local way to get the failed members of a group is reliable. Different processes may "see" different failed members, since there is no consensus point.
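The concern about purely local failure knowledge can be illustrated with a small toy model. Nothing here is an MPI or ULFM call: the process ids and local views are invented, and `agree_on_failed` (a plain union of all views) is only a stand-in for whatever consensus step an implementation would actually perform.

```python
def surviving_members(group, known_failed):
    """Group members a process believes are alive, in group order."""
    return [p for p in group if p not in known_failed]


def agree_on_failed(local_views):
    """Toy consensus: union of every participant's locally known failures."""
    agreed = set()
    for view in local_views.values():
        agreed |= view
    return agreed


group = [0, 1, 2, 3, 4]
# Hypothetical local knowledge: each surviving process has detected a
# different subset of the failures so far.
local_views = {0: {3}, 1: {3, 4}, 2: set()}

# Without a consensus point, each process derives a different group.
local_results = {p: surviving_members(group, v) for p, v in local_views.items()}

# With an agreement step, every process derives the same group.
agreed = agree_on_failed(local_views)
agreed_results = {p: surviving_members(group, agreed) for p in local_views}

print("local:", local_results)
print("agreed:", agreed_results)
```

If each process built a communicator from its own local result, the participants would disagree on membership, which is the reliability problem raised in the discussion.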
MPI_COMM_SHRINK_FROM_GROUP
Process set size must remain constant.
Discuss growing. Aurelian thinks this is more in the sessions/process creation arena than FT.
Aurelian mentions FT at the sessions level (like a revoke). No time to discuss further today.