Howard Pritchard edited this page Jan 22, 2024 · 1 revision

Agenda - 1/22/24

Discussion

Continue from where the group left off, with discussion driven by Martin Schreiber's slide deck.

Discuss how psets behave in the presence of a fault. Should we allow MPI_GROUP_FROM_SESSION_PSET in that case? Dan thinks one way would be to return the group, but use it together with the failed-process group that ULFM reports. So we would be okay with the notion of a process set that contains failed processes. Roberto thinks we need a local function to obtain the group of failed processes. One could also call MPI_COMM_SHRINK before trying to use a communicator returned from MPI_COMM_CREATE_FROM_GROUP. Dan sees the implementation returning a communicator that contains fail-stopped processes and that will need to be shrunk by the application if it intends to use it for collectives (e.g. MPI_BARRIER).

MPI_GROUP_FROM_SESSION_PSET

If there are failed processes in the pset, the function will still return a group that includes the failed members. The user can remove them from the group using the failed set returned by the ULFM call that provides that information, and then use the result to create a new communicator with MPI_COMM_CREATE_FROM_GROUP.
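
The group manipulation described here amounts to a set difference, which is what MPI_Group_difference computes for real group handles. A minimal sketch of that step follows, written as a plain-C model with no MPI calls: ranks are integers, a group is an array of ranks, and the names `rank_in_set` and `group_without_failed` are hypothetical helpers invented for illustration. How the failed set would be obtained before a communicator exists is exactly the open question in these notes, so it is simply passed in.

```c
#include <stddef.h>

/* Plain-C model, no MPI calls: a "group" is an array of process ranks. */

/* Returns 1 if `rank` appears in `set`, 0 otherwise. */
static int rank_in_set(const int *set, size_t n, int rank)
{
    for (size_t i = 0; i < n; i++) {
        if (set[i] == rank) {
            return 1;
        }
    }
    return 0;
}

/* Copies the members of `group` that are not in `failed` into `out`,
 * preserving the order of `group`, and returns how many were copied.
 * `out` must have room for `n_group` entries. */
size_t group_without_failed(const int *group, size_t n_group,
                            const int *failed, size_t n_failed,
                            int *out)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_group; i++) {
        if (!rank_in_set(failed, n_failed, group[i])) {
            out[n_out++] = group[i];
        }
    }
    return n_out;
}
```

In an actual application the surviving group, not the original one, would be the argument handed to MPI_COMM_CREATE_FROM_GROUP.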

MPI_COMM_CREATE_FROM_GROUP

May return a communicator with failed processes. The application will need to call MPI_COMM_SHRINK if it wants to use the resulting communicator with collectives or non-smart point-to-point communication.

Martin thinks we do need a way to go from a group to a process set.

Aurelian joins. Should we shrink communicators or groups? He doesn't think a purely local way to get the failed members of a group is reliable. Different processes may "see" different failed members, since we don't have a consensus point.
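
The concern about purely local views can be illustrated with a small plain-C model (again no MPI calls). Bit i of a mask stands for process i, and each process has its own mask of processes it believes have failed. The function names and the "union of all views" rule are assumptions made for this sketch; they stand in for whatever consensus step a real design would use and are not taken from ULFM or the MPI standard.

```c
#include <stddef.h>

/* Plain-C model, no MPI calls: bit i of a mask stands for process i. */

/* Members that one process would keep given only its own failure view. */
unsigned local_survivors(unsigned members, unsigned failed_view)
{
    return members & ~failed_view;
}

/* Hypothetical agreement rule: a process counts as failed if any
 * participant's view reports it as failed (bitwise OR of all views). */
unsigned agreed_failed(const unsigned *views, size_t n_views)
{
    unsigned all = 0u;
    for (size_t i = 0; i < n_views; i++) {
        all |= views[i];
    }
    return all;
}
```

With two participants whose views differ, `local_survivors` yields two different groups, while applying it to the result of `agreed_failed` gives every participant the same group. That is the property a consensus point would provide.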

MPI_COMM_SHRINK_FROM_GROUP

Process set size must remain constant.

Discuss growing. Aurelian thinks this is more in the sessions/process creation arena than FT.

Aurelian mentions FT at the sessions level (like a revoke). No time to discuss further today.
