
MPI_Comm_spawn in Slurm environment #5835

Closed
raffenet opened this issue Feb 7, 2022 · 6 comments · Fixed by #5909

raffenet (Contributor) commented Feb 7, 2022

Originated from a user email: https://lists.mpich.org/pipermail/discuss/2022-January/006360.html

1. MPICH + Hydra + PMI1 (crashes) Fixed in #5838
2. MPICH + Hydra + PMI2 (works but ignores "hosts" info key) Fixed in #5849.
3. MPICH + srun + PMI2 (crashes)
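
For reference, the parent-side call being exercised looks roughly like the sketch below. This is a minimal illustration only, not the reporter's code; the worker executable name, host names, and process count are placeholder assumptions.

    /* Minimal parent-side sketch: MPI_Comm_spawn with the "hosts" info key.
     * Placeholder values: "./worker", "node01,node02", 2 spawned processes. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        /* Reserved spawn info key listing the nodes to place the children on */
        MPI_Info_set(info, "hosts", "node01,node02");

        MPI_Comm intercomm;
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }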

raffenet self-assigned this Feb 8, 2022

raffenet (Contributor, Author) commented Feb 8, 2022

More details:

  1. MPICH + Hydra + PMI1 (crashes)

Crashes during MPIDU_Init_shm_init() in the spawned root process.

  2. MPICH + Hydra + PMI2 (works but ignores "hosts" info key)
  3. MPICH + srun + PMI2 (crashes)

PMI2 Spawn is not supported. Both of these configurations will call MPIR_Assert(0); in src/util/mpir_pmi.c.
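
For context, the "spawned root process" mentioned above is on the child side of the spawn. A minimal sketch of what such a worker does right after MPI_Init is shown below (illustrative only, not the reporter's code).

    /* Minimal child-side sketch: a spawned worker detects its parent after MPI_Init.
     * In case 1 above, the reported crash happens during MPIDU_Init_shm_init(),
     * i.e. inside initialization, before the worker reaches this point. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);
        if (parent != MPI_COMM_NULL) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            printf("spawned rank %d sees a parent intercommunicator\n", rank);
            MPI_Comm_disconnect(&parent);
        }

        MPI_Finalize();
        return 0;
    }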

raffenet added a commit to raffenet/mpich that referenced this issue Feb 8, 2022
Only consider the nodes used in a launch as part of the total core
count. See pmodels#5835.
raffenet added a commit to raffenet/mpich that referenced this issue Feb 9, 2022
Only consider the nodes used in a launch as part of the total core
count. See pmodels#5835.

hzhou (Contributor) commented Feb 17, 2022

From the original bug reporter:

-- quote --

Things were working fine when I was launching 1-node jobs under Slurm 20.11.8, but when I launched a 20-node job, MPICH hangs in MPI_Init. The output of “mpiexec -verbose” is attached, and the stack trace at the point where it hangs is below.

In the “mpiexec -verbose” output, I wonder why variables such as PATH_modshare point to our Intel MPI implementation, which I am not using. I am using MPICH 4.0 with a patch that Ken Raffenetti provided (which makes MPICH recognize the “host” info key). My $PATH and $LD_LIBRARY_PATH variables definitely point to the correct MPICH installation.

I appreciate any help you can give.

Here is the Slurm sbatch command:

sbatch --nodes=20 --ntasks=20 --job-name $job_name --exclusive --verbose

Here is the mpiexec command:

mpiexec -verbose -launcher ssh -print-all-exitcodes -np 20  -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 <many more args…>

Stack trace at MPI_Init:

#0  0x00007f6d85f499b2 in read () from /lib64/libpthread.so.0
#1  0x00007f6d87a5753a in PMIU_readline (fd=5, buf=buf@entry=0x7ffd6fb596e0 "", maxlen=maxlen@entry=1024)
    at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmiutil.c:134
#2  0x00007f6d87a57a56 in GetResponse (request=0x7f6d87b48351 "cmd=barrier_in\n",
    expectedCmd=0x7f6d87b48345 "barrier_out", checkRc=0) at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmi.c:818
#3  0x00007f6d87a29915 in MPIDI_PG_SetConnInfo (rank=rank@entry=0,
    connString=connString@entry=0x1bbf5a0 "description#n001$port#33403$ifname#172.16.56.1$")
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpidi_pg.c:559
#4  0x00007f6d87a38611 in MPID_nem_init (pg_rank=pg_rank@entry=0, pg_p=pg_p@entry=0x1bbf850, has_parent=<optimized out>)
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c:393
#5  0x00007f6d87a2ad93 in MPIDI_CH3_Init (has_parent=<optimized out>, pg_p=0x1bbf850, pg_rank=0)
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/ch3_init.c:83
#6  0x00007f6d87a1b3b7 in init_world () at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:190
#7  MPID_Init (requested=<optimized out>, provided=provided@entry=0x7f6d87e03540 <MPIR_ThreadInfo>)
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:76
#8  0x00007f6d879828eb in MPII_Init_thread (argc=argc@entry=0x7ffd6fb5a5cc, argv=argv@entry=0x7ffd6fb5a5c0,
    user_required=0, provided=provided@entry=0x7ffd6fb5a574, p_session_ptr=p_session_ptr@entry=0x0)
    at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:208
#9  0x00007f6d879832a5 in MPIR_Init_impl (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)
    at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:93
#10 0x00007f6d8786388e in PMPI_Init (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)
    at ../mpich-slurm-patch-4.0/src/binding/c/init/init.c:46
#11 0x000000000040640d in main (argc=23, argv=0x7ffd6fb5ad68) at src/NeedlesMpiManagerMain.cpp:53

hzhou (Contributor) commented Feb 17, 2022

Attaching the console log:
console.log

raffenet (Contributor, Author) commented:

From what I can tell, not all of the nodes are able to launch proxies via ssh, so the processes that did launch are waiting in PMI_Barrier. I recommended -launcher ssh because using the default (srun) was throwing an error:

    srun: Job 84993 step creation temporarily disabled, retrying (Requested nodes are busy)

It would be preferable to launch using the default, but we might be missing an option, or a configuration change may be needed.

hzhou (Contributor) commented Feb 18, 2022

-- quote --

From what I can tell, not all of the nodes are able to launch proxies via ssh, so the processes that did launch are waiting in PMI_Barrier. I recommended -launcher ssh because using the default (srun) was throwing an error:

    srun: Job 84993 step creation temporarily disabled, retrying (Requested nodes are busy)

It would be preferable to launch using the default, but we might be missing an option, or a configuration change may be needed.

-- end quote --

When the user does mpiexec -np 20 ..., hydra will launch 1 proxy on each node, right? So hydra could simply launch the newly spawned processes using the existing proxies rather than spawning new ones, right? If the above message is from mpiexec launching a new proxy, I guess that means the user's Slurm settings prevent srun from running more than 1 job on a given node.

raffenet (Contributor, Author) commented:

I think the failure Kurt is seeing is during the initial launch. It might be that host key authentication is not set up for all the nodes in the system.
