
MPI_Comm_spawn in Slurm environment #5835

Closed
raffenet opened this issue Feb 7, 2022 · 6 comments · Fixed by #5909

raffenet (Contributor) commented Feb 7, 2022

Originated from a user email: https://lists.mpich.org/pipermail/discuss/2022-January/006360.html

1. MPICH + Hydra + PMI1 (crashes) Fixed in #5838
2. MPICH + Hydra + PMI2 (works but ignores "hosts" info key) Fixed in #5849.
3. MPICH + srun + PMI2 (crashes)
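
For reference, the parent-side call being exercised looks roughly like the sketch below. This is a minimal illustration only, not the reporter's code; the worker executable name, host names, and process count are placeholder assumptions.

    /* Minimal parent-side sketch: MPI_Comm_spawn with the "hosts" info key.
     * Placeholder values: "./worker", "node01,node02", 2 spawned processes. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        /* Reserved spawn info key listing the nodes to place the children on */
        MPI_Info_set(info, "hosts", "node01,node02");

        MPI_Comm intercomm;
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }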

raffenet self-assigned this Feb 8, 2022

raffenet (Contributor, Author) commented Feb 8, 2022

More details:

  1. MPICH + Hydra + PMI1 (crashes)

Crashes during MPIDU_Init_shm_init() in the spawned root process.

  2. MPICH + Hydra + PMI2 (works but ignores "hosts" info key)
  3. MPICH + srun + PMI2 (crashes)

PMI2 Spawn is not supported. Both of these configurations will call MPIR_Assert(0); in src/util/mpir_pmi.c.
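
For context, the "spawned root process" mentioned above is on the child side of the spawn. A minimal sketch of what such a worker does right after MPI_Init is shown below (illustrative only, not the reporter's code).

    /* Minimal child-side sketch: a spawned worker detects its parent after MPI_Init.
     * In case 1 above, the reported crash happens during MPIDU_Init_shm_init(),
     * i.e. inside initialization, before the worker reaches this point. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);
        if (parent != MPI_COMM_NULL) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            printf("spawned rank %d sees a parent intercommunicator\n", rank);
            MPI_Comm_disconnect(&parent);
        }

        MPI_Finalize();
        return 0;
    }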

raffenet added a commit to raffenet/mpich that referenced this issue Feb 8, 2022
Only consider the nodes used in a launch as part of the total core
count. See pmodels#5835.
raffenet added a commit to raffenet/mpich that referenced this issue Feb 9, 2022
Only consider the nodes used in a launch as part of the total core
count. See pmodels#5835.

hzhou (Contributor) commented Feb 17, 2022

From the original bug reporter:

-- quote --

Things were working fine when I was launching 1-node jobs under Slurm 20.11.8, but when I launched a 20-node job, MPICH hangs in MPI_Init. The output of “mpiexec -verbose” is attached, and the stack trace at the point where it hangs is below.

In the “mpiexec -verbose” output, I wonder why variables such as PATH_modshare point to our Intel MPI implementation, which I am not using. I am using MPICH 4.0 with a patch that Ken Raffenetti provided (which makes MPICH recognize the “host” info key). My $PATH and $LD_LIBRARY_PATH variables definitely point to the correct MPICH installation.

I appreciate any help you can give.

Here is the Slurm sbatch command:

sbatch --nodes=20 --ntasks=20 --job-name $job_name --exclusive --verbose

Here is the mpiexec command:

mpiexec -verbose -launcher ssh -print-all-exitcodes -np 20  -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 <many more args…>

Stack trace at MPI_Init:

#0  0x00007f6d85f499b2 in read () from /lib64/libpthread.so.0
#1  0x00007f6d87a5753a in PMIU_readline (fd=5, buf=buf@entry=0x7ffd6fb596e0 "", maxlen=maxlen@entry=1024)
    at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmiutil.c:134
#2  0x00007f6d87a57a56 in GetResponse (request=0x7f6d87b48351 "cmd=barrier_in\n",
    expectedCmd=0x7f6d87b48345 "barrier_out", checkRc=0) at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmi.c:818
#3  0x00007f6d87a29915 in MPIDI_PG_SetConnInfo (rank=rank@entry=0,
    connString=connString@entry=0x1bbf5a0 "description#n001$port#33403$ifname#172.16.56.1$")
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpidi_pg.c:559
#4  0x00007f6d87a38611 in MPID_nem_init (pg_rank=pg_rank@entry=0, pg_p=pg_p@entry=0x1bbf850, has_parent=<optimized out>)
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c:393
#5  0x00007f6d87a2ad93 in MPIDI_CH3_Init (has_parent=<optimized out>, pg_p=0x1bbf850, pg_rank=0)
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/ch3_init.c:83
#6  0x00007f6d87a1b3b7 in init_world () at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:190
#7  MPID_Init (requested=<optimized out>, provided=provided@entry=0x7f6d87e03540 <MPIR_ThreadInfo>)
    at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:76
#8  0x00007f6d879828eb in MPII_Init_thread (argc=argc@entry=0x7ffd6fb5a5cc, argv=argv@entry=0x7ffd6fb5a5c0,
    user_required=0, provided=provided@entry=0x7ffd6fb5a574, p_session_ptr=p_session_ptr@entry=0x0)
    at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:208
#9  0x00007f6d879832a5 in MPIR_Init_impl (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)
    at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:93
#10 0x00007f6d8786388e in PMPI_Init (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)
    at ../mpich-slurm-patch-4.0/src/binding/c/init/init.c:46
#11 0x000000000040640d in main (argc=23, argv=0x7ffd6fb5ad68) at src/NeedlesMpiManagerMain.cpp:53

hzhou (Contributor) commented Feb 17, 2022

Attaching the console log:
console.log

raffenet (Contributor, Author) commented:

From what I can tell, not all of the nodes are able to launch proxies via ssh, so the processes that did launch are waiting in PMI_Barrier. I recommended -launcher ssh because using the default (srun) was throwing an error:

    srun: Job 84993 step creation temporarily disabled, retrying (Requested nodes are busy)

It would be preferable to launch using the default, but we might be missing an option, or a configuration change may be needed.

hzhou (Contributor) commented Feb 18, 2022

-- quote --

From what I can tell, not all of the nodes are able to launch proxies via ssh, so the processes that did launch are waiting in PMI_Barrier. I recommended -launcher ssh because using the default (srun) was throwing an error:

    srun: Job 84993 step creation temporarily disabled, retrying (Requested nodes are busy)

It would be preferable to launch using the default, but we might be missing an option, or a configuration change may be needed.

-- end quote --

When the user does mpiexec -np 20 ..., hydra will launch 1 proxy on each node, right? So hydra could simply launch the newly spawned processes using the existing proxies rather than spawning new ones, right? If the above message is from mpiexec launching a new proxy, I guess that means the user's Slurm settings prevent srun from running more than 1 job on a given node.

raffenet (Contributor, Author) commented:

I think the failure Kurt is seeing is during the initial launch. It might be that host key authentication is not set up for all the nodes in the system.
