MPI_Comm_spawn in Slurm environment #5835
More details:
Crashes during
PMI2 Spawn is not supported. Both of these configurations will call
Only consider the nodes used in a launch as part of the total core count. See pmodels#5835.
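For context, here is a minimal sketch of the kind of parent program under discussion: it spawns workers through MPI_Comm_spawn and uses the "host" info key that the patch mentioned below makes MPICH recognize. The executable name "./worker", the hostname "node01", and the count of 4 are hypothetical placeholders, not values taken from the report.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* Ask the process manager to place the spawned processes on a specific
     * host; recognizing this key is what the patch mentioned in the report
     * enables. "node01" is a placeholder hostname. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node01");

    /* Spawn 4 copies of "./worker". The spawn request is carried out by the
     * process manager through PMI, which is where the "PMI2 Spawn is not
     * supported" limitation applies. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, info, 0, MPI_COMM_WORLD,
                   &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```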
From the original bug reporter:
Things were working fine when I was launching 1 node jobs under Slurm 20.11.8, but when I launched a 20 node job, MPICH hangs in MPI_Init. The output of “mpiexec -verbose” is attached, and the stack trace at the point where it hangs is below. In the “mpiexec -verbose” output, I wonder why variables such as PATH_modshare point to our Intel MPI implementation, which I am not using. I am using MPICH 4.0 with a patch that Ken Raffenetti provided (which makes MPICH recognize the “host” info key). My $PATH and $LD_LIBRARY_PATH variables definitely point to the correct MPICH installation. I appreciate any help you can give. Here is the Slurm sbatch command:
Here is the mpiexec command:
Stack trace at MPI_Init:
Attached console log:
From what I can tell, not all the nodes are able to launch proxies via ssh, so the processes that did launch are waiting in PMI_Barrier. I recommended
It would be preferable to launch using the default, but we might be missing an option. Or there needs to be a configuration change.
When the user does
I think the failure Kurt is seeing is during the initial launch. It might be that host key authentication is not set up for all the nodes in the system.
Originated from user email https://lists.mpich.org/pipermail/discuss/2022-January/006360.html.
1. MPICH + Hydra + PMI1 (crashes). Fixed in #5838.
2. MPICH + Hydra + PMI2 (works but ignores "hosts" info key). Fixed in #5849.
3. MPICH + srun + PMI2 (crashes).
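For reference, once a spawn does succeed, the spawned processes obtain the inter-communicator back to the parent job with MPI_Comm_get_parent. A minimal sketch of such a child program, matching the hypothetical "./worker" in the parent example above:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    int rank;

    MPI_Init(&argc, &argv);

    /* MPI_COMM_NULL here means this process was not started via
     * MPI_Comm_spawn. */
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        fprintf(stderr, "not spawned by a parent job\n");
    } else {
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("spawned worker rank %d is up\n", rank);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```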