crash with a bus error in IMB-MPI1 #3251
@LaHaine does the benchmark successfully complete any of the transfer sizes, or does it hit the SIGBUS before any of the allreduces have run? I'm trying to reproduce on a different system (a Cray XC), and what I'm noticing is that if I don't turn off tuned, the test just hangs on the 16384-byte reduction when using the OB1 PML. If I switch to the CM PML the test runs to completion. |
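For reference, the point-to-point layer can be forced from the command line through the pml MCA parameter; a minimal sketch, not necessarily the exact invocation used in the test above:
```shell
# Force the PML: ob1 (BTL-based) vs. cm (MTL-based, e.g. on top of PSM2)
mpirun --mca pml ob1 IMB-MPI1 Allreduce
mpirun --mca pml cm IMB-MPI1 Allreduce
```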
@hppritcha: the initial transfer sizes were fine; in my test case (1024 processes on 32 nodes) it fails only after a number of successful transfers. |
Oh, it seems I'm having some trouble reproducing the bug myself. In my last test, mpirun IMB-MPI1 Allreduce finished with 1024 cores, but mpirun IMB-MPI1 failed at a different stage:
|
This may be an ob1-PML-specific problem. I tried on an Intel OPA/HFI1 (PSM2) system and the test runs successfully. I also tried 544 processes and that worked as well. Could you try running with these two different environment variable settings:
and
set in the shell? Our Mellanox cluster isn't big enough to reproduce the problem. Another thing you may want to try is to see whether you can get MXM installed on the system and rebuild Open MPI to pick up MXM support. |
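Any MCA parameter can also be exported from the shell as an OMPI_MCA_* environment variable before launching; a sketch with hypothetical values, not the two settings referred to above:
```shell
# MCA parameters become environment variables by prefixing them with OMPI_MCA_
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,openib
mpirun IMB-MPI1 Allreduce
```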
The issue seems to be related to vader fragment initialization. Doing so in the middle of the run, in an Allreduce of 1M (which does have a synchronizing behavior), suggests that we have a vader fragment leak somewhere (and we need to grow the fragment list because we are running out of allocated fragments). There might be other causes, but this seems the most plausible. @LaHaine can you check the OS-reported memory consumption during this test? I wonder if we are exhausting the entire memory and getting an invalid fragment at some point when we try to grow the list of vader fragments. |
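One simple way to record the OS-reported memory consumption on a compute node while the benchmark runs; a sketch, with an arbitrary sampling interval and log file name:
```shell
# Log free memory (in GB) every 5 seconds; stop with Ctrl-C once the run finishes
while true; do
    date +%T
    free -g
    sleep 5
done | tee memwatch.log
```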
Okay, our Mellanox EDR cluster actually does have enough nodes to partially(?) reproduce this problem using 2.1.0. With UCX, the test runs successfully. With the ob1 PML using either vader or sm, I'm seeing hangs at this point:
free -g doesn't show any nodes with particularly low available memory. Backtraces vary but look like:
and
The hang behavior goes away if the tuned collective component is excluded. |
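For reference, excluding the tuned collective component for a run looks like this; a sketch:
```shell
# The leading ^ excludes the listed component instead of selecting it
mpirun --mca coll ^tuned IMB-MPI1 Allreduce
# or, equivalently, via the environment
export OMPI_MCA_coll='^tuned'
```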
This seems to be a different behavior from the one this issue was opened for. By the way, how badly do you oversubscribe the nodes for the test with 1024 processes? (I noticed there is a gigantic drop in performance between 8k and 16k.) |
I'm running with 1 MPI process per hyperthread. I'm not seeing problems using the UCX PML, although there is a jump from 8 KB to 16 KB
|
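For the record, one rank per hardware thread can be requested explicitly from mpirun; a sketch, not necessarily the exact flags used here:
```shell
# Treat hardware threads as slots, map one rank to each, and bind it there
mpirun --use-hwthread-cpus --map-by hwthread --bind-to hwthread IMB-MPI1 Allreduce
```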
@hppritcha with OMPI_MCA_btl=self,sm,openib I get the same crash. With OMPI_MCA_btl=self,vader,tcp I get these messages; I guess it is trying to use the IPoIB interfaces:
|
@LaHaine by default, Open MPI will use all the available interfaces.
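The interfaces used by the TCP BTL and the out-of-band channel can be restricted through MCA parameters; a sketch, with the interface names being assumptions for this cluster:
```shell
# Keep TCP traffic off the IPoIB interface (and the loopback device)
export OMPI_MCA_btl_tcp_if_exclude=lo,ib0
export OMPI_MCA_oob_tcp_if_exclude=lo,ib0
# ...or pin the TCP BTL to a single Ethernet interface instead (do not mix include and exclude)
# export OMPI_MCA_btl_tcp_if_include=eth0
```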
@LaHaine I noted you reported "the same crash" with
@ggouaillardet: you are right; the crash looks different in that case:
I used the variables OMPI_MCA_btl_tcp_if_exclude=ib0 and OMPI_MCA_oob_tcp_if_exclude=ib0 in the test without InfiniBand, but that failed in the end with this message:
The oom-killer wouldn't kick in here; AFAIK Slurm would prevent that. Additional test runs without openib failed with this message:
|
I was able to reproduce the issue with 1 KNL node running
Note I need to run this in a loop, since the crash does not always happen. Without
I guess. On the KNL node, I noted this in dmesg:
I have never seen this before, but that being said, I have never used XFS before either ... |
@LaHaine can you try again with @bosilca @hjelmn should |
@ggouaillardet: yes, /tmp is on XFS. With that variable I got this in the first run:
and in the second run:
|
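It may also be worth confirming what actually backs /tmp and how much space or quota is left there; a sketch, and the xfs_quota step only applies if XFS quotas are configured and you have root:
```shell
# Which filesystem backs /tmp, and how full is it?
df -hT /tmp
stat -f /tmp
# XFS project/user quotas can run out even when df still looks roomy
xfs_quota -x -c 'report -h' /tmp
```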
@LaHaine this is specific to |
With self,vader,openib and /dev/shm it doesn't crash anymore, but the program hangs here:
|
Somehow we are leaking resources, and as a result we force the SM manager (vader in this instance) to keep mapping memory until we reach some OS limit. Now we only have to find out how we are leaking the fragments (otherwise we wouldn't be in mca_btl_vader_frag_init) ... @ggouaillardet using /dev/shm might work, but it will make the cleanup more complicated and error prone. At some point we had a check that we are not mmapping files on a shared-memory filesystem. What happened to that check? |
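To see how much room the tmpfs behind /dev/shm has and where the shared-memory backing files end up; a sketch, and the session-directory naming is an assumption that varies across Open MPI versions:
```shell
# Capacity and usage of the tmpfs mounted at /dev/shm on a compute node
df -h /dev/shm
# Open MPI's per-job session directories (and the vader backing files) usually live under TMPDIR
ls -lR "${TMPDIR:-/tmp}"/ompi* 2>/dev/null
```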
@bosilca did you mean a check that we do not mmap on a remote file system? |
With Open MPI 3.0.0 it no longer crashes for me. Instead it simply hangs at gather with 1024 processes. |
I think I was able to solve this problem. The crash must have been caused by the quota on the /tmp directory that was used by Open MPI. Simply setting TMPDIR to /scratch (my system's job scratch path, which has no quota) makes the crash disappear. |
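As described, the workaround amounts to pointing Open MPI's temporary/session directory at a location without a quota before launching; a sketch, with the /scratch path being specific to that system:
```shell
# Redirect Open MPI's session directory away from the quota-limited /tmp
export TMPDIR=/scratch/$USER
mkdir -p "$TMPDIR"
mpirun IMB-MPI1 Allreduce
```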
I have found a way to crash Open MPI when running the Intel MPI benchmark. I first found this when testing OpenHPC 1.2 on EL7.3 with openmpi-gnu-ohpc-1.10.4-18.1, but I could reproduce it with a self-compiled Open MPI 1.10.6 or 2.1.0 as well, both compiled with OpenHPC's gcc 5.4.0.
The Intel MPI benchmark is from the package imb-gnu-openmpi-ohpc-4.1-4.2; I had to recompile it for use with Open MPI 2.1.0.
The command line (in Slurm, but tested outside of Slurm as well):
mpirun IMB-MPI1 Allreduce
This will only run the Allreduce part of the benchmark.
It crashes 100% of the time with 1024 cores on 32 machines. Another configuration I found that crashed was 544 cores on 17 machines.
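A sketch of the failing geometry as a Slurm batch job; the partition name and per-node task count are assumptions:
```shell
#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32   # 32 nodes x 32 tasks = 1024 ranks
#SBATCH --partition=compute    # hypothetical partition name

mpirun IMB-MPI1 Allreduce
```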
The crash looks like this:
The configure log from Open MPI 2.1.0:
configure-log.txt
The crash does not happen when I use the mpirun option --mca coll ^tuned.
I'd like to add that I am using the openib BTL with Red Hat's OFED stack and Mellanox FDR InfiniBand.