
crash with a bus error in IMB-MPI1 #3251

Closed
LaHaine opened this issue Mar 29, 2017 · 21 comments

@LaHaine

LaHaine commented Mar 29, 2017

I have found a way to crash Open MPI when running the Intel MPI benchmark. I first found this when testing OpenHPC 1.2 on EL7.3 with openmpi-gnu-ohpc-1.10.4-18.1, but I could reproduce it with a self-compiled Open MPI 1.10.6 or 2.1.0 as well, both compiled with OpenHPC's gcc 5.4.0.
The Intel MPI benchmark is from the package imb-gnu-openmpi-ohpc-4.1-4.2; I had to recompile it for use with Open MPI 2.1.0.

The command line (in Slurm, but tested outside of Slurm as well):
mpirun IMB-MPI1 Allreduce
This runs only the Allreduce part of the benchmark.
It crashes 100% of the time with 1024 cores on 32 machines. Another configuration I found that crashed was 544 cores on 17 machines.
The crash looks like this:

[pax11-17:16978] *** Process received signal ***
[pax11-17:16978] Signal: Bus error (7)
[pax11-17:16978] Signal code: Non-existant physical address (2)
[pax11-17:16978] Failing at address: 0x2b147b785450
[pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
[pax11-17:16978] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
[pax11-17:16978] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
[pax11-17:16978] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
[pax11-17:16978] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
[pax11-17:16978] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
[pax11-17:16978] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
[pax11-17:16978] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
[pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
[pax11-17:16978] [ 9] IMB-MPI1[0x407284]
[pax11-17:16978] [10] IMB-MPI1[0x40250e]
[pax11-17:16978] [11]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
[pax11-17:16978] [12] IMB-MPI1[0x401f79]
[pax11-17:16978] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 552 with PID 0 on node pax11-17
exited on signal 7 (Bus error).
--------------------------------------------------------------------------

The configure log from openmpi 2.1.0:
configure-log.txt

The crash does not happen when I use the mpirun option --mca coll ^tuned.
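For reference, a sketch of the workaround invocation (everything besides the --mca coll ^tuned option is taken from the command above):
mpirun --mca coll ^tuned IMB-MPI1 Allreduce    # exclude the tuned collective component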

I'd like to add that I am using the openib BTL, Red Hat's OFED stack, and Mellanox FDR InfiniBand.

@hppritcha
Member

hppritcha commented Mar 29, 2017

@LaHaine does the benchmark successfully complete any byte transfer sizes, or does it hit the SIGBUS before any of the allreduces have been done?

I'm trying to reproduce on a different system (Cray XC), and what I'm noticing is that if I don't turn off tuned, the test just hangs trying to do the 16384-byte reduction when using the OB1 PML. If I switch to the CM PML, the test runs to completion.

@LaHaine
Author

LaHaine commented Mar 29, 2017

@hppritcha: the initial transfers were fine; in my test case (1024 processes on 32 nodes) it fails after some successful transfers.

@LaHaine
Author

LaHaine commented Mar 29, 2017

Oh, it seems I have some trouble reproducing the bug myself. In my last test, mpirun IMB-MPI1 Allreduce finished with 1024 cores, but mpirun IMB-MPI1 failed at a different stage:

# Benchmarking Reduce 
# #processes = 64 
# ( 960 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.05         0.04
            4         1000         6.58         6.64         6.61
            8         1000         6.29         6.33         6.32
           16         1000         6.35         6.41         6.39
           32         1000         6.46         6.52         6.48
           64         1000         6.81         6.88         6.84
          128         1000         8.95         9.03         8.98
          256         1000         9.90        10.03         9.95
          512         1000        11.53        11.59        11.56
         1024         1000        14.24        14.33        14.27
         2048         1000        17.79        17.86        17.82
         4096         1000        24.84        24.92        24.87
         8192         1000        37.59        37.70        37.65
        16384         1000        63.54        63.72        63.63
        32768         1000       117.43       117.73       117.58
        65536          640       223.14       224.23       223.71
       131072          320       475.55       476.60       476.28
       262144          160       822.46       825.96       825.17
       524288           80      4432.12      4439.29      4437.10
[pax11-00:27160] *** Process received signal ***
[pax11-00:27160] Signal: Bus error (7)
[pax11-00:27160] Signal code: Non-existant physical address (2)
[pax11-00:27160] Failing at address: 0x2b37a5aa4490
[pax11-00:27160] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b37969af370]
[pax11-00:27160] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b37a546a5e0]
[pax11-00:27160] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b37972267c1]
[pax11-00:27160] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b37a5468b51]
[pax11-00:27160] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b37a5e0f17f]
[pax11-00:27160] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b37a5e030aa]
[pax11-00:27160] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_reduce_generic+0x843)[0x2b379674a753]
[pax11-00:27160] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_reduce_intra_pipeline+0xd4)[0x2b379674ace4]
[pax11-00:27160] [ 8] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x1c7)[0x2b37a6c5c037]
[pax11-00:27160] [ 9] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(MPI_Reduce+0x1b2)[0x2b37967256c2]
[pax11-00:27160] [10] IMB-MPI1[0x40b72f]
[pax11-00:27160] [11] IMB-MPI1[0x402646]
[pax11-00:27160] [12] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3796bddb35]
[pax11-00:27160] [13] IMB-MPI1[0x401f79]
[pax11-00:27160] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 15 with PID 0 on node pax11-00 exited on signal 7 (Bus error).

@hppritcha
Member

This may be an ob1 PML specific problem. I tried on an Intel OPA/HFI1 (PSM2) system and the test runs successfully. I also tried 544 processes and that worked as well. Could you try running with these two different environment variable settings:

export OMPI_MCA_btl=self,sm,openib

and

export OMPI_MCA_btl=self,vader,tcp

set in the shell?

Our Mellanox cluster isn't big enough to try to reproduce the problem.

Another thing you may want to try is to see if you can get MXM installed on the system and rebuild Open MPI to pick up MXM support.

@bosilca
Member

bosilca commented Mar 29, 2017

The issue seems to be related to vader fragment initialization. Doing so in the middle of the run, in an Allreduce of 1M (which does have a synchronizing behavior), suggests that we have a vader fragment leak somewhere (and we need to grow the fragment list because we are running out of allocated fragments). There might be other causes, but this seems to be the most plausible.

@LaHaine can you check the OS-reported memory consumption during this test? I wonder whether we are exhausting the available memory and getting an invalid fragment at some point when we try to grow the list of vader fragments.
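A sketch for watching the OS-reported memory on a compute node while the benchmark runs (run it on each node, or through a parallel shell such as pdsh if one is available):
watch -n 5 'free -m'    # refresh the memory summary every 5 seconds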

@hppritcha
Member

Okay, our Mellanox EDR cluster actually does have enough nodes to at least partially reproduce this problem using 2.1.0. With UCX, the test runs successfully. With the ob1 PML using either vader or sm, I'm seeing hangs at this point:

# List of Benchmarks to run:

# Allreduce

#----------------------------------------------------------------
# Benchmarking Allreduce 
# #processes = 1024 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.06         0.09         0.08
            4         1000        37.88        37.92        37.90
            8         1000        37.14        37.18        37.16
           16         1000        26.51        26.53        26.53
           32         1000        26.19        26.20        26.20
           64         1000        27.71        27.73        27.72
          128         1000        37.83        37.86        37.85
          256         1000        34.72        34.74        34.73
          512         1000        38.25        38.27        38.26
         1024         1000        47.98        48.00        47.99
         2048         1000        66.77        66.80        66.79
         4096         1000       129.78       129.82       129.80
         8192         1000       230.42       230.53       230.49
        16384            6    623569.21    779826.13    704594.47

free -g doesn't show any nodes with particularly low available memory.

Backtraces vary but look like:

#0  0x00007ffff75e0057 in sched_yield () from /usr/lib64/libc.so.6
#1  0x00007ffff7b24f3d in ompi_request_default_wait () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#2  0x00007ffff7b6c91d in ompi_coll_base_sendrecv_nonzero_actual () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#3  0x00007ffff7b6db8c in ompi_coll_base_allreduce_intra_ring () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#4  0x00007ffff7b3595b in PMPI_Allreduce () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#5  0x000000000040b173 in IMB_allreduce ()
#6  0x000000000040700b in IMB_init_buffers_iter ()
#7  0x0000000000402361 in main ()

and

#0  0x00007ffff75e0057 in sched_yield () from /usr/lib64/libc.so.6
#1  0x00007ffff7b24f3d in ompi_request_default_wait () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#2  0x00007ffff7b6d741 in ompi_coll_base_allreduce_intra_ring () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#3  0x00007ffff7b3595b in PMPI_Allreduce () from /opt/hi-master/openmpi/2.1.0/lib/libmpi.so.20
#4  0x000000000040b173 in IMB_allreduce ()
#5  0x000000000040700b in IMB_init_buffers_iter ()
#6  0x0000000000402361 in main ()

The hang behavior goes away if the tuned collective component is excluded. I tried running with only the self and openib BTLs allowed and got similar behavior if tuned is not excluded.

@bosilca
Member

bosilca commented Mar 29, 2017

This seems to be different behavior from the one this issue was opened for.

Btw, how badly do you oversubscribe the nodes for the test with 1024 processes? (I noticed there is a gigantic drop in performance between 8k and 16k.)

@hppritcha
Member

I'm running with 1 MPI process per hyperthread. I'm not seeing problems using the UCX PML, although there is a jump from 8 KB to 16 KB:

# Allreduce

#----------------------------------------------------------------
# Benchmarking Allreduce 
# #processes = 1024 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.06         0.09         0.08
            4         1000        73.78        73.89        73.84
            8         1000        81.84        81.92        81.86
           16         1000        83.88        83.97        83.95
           32         1000        78.07        78.15        78.08
           64         1000       125.07       125.12       125.08
          128         1000       129.40       129.59       129.50
          256         1000        60.02        60.09        60.07
          512         1000        46.09        46.12        46.10
         1024         1000        47.85        47.88        47.87
         2048         1000       109.26       109.33       109.29
         4096         1000       149.50       149.57       149.53
         8192         1000       272.98       273.11       273.04
        16384         1000      8192.82      8195.51      8194.20
        32768         1000      7891.81      7894.33      7893.07
        65536          640      5491.77      5491.85      5491.82
       131072          320      3349.48      3349.81      3349.73
       262144          160      3891.05      3893.24      3892.39
       524288           80      4302.73      4306.28      4304.85
      1048576           40      7104.18      7121.27      7112.57
      2097152           20     14567.01     14706.02     14637.48
      4194304           10     24752.48     25171.36     24991.72

@LaHaine
Author

LaHaine commented Mar 30, 2017

@hppritcha with OMPI_MCA_btl=self,sm,openib I get the same crash. With OMPI_MCA_btl=self,vader,tcp I get these messages; I guess it is trying to use the IPoIB interfaces:

[pax11-00][[12751,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.225.203 failed: Connection timed out (110)
[pax11-00][[12751,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.225.205 failed: Connection timed out (110)

@ggouaillardet
Contributor

@LaHaine by default, Open MPI will use all the available interfaces
(e.g. messages are split between the Ethernet and the IPoIB interface).
You can restrict it to the Ethernet interface (assuming it is eth0) with
OMPI_MCA_btl_tcp_if_include=eth0
and you might also need
OMPI_MCA_oob_tcp_if_include=eth0
If not already done, can you please make sure you did not run out of memory?
(You can run dmesg on all your compute nodes and look for oom-killer related messages.)
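A sketch of the oom-killer check across all compute nodes (assuming pdsh is available; the node list is only an example):
pdsh -w pax11-[00-31] 'dmesg | grep -i oom' | sort    # any hit indicates the OOM killer fired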

@ggouaillardet
Contributor

@LaHaine I noted you reported "the same crash" with OMPI_MCA_btl=self,sm,openib.
The stack trace you posted clearly shows btl/vader is used, so if btl/sm were used instead, the stack trace would be different.
Can you please double-check this?
(e.g. that the environment variables are exported and contain no typos)
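One way to double-check which BTL is actually selected (a sketch; the verbosity level is arbitrary):
env | grep OMPI_MCA_btl    # confirm the variable is exported and spelled correctly
mpirun --mca btl_base_verbose 100 IMB-MPI1 Allreduce 2>&1 | grep -i btl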

@LaHaine
Author

LaHaine commented Mar 31, 2017

@ggouaillardet: you are right, the crash looks different in that case:

[pax11-00:65294] *** Process received signal ***
[pax11-00:65294] Signal: Bus error (7)
[pax11-00:65294] Signal code: Non-existant physical address (2)
[pax11-00:65294] Failing at address: 0x2ab061d6f000
[pax11-00:65294] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2ab0512ad370]
[pax11-00:65294] [ 1] /usr/lib64/libc.so.6(+0x14ac00)[0x2ab051604c00]
[pax11-00:65294] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_convertor_pack+0x161)[0x2ab051b3aa21]
[pax11-00:65294] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_prepare_src+0x19e)[0x2ab05770a87e]
[pax11-00:65294] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x19f)[0x2ab057d2c0ff]
[pax11-00:65294] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x155a8)[0x2ab057d2e5a8]
[pax11-00:65294] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1b2)[0x2ab05770c222]
[pax11-00:65294] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_progress+0x3c)[0x2ab051b2b99c]
[pax11-00:65294] [ 8] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x105)[0x2ab057d205b5]
[pax11-00:65294] [ 9] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Recv+0x18c)[0x2ab0510233dc]
[pax11-00:65294] [10] IMB-MPI1[0x40bb83]
[pax11-00:65294] [11] IMB-MPI1[0x407280]
[pax11-00:65294] [12] IMB-MPI1[0x40250c]
[pax11-00:65294] [13] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab0514dbb35]
[pax11-00:65294] [14] IMB-MPI1[0x401f79]
[pax11-00:65294] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 26 with PID 0 on node pax11-00 exited on signal 7 (Bus error).
--------------------------------------------------------------------------

I used the variables OMPI_MCA_btl_tcp_if_exclude=ib0 and OMPI_MCA_oob_tcp_if_exclude=ib0 in the test without InfiniBand, but that failed in the end with this message:

        16384         1000       165.24       170.28       167.77       183.52
        32768         1000       205.69       211.62       209.04       295.34
[pax11-01][[42762,1],63][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[42762,1],32]
[pax11-00][[42762,1],31][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[42762,1],0]

oom-killer wouldn't happen here; AFAIK Slurm would prevent that. Additional test runs without openib failed with this message:

[pax11-00:28505] *** Process received signal ***
[pax11-00:28505] Signal: Bus error (7)
[pax11-00:28505] Signal code: Non-existant physical address (2)
[pax11-00:28505] Failing at address: 0x2b2669839b50
[pax11-00:28511] *** Process received signal ***
[pax11-00:28511] Signal: Bus error (7)
[pax11-00:28511] Signal code: Non-existant physical address (2)
[pax11-00:28511] Failing at address: 0x2b5fc4020d50
[pax11-00:28483] *** Process received signal ***
[pax11-00:28483] Signal: Bus error (7)
[pax11-00:28483] Signal code: Non-existant physical address (2)
[pax11-00:28483] Failing at address: 0x2b20b1f6fad0
[pax11-00:28511] [ 0] /usr/lib64/libpthread.so.0[pax11-00:28505] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b265b035370]
[pax11-00:28483] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b20a37fe370]
[pax11-00:28483] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b20b1c685e0]
[pax11-00:28483] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b20a40757c1]
[pax11-00:28483] [ 3] [pax11-00:28505] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b26696315e0(+0xf370)[0x2b5fb57a5370]
[pax11-00:28511] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5fbbbe35e0]
[pax11-00:28511] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x23d6)[0x2b20b1c663d6]
[pax11-00:28483] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_bml_r2.so(+0x2adf)[0x2b20b164dadf]
[pax11-00:28483] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x8b)[0x2b20b257a34b]
[pax11-00:28483] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_mpi_init+0x7c4)[0x2b20a354e1f4]
[pax11-00:28483] [ 7] +0x211)[0x2b5fb601c7c1]
[pax11-00:28511] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x24ed)[0x2b5fbbbe14ed]
[pax11-00:28511] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_bml_r2.so]
[pax11-00:28505] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b265b8ac7c1]
[pax11-00:28505] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x23d6)[0x2b266962f3d6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Init+0x53)[0x2b20a356d1a3]
[pax11-00:28483] [ 8] IMB-MPI1[0x402077]
[pax11-00:28483] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b20a3a2cb35]
[pax11-00:28483] [10] IMB-MPI1[pax11-00:28505] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_bml_r2.so(+0x2adf)[0x2b2669016adf]
[pax11-00:28505] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x2adf)[0x2b5fbb5c8adf]
[pax11-00:28511] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x8b)[0x2b5fc460c34b]
[pax11-00:28511] [0x401f79]
[pax11-00:28483] *** End of error message ***
[ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_mpi_init+0x7c4)[0x2b5fb54f51f4(mca_pml_ob1_add_procs+0x8b)[0x2b2669e4234b]
[pax11-00:28505] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(]
[pax11-00:28511] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Init+0x53ompi_mpi_init+0x7c4)[0x2b265ad851f4]
[pax11-00:28505] [ 7] )[0x2b5fb55141a3]
[pax11-00:28511] [ 8] IMB-MPI1[0x402077/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Init+0x53)[0x2b265ada41a3]
[pax11-00:28511] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0x]
[pax11-00:28505] [ 8] IMB-MPI1[0x402077]
[pax11-00:28505] [ 9] f5)[0x2b5fb59d3b35]
[pax11-00:28511] [10] IMB-MPI1[0x401f79/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b265b263b35]
[pax11-00:28505] [10] IMB-MPI1]
[pax11-00:28511] *** End of error message ***
[0x401f79]
[pax11-00:28505] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 15 with PID 0 on node pax11-00 exited on signal 7 (Bus error).
--------------------------------------------------------------------------

@ggouaillardet
Contributor

I was able to reproduce the issue on 1 KNL node running

mpirun -np 64 --mca pml ob1 --mca btl vader,tcp,self ./IMB-MPI1 Reduce -npmin 32

Note that I need to run this in a loop, since the crash does not always happen (50 iterations is generally enough to evidence the issue); a sketch of such a loop is shown below.
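# run repeatedly; the crash is intermittent, ~50 iterations usually evidences it
for i in $(seq 1 50); do
    echo "iteration $i"
    mpirun -np 64 --mca pml ob1 --mca btl vader,tcp,self ./IMB-MPI1 Reduce -npmin 32 || break    # stop on the first failure
done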
Also note I was unable to reproduce the issue with --enable-debug.

Without --enable-debug, the default CFLAGS is -O3 ... which is a bit puzzling to me (most projects use -g -O2 by default). Anyway, I was also able to reproduce the issue with -g -O2, and here are the details:

(gdb) bt
#0  vader_fifo_read (ep=<synthetic pointer>, fifo=0x7fa0ad5e3008) at ../../../../../../src/ompi-v2.x/opal/mca/btl/vader/btl_vader_fifo.h:134
#1  mca_btl_vader_poll_fifo () at ../../../../../../src/ompi-v2.x/opal/mca/btl/vader/btl_vader_component.c:609
#2  mca_btl_vader_component_progress () at ../../../../../../src/ompi-v2.x/opal/mca/btl/vader/btl_vader_component.c:699
#3  0x00007fa0b7818e2c in opal_progress () at ../../../src/ompi-v2.x/opal/runtime/opal_progress.c:225
#4  0x00007fa0ad3cfea5 in ompi_request_wait_completion (req=<optimized out>) at ../../../../../../src/ompi-v2.x/ompi/request/request.h:392
#5  mca_pml_ob1_recv (addr=0x7ffcee3183ac, count=1, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7ffcee3183d0)
    at ../../../../../../src/ompi-v2.x/ompi/mca/pml/ob1/pml_ob1_irecv.c:129
#6  0x00007fa0b839f86f in PMPI_Recv (buf=0x7ffcee3183ac, count=1, type=0x613300 <ompi_mpi_int>, source=<optimized out>, tag=<optimized out>, comm=0x614d00 <ompi_mpi_comm_world>, status=0x7ffcee3183d0)
    at precv.c:77
#7  0x000000000040471e in IMB_init_communicator ()
#8  0x0000000000401f02 in main ()
(gdb) f 1
#1  mca_btl_vader_poll_fifo () at ../../../../../../src/ompi-v2.x/opal/mca/btl/vader/btl_vader_component.c:609
609	        hdr = vader_fifo_read (mca_btl_vader_component.my_fifo, &endpoint);
(gdb) p *mca_btl_vader_component.my_fifo
Cannot access memory at address 0x7fa0ad5e3008

I guess mca_btl_vader_component.my_fifo points to an mmap'ed file, which could explain why gdb is unable to dereference the pointer.

On the KNL node, I noted this in dmesg:

[109733.285661] BUG: Bad page map in process IMB-MPI1  pte:00000060 pmd:ec3b54067
[109733.293777] addr:00007fa0ad5e3008 vm_flags:000000fb anon_vma:          (null) mapping:ffff880ec7b364c8 index:0
[109733.305136] vma->vm_ops->fault: xfs_filemap_fault+0x0/0xa0 [xfs]
[109733.312004] vma->vm_file->f_op->mmap: xfs_file_mmap+0x0/0x40 [xfs]
[109733.319044] CPU: 152 PID: 243574 Comm: IMB-MPI1 Tainted: G    B      OE  ------------   3.10.0-327.el7.x86_64 #1
[109733.319050] Hardware name: FUJITSU PRIMERGY CX1640 M1/D3727-A1, BIOS V5.0.0.12 R1.8.0 for D3727-A1x                     01/06/2017
[109733.319059]  00007fa0ad5e3008 000000005c6a25e8 ffff880ead19fdf0 ffffffff816351f1
[109733.319159]  ffff880ead19fe38 ffffffff811927cf 0000000000000060 0000000000000000
[109733.319254]  00007fa0ad5e3008 ffff880db5cb1b50 ffff880dc8d08288 ffff880ec96f9900
[109733.319358] Call Trace:
[109733.319395]  [<ffffffff816351f1>] dump_stack+0x19/0x1b
[109733.319426]  [<ffffffff811927cf>] print_bad_pte+0x1af/0x250
[109733.319457]  [<ffffffff81197972>] handle_mm_fault+0xea2/0xf50
[109733.319487]  [<ffffffff811337ed>] ? ring_buffer_unlock_commit+0x2d/0x250
[109733.319518]  [<ffffffff81640e22>] __do_page_fault+0x152/0x420
[109733.319546]  [<ffffffff81641113>] do_page_fault+0x23/0x80
[109733.319576]  [<ffffffff8163d408>] page_fault+0x28/0x30

I have never seen this before, but then again, I have never used XFS before either ...

@ggouaillardet
Contributor

@LaHaine can you try again with OMPI_MCA_orte_tmpdir_base=/dev/shm ?
I was able to reach 100 iterations so far ...
Also, can you confirm your orte_tmpdir_base is an XFS filesystem?
(If you did not force this, $TMPDIR is used, and it falls back to /tmp if not set.)
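A sketch for checking the filesystem type of the current tmpdir and retrying with /dev/shm:
df -T ${TMPDIR:-/tmp}    # shows the filesystem type (e.g. xfs)
export OMPI_MCA_orte_tmpdir_base=/dev/shm
mpirun IMB-MPI1 Allreduce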

@bosilca @hjelmn should btl/vader use /dev/shm instead of /tmp to store its shared files by default? (see shmem_mmap_backing_file_base_dir vs orte_tmpdir_base)

@LaHaine
Author

LaHaine commented Mar 31, 2017

@ggouaillardet: yes, /tmp is on XFS. With that variable I got this in the first run:

        16384         1000       144.32       148.71       146.62       210.13
        32768         1000       163.63       168.98       166.80       369.86
[pax11-01][[46756,1],63][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[46756,1],32]
[pax11-00][[46756,1],31][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[46756,1],0]

and in the second run:

        16384         1000       146.62       151.09       148.94       206.82
        32768         1000       180.02       185.44       183.22       337.03
[pax11-00][[46163,1],31][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 127.0.0.1 failed: Connection refused (111)
[pax11-01][[46163,1],63][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[46163,1],33]

@ggouaillardet
Contributor

@LaHaine this is specific to btl/tcp and is a different issue.
Can you try again with the self,vader,openib BTLs?

@LaHaine
Author

LaHaine commented Mar 31, 2017

With self,vader,openib and /dev/shm it doesn't crash anymore, but the program hangs here:

# Benchmarking Gather 
# #processes = 32 
# ( 992 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.06         0.04
            1         1000         2.96         3.00         2.99
            2         1000         2.96         3.00         2.99
            4         1000         3.02         3.06         3.05
            8         1000         3.13         3.17         3.16
           16         1000         3.33         3.38         3.37
           32         1000         3.72         3.73         3.73
           64         1000         4.21         4.26         4.23
          128         1000         5.11         5.18         5.14
          256         1000         7.23         7.36         7.29
          512         1000        10.19        10.22        10.21
         1024         1000         3.95         4.09         4.03
         2048         1000         6.08         6.24         6.15

@bosilca
Member

bosilca commented Apr 3, 2017

Somehow we are leaking resources, and as a result we force the SM manager (vader in this instance) to keep mapping memory until we reach some OS limit. Now we only have to find out how we are leaking the fragments (otherwise we wouldn't be in mca_btl_vader_frag_init) ...

@ggouaillardet using /dev/shm might work, but it will make the cleanup more complicated and error prone. At some point we had a test to check that we are not mmapping files on a shared-memory filesystem. What happened to that check?

@ggouaillardet
Contributor

@bosilca did you mean a test to check that we do not mmap on a remote file system?
XFS is a local file system (and CXFS is the distributed file system based on XFS).

@LaHaine
Author

LaHaine commented Nov 24, 2017

With Open MPI 3.0.0 it no longer crashes for me. Instead it simply hangs at Gather with 1024 processes.

@LaHaine
Author

LaHaine commented Nov 30, 2017

I think I was able to solve this problem. The crash must have been caused by the quota on the /tmp directory that was used by Open MPI. Simply setting TMPDIR to /scratch (my system's job scratch path, which has no quota) makes the crash disappear.
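For reference, a sketch of the workaround in a Slurm batch script (the #SBATCH values match the 32-node/1024-core configuration above; the /scratch path is site-specific):
#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks=1024
export TMPDIR=/scratch    # job scratch area without a quota
mpirun IMB-MPI1 Allreduce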
