
Need advice on debugging an issue with 3.1.3 and -mca pml ob1 #6833

Closed

mwheinz opened this issue Jul 23, 2019 · 21 comments

@mwheinz

mwheinz commented Jul 23, 2019

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Intel OPA build of OMPI v3.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Internal CI testing on a back-to-back OPA fabric.

Please describe the system on which you are running

  • Operating system/version: RHEL 7.5, 7.6, SLES 12 SP3, SLES 12 SP4
  • Computer hardware: Xeon Skylake or better
  • Network type: OPA

Details of the problem

I recently got a report from one of our testers saying that the following command line:

/usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hds1fna5102.hd.intel.com,hds1fna5103.hd.intel.com,hds1fna5104.hd.intel.com,hds1fna5101.hd.intel.com --allow-run-as-root --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 -np 200 --map-by node --oversubscribe /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 200 -iter 150 -iter_policy off

is hanging "unless they remove --mca pml ob1". Of course, looking at this command line, I'm pretty sure that if they do that they stop using OPA altogether.

Since I'm still trying to learn the OMPI internals, I'm unsure how to approach this, so a few questions:

  1. Any chance this is a known issue?
  2. Assuming it's not, any suggestions on how to debug a silent hang? (I mean, besides staring at the screen and hoping for inspiration to strike...)

@ggouaillardet
Contributor

The command you posted forces the use of IP over the ib0 interface for inter-node communications (and vader for intra-node communications).

If you do not pass --mca pml ob1 then (native) OPA will likely be used either via libfabric or PSM2.
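
If you want to double-check which point-to-point components your build contains and which PML actually gets selected at run time, something along these lines should work (just a sketch - adjust the install prefix to match your environment):

    # list the PMLs, MTLs and BTLs compiled into this build
    /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/ompi_info | grep -E "MCA (pml|mtl|btl):"

    # log the PML selection for a small run of the same benchmark
    /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -np 2 --mca pml_base_verbose 10 \
        /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 PingPong

With --mca pml ob1 on the command line, the verbose output should report ob1 as the selected component; without it, cm (backed by PSM2 or libfabric) should win on OPA hardware.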

@mwheinz
Author

mwheinz commented Jul 23, 2019

If you do not pass --mca pml ob1 then (native) OPA will likely be used either via libfabric or PSM2.

Interesting. Thanks for helping me understand what they're testing. Not sure why anyone would even want to use ipoib instead of native OPA....

@bosilca
Member

bosilca commented Jul 23, 2019

This is extremely unsettling, as OB1 + TCP was our de facto working setup for all environments. Fortunately, I cannot reproduce such a deadlock on my systems, even when the processes are heavily oversubscribed, so this might be some issue with your particular setup.

Do you get any output from the test before the deadlock?

@mwheinz
Author

mwheinz commented Jul 23, 2019

Do you get any output from the test before the deadlock?

From my current round of testing it appears that the 0-byte round completes but the 1-byte round never does:

[RHEL7.5 hdsmpriv01 20190723_1154 ~]# /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hdsmpriv01,hdsmpriv02 --allow-run-as-root --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 -np 400 --map-by node --oversubscribe /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 400 -iter 200 -iter_policy off
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part    
#------------------------------------------------------------
# Date                  : Tue Jul 23 12:38:39 2019
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-862.el7.x86_64
# Version               : #1 SMP Wed Mar 21 18:14:51 EDT 2018
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 400 -iter 200 -iter_policy off
#                                                      

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 400 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          200      2425.23      3888.08      3223.43         0.00

While running the test I also ran "while true; do ifstat ib0; sleep 1; done" on the other host. What I saw there was that the packet rate went to zero after the 0-byte round completed.

@bosilca
Member

bosilca commented Jul 23, 2019

Strange, as a 0-byte benchmark round still translates into a message in Open MPI, and thus requires the connections to be set up between the peers. So we can't blame the TCP connection setup; the issue must be somewhere else. Unfortunately I cannot reproduce it on my setup (I only have ipoib), so I would really appreciate some help understanding what is going on. Does it happen on a non-oversubscribed run? What is the smallest number of processes at which this starts happening? Can you get a stack trace from a debug build pinpointing the deadlock?

@mwheinz
Author

mwheinz commented Jul 23, 2019

If you can instruct me on how to collect such a stack trace, I would be happy to do so - I've never quite figured out how to do that.

@mwheinz
Author

mwheinz commented Jul 23, 2019

The primary thing pointing the finger at OMPI instead of OPA is that the same install using OMPI 2.1.2 does not experience the issue. (Which doesn't mean the problem isn't somehow in our own build of OMPI 3.1.3)

@bosilca
Member

bosilca commented Jul 23, 2019

To get the stack trace, you can attach gdb to one of the processes using gdb -p <pid> and then generate the backtrace with the gdb command bt. However, in order for the trace to be human readable, you will need a debug build of OMPI.
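
For example, something like this should dump a backtrace from every benchmark rank on a node (a sketch - it assumes the IMB-MPI1 processes are the only ones matching that pattern, and that gdb is installed on the compute nodes):

    # one-shot, non-interactive backtraces from all local IMB-MPI1 ranks
    for pid in $(pgrep -f IMB-MPI1); do
        echo "Process: $pid"
        gdb -batch -ex bt -p $pid
    done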

Before going there, can you please confirm if this bug appears with a more recent version of OMPI stable (4.0.1) ?

@mwheinz
Author

mwheinz commented Jul 24, 2019

To get the stack trace, you can attach gdb to one of the processes using gdb -p <pid> and then generate the backtrace with the gdb command bt. However, in order for the trace to be human readable, you would need a debug build of OMPI.

Unfortunately, when I built a debug version of 3.1.3 the job doesn't start - it fails with this message:

[RHEL7.5 hdsmpriv01 20190724_1314 OFA_MPIS.RHEL75-x86_64.10.10.0.0.296]# /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hdsmpriv01.hd.intel.com,hdsmpriv02.hd.intel.com --allow-run-as-root --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 -np 400 --map-by node --oversubscribe /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 400 -iter 200 -iter_policy off
[hdsmpriv01:56631] [[18566,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/plm_base_launch_support.c at line 1319

Before going there, can you please confirm if this bug appears with a more recent version of OMPI stable (4.0.1) ?

Doing that now!

@mwheinz
Author

mwheinz commented Jul 24, 2019

@bosilca, I appreciate your help, but I made a disturbing discovery - while I always hit the issue with the "official" build of ompi 3.1.3, none of my developer builds have the problem.

Looks like I'm going to be studying the build logs looking for a discrepancy.

@bosilca
Member

bosilca commented Jul 24, 2019

@mwheinz another place you should look at is the system wide configuration files (${prefix-ompi-3.1}/etc/openmpi-mca-params.conf). It might contain unexpected restrictions.
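
For example (a sketch - the path and the grep are just one way to spot non-default settings; any uncommented name = value line in that file becomes a default for every job launched with that install, unless overridden on the command line):

    # show the non-comment lines of the system-wide MCA parameter file
    grep -v '^#' /usr/mpi/gcc/openmpi-3.1.3-hfi/etc/openmpi-mca-params.conf | grep -v '^$'

    # and any MCA parameters set through the environment
    env | grep '^OMPI_MCA_'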

@ggouaillardet
Contributor

@mwheinz that kind of issue can happen when mpirun and orted are not compatible. For example if you built mpirun with --enable-debug but orted was built without this option.
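
One quick way to check for that kind of mismatch (a sketch, assuming both nodes use the same install prefix) is to compare what ompi_info reports on the node running mpirun and on the remote nodes:

    # run on each host and compare the output; a difference in version or in
    # debug support between mpirun's node and the orted nodes is suspicious
    /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/ompi_info | grep -E 'Open MPI:|Internal debug support'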

@mwheinz
Author

mwheinz commented Jul 25, 2019

@ggouaillardet, @bosilca - looking at the output of autogen and configure, the differences I've noticed between the official build machine and my personal dev machine are that the official machine has the infinipath headers installed (probably not the cause, but easy to check), systemd-devel (libudev.h), and ltdl.h - which seems to be part of a special version of libtool?

There were other differences in the output but these were the ones that looked like they might impact the operation of ompi on a machine that might not have the same software installed - any thoughts?

Oh, and adding @jsquyres - does ompi still have infinipath users? It's been 7 years since I've seen a SusieQ, and I helped develop them...

@jsquyres
Member

Also, we have seen problems with IPoIB in the past (i.e., where the IPoIB stack wasn't 100% stable, and various versions of Open MPI worked/didn't work, depending on how each OMPI version specifically used the TCP stack). I would take --mca btl_tcp_if_include ib0 out altogether, just to remove that variable. Heck, you may even want to add --mca btl_tcp_if_exclude ib0 --mca oob_tcp_if_exclude ib0 to make sure Open MPI doesn't use your IPoIB interfaces at all.
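
Concretely, that second variant would look something like this (a sketch built from the command line reported above; only the interface-selection options change, and depending on your defaults you may also want to keep the loopback interface in the exclude list, e.g. lo,ib0):

    /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hdsmpriv01,hdsmpriv02 --allow-run-as-root \
        --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self \
        --mca btl_tcp_if_exclude ib0 --mca oob_tcp_if_exclude ib0 \
        -np 400 --map-by node --oversubscribe \
        /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 400 -iter 200 -iter_policy off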

By "infinipath headers", I assume you mean that Open MPI was able to compile the PSM MTL?

@ggouaillardet correctly mentioned that --mca pml ob1 means that you have forced Open MPI to use the OB1 PML, which means that the BTLs are used for MPI point-to-point communications (e.g., the TCP BTL, which @bosilca is now worried about). If you remove --mca pml ob1, Open MPI should default back to the cm PML+PSM MTL. But you can force this, too, via --mca pml cm --mca mtl psm -- just to remove any question as to what transport was used.
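
For example (again just a sketch of the variant described above; psm is the original InfiniPath MTL, and psm2 would be the OPA-native one if that is what your build supports):

    /usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hdsmpriv01,hdsmpriv02 --allow-run-as-root \
        --mca pml cm --mca mtl psm \
        -np 400 --map-by node --oversubscribe \
        /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 400 -iter 200 -iter_policy off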

Does ompi still have infinipath users?

That's a question only Intel can really answer... You guys effectively "own" the PSM and PSM2 MTLs; I don't think that anyone other than Intel would remove them.

@mwheinz
Author

mwheinz commented Jul 25, 2019

That's a question only Intel can really answer... You guys effectively "own" the PSM and PSM2 MTLs; I don't think that anyone other than Intel would remove them.

Yeah, that's what I figured. Okay.

As to your other points, I agree that ipoib can be unreliable and, actually, that's the point of this particular test - rather than testing ompi, it's supposed to be testing ipoib. The finger got pointed at ompi, however, because the test stopped working when we shifted our builds to using 3.1.3 instead of 2.1.2.

@mwheinz
Author

mwheinz commented Jul 25, 2019

@mwheinz that kind of issue can happen when mpirun and orted are not compatible. For example if you built mpirun with --enable-debug but orted was built without this option.

@ggouaillardet - thanks for the tip.

@mwheinz
Author

mwheinz commented Jul 26, 2019

So, it turns out that my statement that building with different options fixes the issue was incorrect - I was unable to recreate a working build of 3.1.3 on my machines yesterday, so I have to assume my earlier success was actually a testing error - that I wasn't running the version of OMPI I thought I was.

The current state is that the problem definitely does not occur in 2.1.2 or 4.0.1 but in 3.1.3 it does and I can reproduce the issue with as few as 10 processes across 2 nodes.

To produce the following I re-compiled OMPI 3.1.3 locally on each of two machines, with debuginfo enabled. One host was running RHEL 7.5 and one was running 7.6.

Extracting stack traces from all processes, it appears they are all waiting for input from another process, possibly implying a lost message?

It is interesting to note that the RHEL 7.6 host is using libevent but the RHEL 7.5 host does not appear to be, despite both having the same version of libevent installed.

HOST A: (the job was launched here, running RHEL7.5)

Process: 45956
(gdb) #0 0x00007f29263799d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007f29268cb675 in ompi_request_default_wait ()
#2 0x00007f29268ff9ac in PMPI_Sendrecv ()
#3 0x000000000040b7ac in IMB_sendrecv ()
#4 0x0000000000405706 in IMB_init_buffers_iter ()
#5 0x0000000000402055 in main ()
Process: 45957
(gdb) #0 0x00007fb381adc9d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007fb38202e675 in ompi_request_default_wait ()
#2 0x00007fb3820812d9 in ompi_coll_base_barrier_intra_bruck ()
#3 0x00007fb3820437c7 in PMPI_Barrier ()
#4 0x000000000040b727 in IMB_sendrecv ()
#5 0x0000000000405706 in IMB_init_buffers_iter ()
#6 0x0000000000402055 in main ()
Process: 45958
(gdb) #0 0x00007ff6a8bc59d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007ff6a9117675 in ompi_request_default_wait ()
#2 0x00007ff6a916a2d9 in ompi_coll_base_barrier_intra_bruck ()
#3 0x00007ff6a912c7c7 in PMPI_Barrier ()
#4 0x000000000040b727 in IMB_sendrecv ()
#5 0x0000000000405706 in IMB_init_buffers_iter ()
#6 0x0000000000402055 in main ()
Process: 45959
(gdb) #0 0x00007f9c91d009d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007f9c92252675 in ompi_request_default_wait ()
#2 0x00007f9c922869ac in PMPI_Sendrecv ()
#3 0x000000000040b7ac in IMB_sendrecv ()
#4 0x0000000000405706 in IMB_init_buffers_iter ()
#5 0x0000000000402055 in main ()
Process: 45960
(gdb) #0 0x00007fa5d92159d7 in sched_yield () from /lib64/libc.so.6
#1 0x00007fa5d9767675 in ompi_request_default_wait ()
#2 0x00007fa5d979b9ac in PMPI_Sendrecv ()
#3 0x000000000040b7ac in IMB_sendrecv ()
#4 0x0000000000405706 in IMB_init_buffers_iter ()
#5 0x0000000000402055 in main ()

HOST B: (running RHEL7.6)

Process: 1709
(gdb) #0 0x00007f2c4d20a20d in poll () from /lib64/libc.so.6
#1 0x00007f2c4cbdda36 in poll_dispatch ()
#2 0x00007f2c4cbd4a79 in opal_libevent2022_event_base_loop ()
#3 0x00007f2c4cb8b6b3 in opal_progress ()
#4 0x00007f2c4d74c675 in ompi_request_default_wait ()
#5 0x00007f2c4d7809ac in PMPI_Sendrecv ()
#6 0x000000000040b7ac in IMB_sendrecv ()
#7 0x0000000000405706 in IMB_init_buffers_iter ()
Process: 1710
(gdb) #0 0x00007fa9e0e4020d in poll () from /lib64/libc.so.6
#1 0x00007fa9e0813a36 in poll_dispatch ()
#2 0x00007fa9e080aa79 in opal_libevent2022_event_base_loop ()
#3 0x00007fa9e07c16b3 in opal_progress ()
#4 0x00007fa9e1382675 in ompi_request_default_wait ()
#5 0x00007fa9e13b69ac in PMPI_Sendrecv ()
#6 0x000000000040b7ac in IMB_sendrecv ()
#7 0x0000000000405706 in IMB_init_buffers_iter ()
Process: 1711
(gdb) #0 0x00007fc7c5a9020d in poll () from /lib64/libc.so.6
#1 0x00007fc7c5463a36 in poll_dispatch ()
#2 0x00007fc7c545aa79 in opal_libevent2022_event_base_loop ()
#3 0x00007fc7c54116b3 in opal_progress ()
#4 0x00007fc7c5fd2675 in ompi_request_default_wait ()
#5 0x00007fc7c60069ac in PMPI_Sendrecv ()
#6 0x000000000040b7ac in IMB_sendrecv ()
#7 0x0000000000405706 in IMB_init_buffers_iter ()
Process: 1712
(gdb) #0 0x00007fd8c8aeb20d in poll () from /lib64/libc.so.6
#1 0x00007fd8c84bea36 in poll_dispatch ()
#2 0x00007fd8c84b5a79 in opal_libevent2022_event_base_loop ()
#3 0x00007fd8c846c6b3 in opal_progress ()
#4 0x00007fd8c902d675 in ompi_request_default_wait ()
#5 0x00007fd8c90619ac in PMPI_Sendrecv ()
#6 0x000000000040b7ac in IMB_sendrecv ()
#7 0x0000000000405706 in IMB_init_buffers_iter ()
Process: 1713
(gdb) #0 0x00007f5ffcdb920d in poll () from /lib64/libc.so.6
#1 0x00007f5ffc78ca36 in poll_dispatch ()
#2 0x00007f5ffc783a79 in opal_libevent2022_event_base_loop ()
#3 0x00007f5ffc73a6b3 in opal_progress ()
#4 0x00007f5ffd2fb675 in ompi_request_default_wait ()
#5 0x00007f5ffd32f9ac in PMPI_Sendrecv ()
#6 0x000000000040b7ac in IMB_sendrecv ()
#7 0x0000000000405706 in IMB_init_buffers_iter ()

@bosilca
Member

bosilca commented Jul 26, 2019

OMPI cannot run without libevent; you were just lucky enough to stop your processes on Node A outside the libevent portion of the opal_progress loop. A possible reason is that the processes on Node A behave as if the run was oversubscribed (otherwise we should not call sched_yield), while on Node B they are all progressing. While this might introduce some delays in the communications, it should not create a deadlock.

If you configure your stack with --enable-debug, you should be able to log the ongoing connections by adding btl_base_verbose = 1000 and pml_base_verbose = 1000 to your MCA param file. That would help to see how the different processes see each other after the modex exchange.
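
For example, the additions would look something like this (a sketch - the file path is the one mentioned earlier; exporting the equivalent OMPI_MCA_* environment variables before launching mpirun should work as well):

    # in ${prefix-ompi-3.1}/etc/openmpi-mca-params.conf on every node
    btl_base_verbose = 1000
    pml_base_verbose = 1000

    # or, equivalently, in the shell that launches mpirun
    export OMPI_MCA_btl_base_verbose=1000
    export OMPI_MCA_pml_base_verbose=1000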

@mwheinz
Author

mwheinz commented Jul 29, 2019

@bosilca - this time my attempt to make a debug build worked.

All processes on the node that launched the job are in sched_yield(), on the other node they are all in poll_dispatch().

Unfortunately, the log consists of about 2k lines of text from btl_vader_fbox and I don't recognize anything as particularly significant - the last few lines of output are:

[hdsmpriv02][[34215,1],3][btl_vader_fbox.h:206:mca_btl_vader_check_fboxes] got frag from 0 with header {.tag = 65, .size = 14, .seq = 24} from offset 800
[hdsmpriv02][[34215,1],3][btl_vader_fbox.h:242:mca_btl_vader_check_fboxes] left off at offset 832 (hbs: 0)
[hdsmpriv02][[34215,1],3][btl_vader_fbox.h:147:mca_btl_vader_fbox_sendi] writing fragment of size 32 to offset 768 {start: 0x20, end: 0x300 (hbs: 0)} of peer's buffer. free = 3328
[hdsmpriv02][[34215,1],1][btl_vader_fbox.h:147:mca_btl_vader_fbox_sendi] writing fragment of size 32 to offset 800 {start: 0x20, end: 0x320 (hbs: 0)} of peer's buffer. free = 3296
[hdsmpriv02][[34215,1],1][btl_vader_fbox.h:206:mca_btl_vader_check_fboxes] got frag from 4 with header {.tag = 65, .size = 14, .seq = 13} from offset 448
[hdsmpriv02][[34215,1],1][btl_vader_fbox.h:242:mca_btl_vader_check_fboxes] left off at offset 480 (hbs: 0)
[hdsmpriv02][[34215,1],1][btl_vader_fbox.h:147:mca_btl_vader_fbox_sendi] writing fragment of size 32 to offset 608 {start: 0x20, end: 0x260 (hbs: 0)} of peer's buffer. free = 3488
[hdsmpriv02][[34215,1],9][btl_vader_fbox.h:147:mca_btl_vader_fbox_sendi] writing fragment of size 32 to offset 448 {start: 0x20, end: 0x1c0 (hbs: 0)} of peer's buffer. free = 3648
[hdsmpriv02][[34215,1],7][btl_vader_fbox.h:206:mca_btl_vader_check_fboxes] got frag from 1 with header {.tag = 65, .size = 14, .seq = 23} from offset 768
[hdsmpriv02][[34215,1],7][btl_vader_fbox.h:242:mca_btl_vader_check_fboxes] left off at offset 800 (hbs: 0)

Which I assume means that the last thing that happened was that process #7 received some data from process #1...?

Complete output: mwh.log.gz

Sample BT from first node:

(gdb) #0  0x00007fe3a8bde9d7 in sched_yield () from /lib64/libc.so.6
#1  0x00007fe3a850741b in opal_progress () at runtime/opal_progress.c:256
#2  0x00007fe3a9145dc5 in ompi_request_wait_completion (req=0x1abc8c0) at ../ompi/request/request.h:413
#3  ompi_request_default_wait (req_ptr=0x7ffc567a4640, status=0x7ffc567a4650) at request/req_wait.c:42
#4  0x00007fe3a91aa5ea in ompi_coll_base_sendrecv_zero (stag=-16, rtag=-16, comm=0x1ada070, source=6, dest=<optimized out>) at base/coll_base_barrier.c:64
#5  ompi_coll_base_barrier_intra_bruck (comm=0x1ada070, module=<optimized out>) at base/coll_base_barrier.c:265
#6  0x00007fe3a915f577 in PMPI_Barrier (comm=0x1ada070) at pbarrier.c:71
#7  0x000000000040b727 in IMB_sendrecv ()

Sample from second node:

(gdb) #0  0x00007f2b1022920d in poll () from /lib64/libc.so.6
#1  0x00007f2b0fba48f6 in poll_dispatch () from /usr/mpi/gcc/openmpi-3.1.3rc3-hfi/lib64/libopen-pal.so.40
#2  0x00007f2b0fb9b939 in opal_libevent2022_event_base_loop () from /usr/mpi/gcc/openmpi-3.1.3rc3-hfi/lib64/libopen-pal.so.40
#3  0x00007f2b0fb42403 in opal_progress_events () at runtime/opal_progress.c:189
#4  opal_progress () at runtime/opal_progress.c:242
#5  0x00007f2b10780dc5 in ompi_request_wait_completion (req=0x2079940) at ../ompi/request/request.h:413
#6  ompi_request_default_wait (req_ptr=0x7ffe0547edb0, status=0x7ffe0547edc0) at request/req_wait.c:42
#7  0x00007f2b107e55ea in ompi_coll_base_sendrecv_zero (stag=-16, rtag=-16, comm=0x20966b0, source=7, dest=<optimized out>) at base/coll_base_barrier.c:64
#10 0x000000000040b727 in IMB_sendrecv ()
#11 0x0000000000405706 in IMB_init_buffers_iter ()
#12 0x0000000000402055 in main ()

@bosilca
Member

bosilca commented Jul 29, 2019

I extracted the TCP information from your log file (grep tcp ...) and all connections seem to be correctly established, with both sides reporting a successful handshake. Puzzling.

@mwheinz
Author

mwheinz commented Aug 1, 2019

So, in a fascinating turn of events, I had a chance to try to reproduce the problem with 3.1.4, and the problem does not appear to exist in that release.

I'm stress testing now to make sure, but at the moment the full 100-process run on 2 nodes has completed successfully ~10 times, whereas with 3.1.3 it failed every time.
