Need advice on debugging an issue with 3.1.3 and -mca pml ob1 #6833
The command you posted forces the use of IP over the ib0 interface for inter-node communications (and vader for intra-node communications). If you do not pass
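For reference, here are the relevant selection flags isolated from the command line quoted later in the issue body; this is just a restatement of what the options above do, not a suggested change:

```sh
# pml ob1 + the tcp BTL restricted to ib0: inter-node traffic goes over IPoIB (TCP);
# vader handles intra-node (shared-memory) traffic and self handles send-to-self.
mpirun --mca pml ob1 \
       --mca btl tcp,vader,self \
       --mca btl_tcp_if_include ib0 \
       ... <application>
```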
Interesting. Thanks for helping me understand what they're testing. Not sure why anyone would even want to use IPoIB instead of native OPA....
This is extremely unsettling, as OB1 + TCP was our de facto working setup for all environments. Fortunately, I cannot reproduce such a deadlock on my systems, even when the processes are heavily oversubscribed, so this might be some issue with your particular setting. Do you get any output from the test before the deadlock?
From my current round of testing it appears that the 0-byte round completes but the 1-byte round never does:
While running the test I also ran "while true; do ifstat ib0; sleep 1; done" on the other host. What I saw there was that the packet rate went to zero after the 0-byte round completed.
Strange, as a 0-byte transfer still translates into a message in Open MPI, and thus requires the connections to be set up between the peers. So we can't blame the TCP connection setup; the issue should be somewhere else. Unfortunately I cannot reproduce it on my setup (I only have IPoIB), so I would really appreciate some help to understand what is going on. Does it happen on a non-oversubscribed run? What is the smallest number of processes at which this starts happening? Can you get a stack trace from a debug build pinpointing the deadlock?
If you can instruct me on how to collect such a stack trace, I would be happy to do so - I've never quite figured out how to do that.
The primary thing pointing the finger at OMPI instead of OPA is that the same install using OMPI 2.1.2 does not experience the issue. (Which doesn't mean the problem isn't somehow in our own build of OMPI 3.1.3.)
To get the stack trace, you can attach gdb to one of the hung processes. But before going there, can you please confirm whether this bug also appears with a more recent stable version of OMPI (4.0.1)?
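In case it is useful, a minimal sketch of how such a trace can be collected (assuming gdb is available on the compute node and that IMB-MPI1, the benchmark binary from the original command, is the process of interest):

```sh
# On the node where the job appears hung, find the PID of one of the MPI ranks
pgrep -f IMB-MPI1

# Attach non-interactively, dump backtraces for every thread, then detach
gdb -p <PID> -batch -ex "thread apply all bt"
```

Repeating this for a few ranks on each node usually shows where they are all blocked.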
Unfortunately, when I built a debug version of 3.1.3 the job didn't start; it failed with this message:
Doing that now!
@bosilca, I appreciate your help, but I made a disturbing discovery: while I always get the issue with the "official" build of OMPI 3.1.3, none of my developer builds have the problem. Looks like I'm going to be studying the build logs looking for a discrepancy.
@mwheinz another place you should look at is the system-wide configuration files (${prefix-ompi-3.1}/etc/openmpi-mca-params.conf). It might contain unexpected restrictions.
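A quick way to check for such restrictions (the install prefix below is taken from the path in the original command; adjust it to the build actually being tested):

```sh
# System-wide MCA parameter overrides shipped with this install; any
# "param = value" lines here are applied to every mpirun using this build.
cat /usr/mpi/gcc/openmpi-3.1.3-hfi/etc/openmpi-mca-params.conf

# The per-user file can add further overrides
cat ~/.openmpi/mca-params.conf
```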
@mwheinz that kind of issue can happen when
@ggouaillardet, @bosilca, so, from looking at the output of autogen and configure, the differences I've noticed between the official build machine and my personal dev machine are that the official machine has the infinipath headers installed (probably not the cause, but easy to check), systemd-devel (libudev.h), and ltdl.h, which seems to be part of a special version of libtool? There were other differences in the output, but these were the ones that looked like they might impact the operation of OMPI on a machine that might not have the same software installed. Any thoughts? Oh, and adding @jsquyres: does OMPI still have InfiniPath users? It's been 7 years since I've seen a SusieQ and I helped develop them...
Also, we have seen problems with IPoIB in the past (i.e., where the IPoIB stack wasn't 100% stable, and various versions of Open MPI worked/didn't work, depending on how each OMPI version specifically used the TCP stack). I would take
By "infinipath headers", I assume you mean that Open MPI was able to compile the PSM MTL? @ggouaillardet correctly mentioned that
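A hedged way to check what actually got compiled into each build (run with the ompi_info that belongs to the install under test):

```sh
# List the MTL components in this install; psm/psm2 showing up here means the
# infinipath/OPA headers were found at configure time.
ompi_info | grep -i "MCA mtl"

# The available PML components (ob1, cm, ...) are listed the same way
ompi_info | grep -i "MCA pml"
```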
That's a question only Intel can really answer... You guys effectively "own" the PSM and PSM2 MTLs; I don't think that anyone other than Intel would remove them.
Yeah, that's what I figured. Okay. As to your other points, I agree that ipoib can be unreliable and, actually, that's the point of this particular test - rather than testing ompi, it's supposed to be testing ipoib. The finger got pointed at ompi, however, because the test stopped working when we shifted our builds to using 3.1.3 instead of 2.1.2.
@ggouaillardet - thanks for the tip.
So, it turns out that my statement that building with different options fixes the issue was incorrect. I was unable to recreate a working build of 3.1.3 on my machines yesterday, so I have to assume my earlier success was actually a testing error: I wasn't running the version of OMPI I thought I was.

The current state is that the problem definitely does not occur in 2.1.2 or 4.0.1, but it does occur in 3.1.3, and I can reproduce the issue with as few as 10 processes across 2 nodes. To produce the following I re-compiled OMPI 3.1.3 locally on each of two machines, with debuginfo enabled. One host was running RHEL 7.5 and one was running 7.6.

Extracting stack traces from all processes, it appears they are all waiting for input from another process, possibly implying a lost message? It is interesting to note that the RHEL 7.6 host is using libevent but the 7.5 host does not appear to be, despite both having the same version of libevent installed.

HOST A (the job was launched here, running RHEL 7.5), process 45956:

HOST B (running RHEL 7.6), process 1709:
OMPI cannot run without libevent; you were lucky enough to stop your processes on Node A outside the opal_progress loop. A possible reason is that the processes on Node A behave as if the run was oversubscribed (otherwise we should not call sched_yield), while on Node B they are all progressing. While this might introduce some delays in the communications, it should not create a deadlock. If you configure your stack with --enable-debug, you should be able to log the ongoing connections by adding
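The exact parameter got cut off above, so the following is only an assumed reconstruction of the suggestion, not necessarily what @bosilca had in mind; btl_base_verbose is a standard MCA verbosity knob that makes the TCP BTL log its connection activity:

```sh
# Assumed reconstruction: raise BTL verbosity so connection setup between peers
# is logged (most useful on a build configured with --enable-debug).
mpirun --mca pml ob1 --mca btl tcp,vader,self \
       --mca btl_tcp_if_include ib0 \
       --mca btl_base_verbose 100 \
       ... <application>
```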
@bosilca - this time my attempt to make a debug build worked. All processes on the node that launched the job are in sched_yield(); on the other node they are all in poll_dispatch(). Unfortunately, the log consists of about 2k lines of text from btl_vader_fbox and I don't recognize anything as particularly significant - the last few lines of output are:
Which I assume means that the last thing that happened was that process #7 received some data from process #1...?

Complete output: mwh.log.gz

Sample BT from first node:
Sample from second node:
I extracted the TCP information from your log file (
So, in a fascinating turn of events, I had a chance to try to reproduce the problem with 3.1.4, and it does not appear to exist in that release. I'm stress testing now to make sure, but at the moment the full run of 100 processes on 2 nodes has successfully completed ~10 times, whereas with 3.1.3 it failed every time.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Intel OPA build of OMPI v3.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Internal CI testing on a back-to-back OPA fabric.
Please describe the system on which you are running
Details of the problem
Recently got this issue from one of our testers saying that the following command line:
/usr/mpi/gcc/openmpi-3.1.3-hfi/bin/mpirun -H hds1fna5102.hd.intel.com,hds1fna5103.hd.intel.com,hds1fna5104.hd.intel.com,hds1fna5101.hd.intel.com --allow-run-as-root --mca oob tcp --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 -np 200 --map-by node --oversubscribe /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv -npmin 200 -iter 150 -iter_policy off
is hanging "unless they remove --mca pml ob1". Of course, looking at this command line, I'm pretty sure that if they do that they stop using OPA altogether.
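That interpretation can be checked directly. A hedged sketch (the framework verbosity parameters below are generic MCA knobs, and the IMB-MPI1 invocation just stands in for the real run): with --mca pml ob1 the traffic is forced through BTLs (here tcp over ib0, i.e. IPoIB); dropping that flag normally lets the cm PML with the psm2 MTL win the selection on OPA hardware, which bypasses IPoIB entirely.

```sh
# Observe which PML/MTL is actually selected at startup
mpirun --mca pml_base_verbose 10 --mca mtl_base_verbose 10 \
       -np 2 --map-by node /usr/mpi/gcc/openmpi-3.1.3-hfi/tests/IMB-4.0/IMB-MPI1 Sendrecv
```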
Since I'm still trying to learn the OMPI internals, I'm unsure how to approach this, so a few questions: