Be more helpful when OPAL_PREFIX is wrong #4028

Closed
wants to merge 5 commits into from

rhc54
Contributor

@rhc54 rhc54 commented Aug 4, 2017

If none of the provided search directories for plugins exists, then we really have a non-recoverable problem. As things stand, we fail with a completely unhelpful message from the first framework that tries to open components, finds nothing, and errors out because it needs at least one to work.

The use-case that tripped this was an incorrect setting of OPAL_PREFIX to point to a non-existent location due to a typo. It took significant debugging to find that it wasn't the framework that was the problem - it was the envar.

Try to detect that and provide a more useful error message.
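The check described above can be sketched in shell. This is only an illustration of the idea, not the C code in this PR: the function names `any_search_dir_exists` and `check_component_path` are hypothetical, and the search path is assumed to be a colon-separated list of directories.

```shell
#!/bin/sh
# Hypothetical helper (not Open MPI code): succeed if at least one
# directory in a colon-separated search path exists.
any_search_dir_exists() {
    rest=$1
    while [ -n "$rest" ]; do
        dir=${rest%%:*}                      # first path element
        if [ "$dir" = "$rest" ]; then
            rest=                            # that was the last element
        else
            rest=${rest#*:}                  # drop the element just examined
        fi
        if [ -n "$dir" ] && [ -d "$dir" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical check: when no search directory exists, point at the
# environment rather than at whichever framework happens to fail first.
check_component_path() {
    if ! any_search_dir_exists "$1"; then
        echo "No component directory in '$1' exists;" \
             "check OPAL_PREFIX (currently '${OPAL_PREFIX:-unset}')" >&2
        return 1
    fi
    return 0
}
```

Calling `check_component_path` with the computed search path before any framework opens its components would turn a typo in OPAL_PREFIX into a message that names the variable.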

Signed-off-by: Ralph Castain [email protected]

@rhc54
Contributor Author

rhc54 commented Aug 5, 2017

Refs #3980

@rhc54
Contributor Author

rhc54 commented Aug 5, 2017

I'm seeing weird random failures in the opal_fifo tests that can't have anything to do with this PR - does anyone have any ideas?

@rhc54
Contributor Author

rhc54 commented Aug 5, 2017

bot:ompi:retest

@rhc54
Contributor Author

rhc54 commented Aug 5, 2017

@bwbarrett I'm going to need some help here. I cannot make the opal_fifo tests fail on my machines, even copying the autogen/configure lines used here. Nothing in this PR relates to opal_fifo or touches anything to do with it. So I'm at a total loss as to what might be going on here.

Update ignores

Oops - we were exiting regardless of the value returned from opal_dl_foreachfile. :-(

Also, allow the repository search to continue across all path elements, not just fail after the first one that couldn't be opened

Signed-off-by: Ralph Castain <[email protected]>
@jjhursey
Member

jjhursey commented Aug 7, 2017

I added a comment to the PMIx version of this:

One future feature request might be to clarify what happens if I have two directories in my mca_base_component_path, both valid directories with components (let's say that they are non-overlapping and non-conflicting for the moment). For example, there is a system-installed Open MPI (with headers) and a couple of 'custom' components in my home directory. Is the union of the components in those two directories taken? Just those from the first directory? Just those from the last directory?

@hjelmn
Member

hjelmn commented Aug 7, 2017

Looks like the pull request checker is borked.

@rhc54
Contributor Author

rhc54 commented Aug 7, 2017

bot:ompi:retest

@rhc54
Contributor Author

rhc54 commented Aug 7, 2017

@jjhursey Based on what I see in the code, we work thru every directory in the search path, slurping up all components for the given framework. So it is the union of the directories. I'm not sure what happens if dlopen sees two components of the same name.
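A rough shell sketch of the "union of the directories" behavior described above. It assumes components are shared objects named `mca_<framework>_<component>.so`; the function name `list_components` is hypothetical, and the sketch simply collapses duplicate names, which says nothing about what dlopen actually does with them.

```shell
#!/bin/sh
# Hypothetical helper: print the union of component names for one
# framework found across a colon-separated list of directories.
list_components() {
    framework=$1
    rest=$2
    while [ -n "$rest" ]; do
        dir=${rest%%:*}
        if [ "$dir" = "$rest" ]; then rest=; else rest=${rest#*:}; fi
        # A missing directory is skipped; it does not end the search.
        [ -d "$dir" ] || continue
        for file in "$dir"/mca_"$framework"_*.so; do
            [ -e "$file" ] || continue       # the glob matched nothing
            name=${file##*/mca_"$framework"_}
            echo "${name%.so}"
        done
    done | sort -u                           # duplicates collapse to one name
}
```

For example, `list_components btl "/opt/ompi/lib/openmpi:$HOME/my_components"` (both paths are placeholders) would print every btl component name found in either directory.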

@rhc54
Contributor Author

rhc54 commented Aug 7, 2017

Sigh - I give up. Now we are back to getting opal_fifo test errors when --no-ompi, --no-oshmem, or --no-orte is given to autogen.

@hjelmn
Member

hjelmn commented Aug 7, 2017

I will try to reproduce the issue tomorrow. Failures in just about all the tests that call opal_init_util ().

Ralph Castain added 2 commits August 7, 2017 15:42
…y and output a more helpful error message.

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

What the heck is this one?

08:39:12 make[3]: Leaving directory `/scrap/jenkins/jobs/gh-ompi-master-pr/workspace/test/dss'
08:39:12 make[2]: Leaving directory `/scrap/jenkins/jobs/gh-ompi-master-pr/workspace/test/dss'
08:39:12 Making check in symbol_name
08:39:12 make[2]: Entering directory `/scrap/jenkins/jobs/gh-ompi-master-pr/workspace/test/symbol_name'
08:39:12 make  check-TESTS
08:39:12 make[3]: Entering directory `/scrap/jenkins/jobs/gh-ompi-master-pr/workspace/test/symbol_name'
08:39:12 Test expects env var MYBASE set to base dir
08:39:12 (where ompi/ opal/ orte/ test/ etc live)
08:39:12 And optionally OMPI_LIBMPI_NAME should be set
08:39:12 if MPI is configured with some name other than
08:39:12 "mpi" for that.
08:39:12 FAIL: nmcheck_prefix
08:39:12 ========================================================
08:39:12 1 of 1 test failed
08:39:12 Please report to http://www.open-mpi.org/community/help/

@jjhursey
Member

jjhursey commented Aug 8, 2017

IBM added a feature to rename libmpi* to libFOO if you pass --with-libmpi-name=FOO. That's probably what it's encountering.
IBM uses this in our MTT testing, but not our CI testing (though we probably should).
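As a small illustration of how a test could honor the renamed library, the sketch below derives the file name from OMPI_LIBMPI_NAME and falls back to "mpi" when the variable is unset. The function name `libmpi_filename` is hypothetical, and the exact naming that `--with-libmpi-name` produces is assumed from the description above rather than checked against the build system.

```shell
#!/bin/sh
# Hypothetical helper: name of the main MPI shared library, defaulting
# to "mpi" when OMPI_LIBMPI_NAME is unset or empty.
libmpi_filename() {
    echo "lib${OMPI_LIBMPI_NAME:-mpi}.so"
}
```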

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Is there some way to remove this for now so we can get past it? Or update the OMPI internal tests so it passes?

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Just wondering: since this PR doesn't touch that functionality, how did this get thru CI before?

@jjhursey
Member

jjhursey commented Aug 8, 2017

Humm not sure how that would have gotten through CI (that option has been in there for quite a while now). Let me see if I can reproduce locally.

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

FWIW: I can reproduce on my machine as well. I'm wondering if someone changed the tests as opposed to the option itself? I thought I saw something go thru just this morning about it...

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Bingo! #4036 is what broke things.

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

or at least, it touched this test - the PR says it passed CI...?

@open-mpi open-mpi deleted a comment from ibm-ompi Aug 8, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Aug 8, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Aug 8, 2017
@jjhursey
Member

jjhursey commented Aug 8, 2017

Hard as I try I cannot reproduce the symbol test issue. I've tried with the configure parameters Ralph sent me offline, and the ones from Mellanox CI.

Maybe there is an environment variable conflict for MYBASE which is why the error is being generated. @markalle Do you have any ideas on why it might fail like this?

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

interesting - and I cannot get it to pass on my machines! I don't have any envar like MYBASE set. Yet it consistently fails for me with that message.

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Tried configuring with nothing other than a prefix, but same problem:

make  check-TESTS
make[1]: Entering directory `/home/common/openmpi/ompi/test/symbol_name'
make[2]: Entering directory `/home/common/openmpi/ompi'
make[2]: Leaving directory `/home/common/openmpi/ompi'
make[2]: Entering directory `/home/common/openmpi/ompi'
make[2]: Leaving directory `/home/common/openmpi/ompi'
Test expects env var MYBASE set to base dir
(where ompi/ opal/ orte/ test/ etc live)
And optionally OMPI_LIBMPI_NAME should be set
if MPI is configured with some name other than
"mpi" for that.
FAIL: nmcheck_prefix
========================================================
1 of 1 test failed
Please report to http://www.open-mpi.org/community/help/
========================================================
make[1]: *** [check-TESTS] Error 1
make[1]: Leaving directory `/home/common/openmpi/ompi/test/symbol_name'
make: *** [check-am] Error 2

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Aha - here is what is needed to make this test pass:

$ export MYBASE=/home/common/openmpi/ompi
$ make check
make[1]: Entering directory `/home/common/openmpi/ompi'
make[1]: Leaving directory `/home/common/openmpi/ompi'
make[1]: Entering directory `/home/common/openmpi/ompi'
make[1]: Leaving directory `/home/common/openmpi/ompi'
make  check-TESTS
make[1]: Entering directory `/home/common/openmpi/ompi/test/symbol_name'
make[2]: Entering directory `/home/common/openmpi/ompi'
make[2]: Leaving directory `/home/common/openmpi/ompi'
make[2]: Entering directory `/home/common/openmpi/ompi'
make[2]: Leaving directory `/home/common/openmpi/ompi'
NOTE: found static /home/common/openmpi/ompi/ompi/mca/bml/.libs/libmca_bml.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/coll/.libs/libmca_coll.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/crcp/.libs/libmca_crcp.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/fbtl/.libs/libmca_fbtl.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/fcoll/.libs/libmca_fcoll.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/fs/.libs/libmca_fs.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/io/.libs/libmca_io.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/mtl/.libs/libmca_mtl.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/op/.libs/libmca_op.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/osc/.libs/libmca_osc.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/pml/v/.libs/libmca_pml_v.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/pml/.libs/libmca_pml.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/rte/orte/.libs/libmca_rte_orte.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/rte/.libs/libmca_rte.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/sharedfp/.libs/libmca_sharedfp.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/topo/.libs/libmca_topo.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/vprotocol/.libs/libmca_vprotocol.a
NOTE: found static /home/common/openmpi/ompi/ompi/mca/hook/.libs/libmca_hook.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/allocator/.libs/libmca_allocator.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/backtrace/execinfo/.libs/libmca_backtrace_execinfo.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/backtrace/.libs/libmca_backtrace.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/base/.libs/libmca_base.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/btl/.libs/libmca_btl.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/compress/.libs/libmca_compress.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/crs/.libs/libmca_crs.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/dl/dlopen/.libs/libmca_dl_dlopen.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/dl/.libs/libmca_dl.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/event/libevent2022/.libs/libmca_event_libevent2022.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/event/.libs/libmca_event.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/hwloc/hwloc2a/.libs/libmca_hwloc_hwloc2a.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/hwloc/.libs/libmca_hwloc.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/if/linux_ipv6/.libs/libmca_if_linux_ipv6.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/if/posix_ipv4/.libs/libmca_if_posix_ipv4.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/if/.libs/libmca_if.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/installdirs/config/.libs/libmca_installdirs_config.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/installdirs/env/.libs/libmca_installdirs_env.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/installdirs/.libs/libmca_installdirs.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/memchecker/.libs/libmca_memchecker.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/memcpy/.libs/libmca_memcpy.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/memory/patcher/.libs/libmca_memory_patcher.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/memory/.libs/libmca_memory.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/mpool/.libs/libmca_mpool.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/patcher/.libs/libmca_patcher.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/.libs/libmca_pmix.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/gds/.libs/libmca_gds.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pdl/pdlopen/.libs/libmca_pdl_pdlopen.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pdl/.libs/libmca_pdl.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pif/linux_ipv6/.libs/libmca_pif_linux_ipv6.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pif/posix_ipv4/.libs/libmca_pif_posix_ipv4.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pif/.libs/libmca_pif.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pinstalldirs/config/.libs/libmca_pinstalldirs_config.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pinstalldirs/env/.libs/libmca_pinstalldirs_env.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pinstalldirs/.libs/libmca_pinstalldirs.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pnet/.libs/libmca_pnet.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/psec/.libs/libmca_psec.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/bfrops/.libs/libmca_bfrops.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/preg/.libs/libmca_preg.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/psensor/.libs/libmca_psensor.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/pshmem/.libs/libmca_pshmem.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pmix/pmix2x/pmix/src/mca/ptl/.libs/libmca_ptl.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/pstat/.libs/libmca_pstat.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/rcache/.libs/libmca_rcache.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/reachable/.libs/libmca_reachable.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/shmem/.libs/libmca_shmem.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/timer/linux/.libs/libmca_timer_linux.a
NOTE: found static /home/common/openmpi/ompi/opal/mca/timer/.libs/libmca_timer.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/dfs/.libs/libmca_dfs.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/errmgr/.libs/libmca_errmgr.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/ess/.libs/libmca_ess.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/filem/.libs/libmca_filem.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/grpcomm/.libs/libmca_grpcomm.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/iof/.libs/libmca_iof.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/notifier/.libs/libmca_notifier.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/odls/.libs/libmca_odls.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/oob/.libs/libmca_oob.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/plm/.libs/libmca_plm.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/ras/.libs/libmca_ras.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/rmaps/.libs/libmca_rmaps.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/rml/.libs/libmca_rml.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/routed/.libs/libmca_routed.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/rtc/.libs/libmca_rtc.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/schizo/.libs/libmca_schizo.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/snapc/.libs/libmca_snapc.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/sstore/.libs/libmca_sstore.a
NOTE: found static /home/common/openmpi/ompi/orte/mca/state/.libs/libmca_state.a
NOTE: found static /home/common/openmpi/ompi/oshmem/mca/atomic/.libs/libmca_atomic.a
NOTE: found static /home/common/openmpi/ompi/oshmem/mca/memheap/.libs/libmca_memheap.a
NOTE: found static /home/common/openmpi/ompi/oshmem/mca/scoll/.libs/libmca_scoll.a
NOTE: found static /home/common/openmpi/ompi/oshmem/mca/spml/.libs/libmca_spml.a
NOTE: found static /home/common/openmpi/ompi/oshmem/mca/sshmem/.libs/libmca_sshmem.a
Checking for bad symbol names in the main libs:
checking /home/common/openmpi/ompi/ompi/.libs/libmpi.so
checking /home/common/openmpi/ompi/ompi/mpi/fortran/mpif-h/.libs/libmpi_mpifh.so
checking /home/common/openmpi/ompi/orte/.libs/libopen-rte.so
checking /home/common/openmpi/ompi/opal/.libs/libopen-pal.so
PASS: nmcheck_prefix
=============
1 test passed
=============
make[1]: Leaving directory `/home/common/openmpi/ompi/test/symbol_name'
01:30:12  {topic/mca} /home/common/openmpi/ompi/test/symbol_name$ 

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

@jjhursey Do you have MYBASE set somewhere? I'm wondering if this perl script in the directory is somehow trying to set it, but is using a specific shell syntax that isn't correct for everyone? I'm using bash, which I think is what the test environment uses.

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Here it is, right in the beginning of the test:

sub main {
    if (!$ENV{MYBASE}) {
        print "Test expects env var MYBASE set to base dir\n";
        print "(where ompi/ opal/ orte/ test/ etc live)\n";
        print "And optionally OMPI_LIBMPI_NAME should be set\n";
        print "if MPI is configured with some name other than\n";
        print "\"mpi\" for that.\n";
        exit -1;
    }

# env var MYBASE should be the top dir where ompi/ opal/ orte/ test/ etc live.

So this test cannot run unless MYBASE is set.

@jjhursey
Member

jjhursey commented Aug 8, 2017

I don't have it set in my environment (I'm using bash as well). The Makefile.am in that directory should be setting it appropriately:

AM_TESTS_ENVIRONMENT = MYBASE='$(top_builddir)'; OMPI_LIBMPI_NAME=@OMPI_LIBMPI_NAME@; export MYBASE OMPI_LIBMPI_NAME;

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Maybe this is the problem - from what I can see the correct automake directive is TESTS_ENVIRONMENT, not AM_TESTS_ENVIRONMENT

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

digging...digging...well, the automake docs say it is okay to use AM_TESTS_ENVIRONMENT by the developer, so that doesn't seem to be the issue. However, for whatever reason, it isn't setting the environment on my machine, nor on the test machines.

@jjhursey
Member

jjhursey commented Aug 8, 2017

Curious. What version of automake are you using? I'm at 1.15.

From here I would think that AM_TESTS_ENVIRONMENT is what we want since we are the system and not the user (which is supposed to use TESTS_ENVIRONMENT). But if you are using the serial executor then there is a note that AM_TESTS_ENVIRONMENT will not work...
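To make the mechanism concrete: the value of AM_TESTS_ENVIRONMENT is a shell fragment that only takes effect if the harness pastes it in front of the test command. The sketch below imitates that with `sh -c`. It is not Automake's actual recipe; `run_with_fragment` is a hypothetical helper and `/path/to/build` is a placeholder for `$(top_builddir)`.

```shell
#!/bin/sh
# Stand-in for the fragment from Makefile.am; /path/to/build is a
# placeholder for $(top_builddir).
env_fragment="MYBASE='/path/to/build'; OMPI_LIBMPI_NAME=mpi; export MYBASE OMPI_LIBMPI_NAME;"

# Hypothetical helper: run a command in a child shell with the fragment
# pasted in front of it, roughly as a test harness would.
run_with_fragment() {
    sh -c "$1 $2"
}

# With the fragment, the child sees MYBASE.
run_with_fragment "$env_fragment" 'echo "MYBASE=$MYBASE"'

# With an empty fragment (a harness that ignores the variable), it does not.
( unset MYBASE; run_with_fragment "" 'echo "MYBASE=${MYBASE:-<unset>}"' )
```

The first call prints `MYBASE=/path/to/build` and the second prints `MYBASE=<unset>`, which matches the symptom in this thread: the Makefile.am line is fine, but the test only sees MYBASE if the harness in use actually expands the fragment.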

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

What I have done is push a change to that script so it doesn't error out with a failed test if the envar isn't set. It just skips the test. This seems like a better test to me as the feature it is testing is purely optional.

I'm using 1.15 as well - I believe the Jenkins tester is also.
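One way to express "skip instead of fail" is sketched below in shell. The real test is a Perl script and the thread does not show how the pushed change signals a skip, so this is an assumption: it relies on Automake's convention that a test exiting with status 77 is reported as SKIP. The function name `symbol_test_precheck` is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-check for the symbol-name test.
# Returns 0 to run the test, 77 to skip it (Automake reports exit
# status 77 as SKIP), and 1 for a genuine setup error.
symbol_test_precheck() {
    if [ -z "${MYBASE:-}" ]; then
        echo "MYBASE is not set; skipping symbol-name check" >&2
        return 77
    fi
    if [ ! -d "$MYBASE" ]; then
        echo "MYBASE='$MYBASE' is not a directory" >&2
        return 1
    fi
    return 0
}
```

A wrapper script would call `symbol_test_precheck || exit $?` before running the real check. A skipped test still shows up as SKIP in the `make check` summary, which makes it somewhat harder for the test to go unrun without anyone noticing.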

@jjhursey
Member

jjhursey commented Aug 8, 2017

@rhc54 I think that's fair. I'd like someone (cough @markalle cough) to see if they can reproduce separate from this PR (as this is a digression on the PR) and sort through the issue. What I don't want is to somehow have it that the test is never run and no one notices.

@rhc54
Contributor Author

rhc54 commented Aug 8, 2017

Okay, we are back to dlopen_test failing on various platforms. @hjelmn You have any thoughts as to why? I simply cannot replicate anywhere.

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Aug 10, 2017

not worth my time

@rhc54 rhc54 closed this Aug 10, 2017
@rhc54 rhc54 deleted the topic/mca branch August 10, 2017 00:24