Be more helpful when OPAL_PREFIX is wrong #4028
Refs #3980 |
I'm seeing weird random failures in the opal_fifo tests that can't have anything to do with this PR - does anyone have any ideas? |
bot:ompi:retest |
@bwbarrett I'm going to need some help here. I cannot make the opal_fifo tests fail on my machines, even copying the autogen/configure lines used here. Nothing in this PR relates to opal_fifo or touches anything to do with it. So I'm at a total loss as to what might be going on here. |
…e really have a non-recoverable problem. As things stand, we fail with a completely unhelpful message from the first framework that tries to open components, finds nothing, and errors out because it needs at least one to work.
The use-case that tripped this was an incorrect setting of OPAL_PREFIX to point to a non-existent location due to a typo. It took significant debugging to find that it wasn't the framework that was the problem - it was the envar.
Try to detect that and provide a more useful error message.
Update ignores
Oops - we were exiting regardless of the value returned from opal_dl_foreachfile. :-(
Also, allow the repository search to continue across all path elements, not just fail after the first one that couldn't be opened
Signed-off-by: Ralph Castain <[email protected]>
I added a comment to the PMIx version of this:
One future feature request might be to clarify what happens if I have two directories in my |
Looks like the pull request checker is borked. |
bot:ompi:retest |
@jjhursey Based on what I see in the code, we work thru every directory in the search path, slurping up all components for the given framework. So it is the union of the directories. I'm not sure what happens if dlopen sees two components of the same name. |
Sigh - I give up. Now we are back to getting opal_fifo test errors when --no-ompi, --no-oshmem, or --no-orte is given to autogen. |
I will try to reproduce the issue tomorrow. Failures in just about all the tests that call |
…y and output a more helpful error message. Signed-off-by: Ralph Castain <[email protected]>
…meter Signed-off-by: Ralph Castain <[email protected]>
What the heck is this one?
|
IBM added a feature to rename |
Is there some way to remove this for now so we can get past it? Or update the OMPI internal tests so it passes? |
Just wondering: since this PR doesn't touch that functionality, how did this get thru CI before? |
Humm not sure how that would have gotten through CI (that option has been in there for quite a while now). Let me see if I can reproduce locally. |
FWIW: I can reproduce on my machine as well. I'm wondering if someone changed the tests as opposed to the option itself? I thought I saw something go thru just this morning about it... |
Bingo! #4036 is what broke things. |
or at least, it touched this test - the PR says it passed CI...? |
Hard as I try I cannot reproduce the symbol test issue. I've tried with the configure parameters Ralph sent me offline, and the ones from Mellanox CI. Maybe there is an environment variable conflict for |
interesting - and I cannot get it to pass on my machines! I don't have any envar like MYBASE set. Yet it consistently fails for me with that message. |
Tried configuring with nothing other than a prefix, but same problem:
|
Aha - here is what is needed to make this test pass:
|
@jjhursey Do you have MYBASE set somewhere? I'm wondering if this perl script in the directory is somehow trying to set it, but is using a specific shell syntax that isn't correct for everyone? I'm using bashrc, which I think is what the test environment uses. |
Here it is, right in the beginning of the test:
So this test cannot run unless MYBASE is set. |
Signed-off-by: Ralph Castain <[email protected]>
I don't have it set in my environment (I'm using bash as well). The
|
Maybe this is the problem - from what I can see the correct automake directive is |
digging...digging...well, the automake docs say it is okay to use |
Curious. What version of automake are you using? I'm at From here I would think that |
What I have done is push a change to that script so it doesn't error out with a failed test if the envar isn't set. It just skips the test. This seems like a better test to me as the feature it is testing is purely optional. I'm using 1.15 as well - I believe the Jenkins tester is also. |
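For readers unfamiliar with the skip mechanism: the Automake test harness reports a test as SKIP rather than FAIL when it exits with status 77. The script discussed here is Perl and its actual change is not shown in this thread, so the following is only a C sketch of the same idea. The helper name `require_env_or_skip` is hypothetical.

```c
/* Request POSIX declarations from the C library headers. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Exit status that the Automake test harness reports as SKIP. */
#define TEST_SKIPPED 77

/* Hypothetical helper (not from the Open MPI tree): returns 0 when the
 * named environment variable is set to a non-empty value, and
 * TEST_SKIPPED otherwise so the caller can exit without failing. */
static int require_env_or_skip(const char *name)
{
    assert(name != NULL);

    const char *value = getenv(name);
    if (value == NULL || value[0] == '\0') {
        fprintf(stderr, "%s is not set; skipping test\n", name);
        return TEST_SKIPPED;
    }
    return 0;
}
```

A test's `main` could start with `int rc = require_env_or_skip("MYBASE"); if (rc != 0) return rc;` so that a missing MYBASE shows up as a skipped test in the summary instead of a failure.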
Okay, we are back to dlopen_test failing on various platforms. @hjelmn You have any thoughts as to why? I simply cannot replicate anywhere. |
Signed-off-by: Ralph Castain <[email protected]>
not worth my time |
If none of the provided search directories for plugins exists, then we really have a non-recoverable problem. As things stand, we fail with a completely unhelpful message from the first framework that tries to open components, finds nothing, and errors out because it needs at least one to work.
The use-case that tripped this was an incorrect setting of OPAL_PREFIX to point to a non-existent location due to a typo. It took significant debugging to find that it wasn't the framework that was the problem - it was the envar.
Try to detect that and provide a more useful error message.
Signed-off-by: Ralph Castain [email protected]
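The check described above can be sketched roughly as follows. This is not the code from the PR: the function names, the colon-separated path format, and the message text are all assumptions made for illustration. The idea is to examine every element of the search path, skip any that are missing instead of stopping at the first one, and complain only if none of them exists.

```c
/* Request POSIX declarations (strdup, strtok_r, stat). */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of OPAL: count how many elements of a
 * colon-separated search path name an existing directory. Every element
 * is examined; a missing one is skipped rather than ending the scan.
 * Returns -1 if memory for the working copy cannot be allocated. */
static int count_existing_dirs(const char *search_path)
{
    assert(search_path != NULL);

    char *copy = strdup(search_path);   /* strtok_r modifies its input */
    if (copy == NULL) {
        return -1;
    }

    int found = 0;
    char *save = NULL;
    for (char *dir = strtok_r(copy, ":", &save); dir != NULL;
         dir = strtok_r(NULL, ":", &save)) {
        struct stat st;
        if (stat(dir, &st) == 0 && S_ISDIR(st.st_mode)) {
            found++;
        }
    }

    free(copy);
    return found;
}

/* Hypothetical wrapper: returns 0 if at least one directory exists;
 * otherwise prints a message naming the likely cause and returns -1. */
static int check_component_path(const char *search_path)
{
    if (count_existing_dirs(search_path) > 0) {
        return 0;
    }
    fprintf(stderr,
            "None of the component search directories exist:\n"
            "  %s\n"
            "Check that OPAL_PREFIX points to a valid installation.\n",
            search_path);
    return -1;
}
```

Running a check like this once, before any framework tries to open its components, would let a mistyped OPAL_PREFIX surface as a path problem rather than as a "no components found" error from whichever framework happens to open first.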