OMPI v3.0: pmix install dirs info is not propagated to orteds #3980
Comments
@rhc54 I know that at the dev meeting we had a discussion about variable propagation. Do we have any news on that? |
One of the obvious solutions is to check if |
We said at the meeting that we would create some kind of registration mechanism to cover these things, but that won't be ready for awhile and certainly wouldn't go into 3.0. For now, the only real solution is to manually do these things in the schizo/ompi component. I can add some code to cover it. |
Ok, thank you. |
Actually, one correction: we only forward OPAL_PREFIX for the plm/rsh component. So the addition will occur there. |
I only checked with plm/rsh, but it would be good to make sure that the others are working fine as well. |
There actually isn't any way to forward something in the other methods - we have to rely solely on the configuration of the resource manager. Usually that config will forward nearly everything - i.e., the config generally doesn't scan for particular envar patterns like OPAL_. However, you are correct that users know to set the OPAL envar and expect it to cover everything. Probably the easiest solution is to just special case it and see if the daemon spots it, and then add the corresponding envar. Ugly, but likely the only current solution. |
To clarify: in the issue description, the PMIX_INSTALL_PREFIX variable was manually set by me. |
Discussed on the call today. Adding the Blocker label to this for v3.0. I expect there is a similar issue with the hwloc components directory as well, except that hwloc doesn't build mca components by default (?) |
@bwbarrett we have some discussion of the current solution in #3985. |
What about a convention that the PMIX mca components and the HWLOC mca components are always in a known relative subdirectory of the OPAL_PREFIX IF the specific PMIX_PREFIX or HWLOC_PREFIX (if that's a thing) isn't set? |
This is fine unless we are using external components. |
Here is an update to the problem description, as it seems that there might be a misunderstanding: We have 2 problems now: This issue is mostly about a), but it is fine to fix b) as part of it. With respect to a), here is what I observe:
export PMIX_INSTALL_PREFIX
mpirun -np 2 -H cn01 hostname
will work fine, while
export PMIX_INSTALL_PREFIX
mpirun -np 2 -H cn01,cn02 hostname
is not working because OMPI does not propagate
[cn01:17570] [[10714,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> \
OPAL_PREFIX=<ompi-path>/ompi-v3.0.x ; export OPAL_PREFIX; \
PATH=<ompi-path>/ompi-v3.0.x/bin:$PATH ; export PATH ; \
LD_LIBRARY_PATH=<ompi-path>/ompi-v3.0.x/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; \
DYLD_LIBRARY_PATH=<ompi-path>/ompi-v3.0.x/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; \
<ompi-path>/ompi-v3.0.x/bin/orted -mca orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "702152704" \
-mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "cn01,cn02@0(2)" \
-mca orte_hnp_uri "702152704.0;tcp://<IP1>,<IP2>;ud://<UD>" -mca coll_hcoll_enable "1" -mca pml "yalla" \
--mca plm_base_verbose "100" -mca plm "rsh" -mca rmaps_base_mapping_policy "node" \
-mca hwloc_base_binding_policy "core" -mca rmaps_base_display_map "1" |
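The template above shows the pattern used to forward a variable over rsh: a `NAME=value ; export NAME ;` fragment placed ahead of the orted command. A minimal Python sketch of that pattern follows, only to illustrate why a variable that is not in the forwarded set never reaches the remote daemon. The function name `remote_env_prefix` and the paths are hypothetical, and this is not the ORTE implementation (which is written in C).

```python
import shlex


def remote_env_prefix(env, forwarded):
    """Build the 'NAME=value ; export NAME ;' fragments that precede the
    remote command. Names missing from env are skipped."""
    parts = []
    for name in forwarded:
        value = env.get(name)
        if value is None:
            continue
        parts.append(f"{name}={shlex.quote(value)} ; export {name} ;")
    return " ".join(parts)


# Hypothetical local environment of the mpirun process.
env = {
    "OPAL_PREFIX": "/opt/ompi-v3.0.x",
    "PMIX_INSTALL_PREFIX": "/opt/ompi-v3.0.x",
}

# Only OPAL_PREFIX is forwarded: the remote side never sees PMIX_INSTALL_PREFIX.
only_opal = remote_env_prefix(env, ["OPAL_PREFIX"])

# Adding PMIX_INSTALL_PREFIX to the forwarded set makes it visible remotely.
with_pmix = remote_env_prefix(env, ["OPAL_PREFIX", "PMIX_INSTALL_PREFIX"])
```

Every forwarded variable lengthens the remote command line, which is the cost raised later in the thread.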
Whoa there, partners - you are blowing this way out of proportion.
First off, it is the user's responsibility to set the paths for both OMPI and any secondary libraries on the backend nodes. We cannot take responsibility for forwarding paths for everything - the command line has stringent length limits. There was a lot of argument about excepting OPAL_PREFIX back when we first did it, and it wasn't clear that we should be doing so as it became an exception to the rule. However, it was felt that it was convenient enough - and a special enough use-case - to warrant making an exception. The most compelling rationale was that it dealt specifically with OMPI internal libraries, and so we were only supporting what we ship.
HWLOC doesn't have any plugins, and so it doesn't need "prefix" support. Ditto for libevent. So please leave them out of this discussion.
It isn't clear to me at all that we should be forwarding PMIX_INSTALL_DIR for someone that is using an external PMIx library. This isn't an OMPI library issue - it is the user's responsibility to ensure the backend is correctly set up.
The only issue we have is that the user community expects OPAL_PREFIX to result in a fully functional OMPI installation. This therefore must include pointing to the location of the internal PMIx plugin directory, which is based on OPAL_PREFIX. Since the PMIx library cannot see OPAL_PREFIX, we must manually make the translation to ensure the user gets what they expect.
So again, to be clear: we bear no responsibility for forwarding things on behalf of external libraries, else we would need to do so for UCX, PSM, and every other library we link against. We only bear responsibility to ensure that the OMPI-included libraries work correctly together when someone sets OPAL_PREFIX. My commit does that - I see no reason for us to be doing anything different. |
@rhc54 one comment
I'm not sure I understand. If we don't forward the PMIx env, how can the user communicate this to the server-side part of the external PMIx sitting in the orteds? |
We always require that users take responsibility for ensuring that their environment (path and library path) is properly set up on the backend to reach any libraries they linked against OMPI. This includes external copies of hwloc, libevent, and pmix. If they installed their external pmix in a location that requires PMIX_INSTALL_DIR, then they need to set that in their backend environment - if they are using rsh, the normal method is to put it in their bashrc or equivalent. We only take care of our own internal code. In this case, our concern has to be that setting OPAL_PREFIX should ensure that all our internal code is pointed to the correct locations. So it makes sense that we set PMIX_INSTALL_DIR under the covers when OPAL_PREFIX has been set. The only caveat to this hinges on what happens if the user has PMIX_INSTALL_DIR set in their environment for some other reason (e.g., when working with something that used an external PMIx library). There are two cases of concern here:
Given those situations, I'm beginning to think that we should do the following when we execute:
I suppose one could argue about direct launch vs mpirun for that last case, but the problem is that the app will still be linked against libopen-pal, which points to our internal PMIx library. So letting the direct launched app grab plugins from some other library is going to lead to trouble. At configure time, we should check to see if PMIX_INSTALL_DIR is set. If we are building the internal PMIx library, then we should error out with a message indicating that this will cause problems. HTH |
BTW: that last point should only be done if having PMIX_INSTALL_DIR set actually causes us to relocate the PMIx plugins when we build the internal PMIx library. If not, then we can just ignore that envar. |
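The rule discussed above (when OPAL_PREFIX is set, point the internal PMIx at the same relocated tree) can be sketched roughly as follows. This is a Python illustration of the decision logic only, not the code from the actual commit. It uses the `PMIX_INSTALL_PREFIX` name from the issue description, it assumes the internal PMIx prefix equals the OPAL_PREFIX value, and the `override_existing` switch is a hypothetical stand-in for the open question of what to do when the user already has a PMIx prefix in the environment.

```python
def derive_pmix_prefix(env, override_existing=False):
    """Return a copy of env in which PMIX_INSTALL_PREFIX follows OPAL_PREFIX.

    Assumption: the internal PMIx plugins live under the OMPI prefix, so the
    two prefixes share one value.
    """
    result = dict(env)
    opal_prefix = env.get("OPAL_PREFIX")
    if opal_prefix is None:
        # Nothing was relocated, so there is nothing to translate.
        return result
    if "PMIX_INSTALL_PREFIX" in env and not override_existing:
        # The user set a PMIx prefix for some other reason; leave it alone.
        return result
    result["PMIX_INSTALL_PREFIX"] = opal_prefix
    return result


relocated = derive_pmix_prefix({"OPAL_PREFIX": "/opt/ompi-v3.0.x"})
preset = derive_pmix_prefix(
    {"OPAL_PREFIX": "/opt/ompi-v3.0.x", "PMIX_INSTALL_PREFIX": "/opt/pmix"}
)
forced = derive_pmix_prefix(
    {"OPAL_PREFIX": "/opt/ompi-v3.0.x", "PMIX_INSTALL_PREFIX": "/opt/pmix"},
    override_existing=True,
)
```

Whether an existing value should win or be overridden is exactly the point debated in this thread, so the default chosen here should not be read as the agreed behavior.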
@rhc54 I don't see a |
I believe that |
I agree with that, but the recent commit does that differently: |
Our (IBM) original thinking on this topic was:
The argument against this would be that if there is a PMIx installed in the default search path, then we don't want to force the sysadmin to set In any case we should be propagating Do we also need to propagate |
install prefix is enough - I verified that at runtime. However I'm not sure I understand your point about the handling of existing/nonexisting install prefix. |
@jjhursey re-reading your comment, it seems like you are suggesting to revert all the changes in the pmix2 component (that auto-set the PMIx install prefix) and keep only the rsh part that propagates the install prefix envar. Is that correct? |
You can't just blindly propagate like that - see my notes |
Some notes from our teleconf.
Case A: Internal PMIx component
Action Item: Ralph is working on a PR for these cases. Will be an extension of the work started in PR #4012.
Case B: External PMIx component
Action Item: Josh is working on a PR for the external component MCA parameter to set
Slightly better error checking for bad installations
Action Item: Ralph will work on PRs for the error handling bullets above. |
Note: I did get started on the external component MCA parameter. There are a couple snags that I'll need to come back to. Plan to have more on Monday. |
@jjhursey have your external component changes gone in? If not, I guess that this issue is not fully solved in master either. |
There is nothing to do in the external components. OPAL_PREFIX doesn't impact them, and it is the user's responsibility to propagate any PMIx prefix requirements. |
I have a question on that btw. |
There are many things that one environment provides and others don't, so this isn't the only difference users encounter. All managed systems generally pass a fair range of envars, depending on configuration, so most of those will likely be okay. Users running on unmanaged systems are most likely to build against the internal PMIx, or against one installed by the system that is going to be in a standard location - and thus they should also be okay. The corner-case scenarios will undoubtedly appear, but we can't solve everything - the cmd line limitations won't let us. |
This reverts commit 71da0fc. (per open-mpi#4052). Refs: open-mpi#3980 Signed-off-by: Artem Polyakov <[email protected]>
Merged #4076, so closing this ticket. |
Open MPI version
git clone of ompi v3.0.x (2f13cce)
Details of the problem
If an OMPI installation is moved to a different location, the OPAL_PREFIX env var is used to identify that, and installdirs/env is handling this correctly.
PMIx also has MCA infrastructure and a similar installdirs component. In PMIx, PMIX_INSTALL_PREFIX plays the role of OPAL_PREFIX. The important difference is that OPAL_PREFIX is an OMPI variable and it gets propagated to orte daemons, i.e.:
As can be seen there, PMIX_INSTALL_PREFIX is not propagated, so orteds are failing with:
Because they can't find PMIx mca ptl components.