PR 12101 broke support for PRTE_MCA_ras_base_launch_orted_on_hn #12150

Closed
hppritcha opened this issue Dec 7, 2023 · 4 comments

Comments

@hppritcha
Member

The sha advance for the 3rd-party/prrte done in PR #12101 pulled in a bug that broke support for PRTE_MCA_ras_base_launch_orted_on_hn when set to 1.

This parameter is important when

  • using a system managed with SLURM or Cray PALS (native launcher)
  • the vendor-supported method for acquiring RDMA credentials to use "high speed" networks relies on mechanisms within the SLURM/PALS launch procedure to provide the application processes with this info (the case for HPE SS11).
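
For reference, the setting at issue is passed through the environment before launch. The following is a minimal sketch, assuming a Python job wrapper; the helper names `launch_env` and `run_job`, the application path `./my_app`, and the process count are hypothetical and not taken from this issue. It only shows the parameter being set to 1, which is the configuration reported here as broken.

```python
import os
import subprocess

# Parameter named in this issue; "1" is the value reported as broken.
PARAM = "PRTE_MCA_ras_base_launch_orted_on_hn"


def launch_env(base_env=None):
    """Return a copy of the environment with the parameter set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env[PARAM] = "1"
    return env


def run_job(app="./my_app", nprocs=2):
    """Launch a placeholder application with mpirun using that environment.

    `app` and `nprocs` are illustrative placeholders, not values from the issue.
    """
    cmd = ["mpirun", "-np", str(nprocs), app]
    return subprocess.run(cmd, env=launch_env(), check=False)
```

Calling `launch_env()` alone is enough to check that the variable reaches the child environment; `run_job()` additionally requires an Open MPI installation on the path.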

The bug is supposed to be fixed in PRRTe master but not yet in release branches used by Open MPI.

This problem is in both main and 5.0.x at the moment.

I'm marking this as critical because sites using HPE SS11 that do not support PMIx in SLURM or PALS have no alternative to a prrte-based launch, so there is currently a failure to launch (at least easily) on these platforms. I believe this is the case for ORNL systems.

@hppritcha
Member Author

see comments in 2f9cabf for more details

@rhc54
Contributor

rhc54 commented Dec 7, 2023

Are you saying that advancing the submodule pointers does not fix the problem? Or are you just filing this as a reminder to update the pointers before release (which is planned anyway)?

@hppritcha
Member Author

Are you saying that advancing the submodule pointers does not fix the problem? Or are you just filing this as a reminder to update the pointers before release (which is planned anyway)?

The latter, so that we (Open MPI) don't forget to advance the shas to pull in the fix once you've committed it to the prrte release branches.

@janjust
Contributor

janjust commented Dec 12, 2023

fixed with #12152
