Add GTL #716
Conversation
What would they have to do right now to make it work? Or asked differently: is this changing things from "inconvenient" to "even more inconvenient" for users on systems without sysadmin support, or from "impossible" to "doable but inconvenient"? If it is the latter, I think it will be an improvement nonetheless, wouldn't it?
This is quite a cumbersome patch just to deal with Cray MPI; I wish there were a better way. What does mpi4py do?
Since it's only required at runtime, we could just do it based on environment variables?
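As a sketch of the environment-variable idea: the decision could be made at launch time, gated on a switch like Cray MPICH's `MPICH_GPU_SUPPORT_ENABLED` (both the variable name and the library path below are assumptions for illustration, not something this PR defines):

```shell
# Hypothetical: only preload the GTL library when the GPU-support
# switch is on. Variable name and path are assumptions.
gtl_lib="/opt/cray/pe/mpich/default/gtl/lib/libmpi_gtl_cuda.so"
if [ "${MPICH_GPU_SUPPORT_ENABLED:-0}" = "1" ]; then
    preload="$gtl_lib"
else
    preload=""
fi
echo "preload=${preload:-none}"
```

The downside, as discussed below, is that this couples MPI.jl to a vendor-specific variable name.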
How is this logic handled for C programs?
Right now, they would either have to use:
or add
before the first
Re advice for users who can't ask the sysadmin, I would document an example that would cover both Perlmutter and Frontier.
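Such a documented example might look like the following `LocalPreferences.toml` sketch (hedged: the `gtl_names` option is the one introduced by this PR, but the exact schema, library names, and other keys here are illustrative):

```toml
[MPIPreferences]
_format = "1.1"
binary = "system"
libmpi = "libmpi_cray"
mpiexec = "srun"
# Candidate GTL library names; listing both the CUDA and ROCm flavors
# would cover Perlmutter and Frontier. Names are illustrative.
gtl_names = ["libmpi_gtl_cuda", "libmpi_gtl_hsa"]
```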
They do the same thing that Cray always tells us to do: "use the compiler wrappers to build mpi4py".
Urgh ... If it were up to me alone, then sure! Let's put in an env variable. But I kinda like the idea of having preferences managed by ... well ... Preferences (with a capital "P"). Anyway, GTL is part of using the system binary, so I think keeping this alongside libmpi makes sense.
They are compiled using the Cray compiler wrappers -- I don't know how the compiler wrappers work in detail (no / very limited documentation). I suspect they futz around with the linker to make sure that GTL is linked before MPI. Note: when you build a program with GTL enabled, it can't run on CPU nodes. So the compiler wrappers do insert something...
Ah, that's disappointing. I was hoping there was some magic environment variable set on your GPU nodes that we could rely upon. Ah well.
Why do you need two Preferences.toml files for each type of node?
I need to go to bed, but I can get on board with this if we make it a little less Cray-specific: what if we just called it
That's what I was hoping for also...
One with the preloads and one without.
I like this! It would be a bit more effort, but would cover a broader set of use cases. If we can define preloads that depend on an env var (…)
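For reference, the two-files approach could be wired up from a modulefile, by putting each node type's project (a `Project.toml` plus the matching `LocalPreferences.toml`) on the Julia load path. This is a hedged sketch; the directory names and hostname patterns are invented:

```shell
# Hypothetical modulefile logic: prepend a node-type-specific project
# directory (containing the right LocalPreferences.toml) to
# JULIA_LOAD_PATH. Directory names are invented for illustration.
case "$(hostname)" in
    *gpu*) prefs_dir="/site/julia/prefs-gpu" ;;  # preferences with GTL preloads
    *)     prefs_dir="/site/julia/prefs-cpu" ;;  # preferences without preloads
esac
export JULIA_LOAD_PATH="$prefs_dir:${JULIA_LOAD_PATH:-@:@v#.#:@stdlib}"
echo "$JULIA_LOAD_PATH"
```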
So I am wondering if we should do something:
Now the big question for me is how to deal with ROCm vs CUDA... and do we need something like a
Vendor flags might make a lot of sense for a different reason @vchuravy: as it stands right now, if a user loads, say,
I build different Julia modules for each PE, but if a user rolls their own Julia environment then it might rely on a specific PE. Having smarter logic in MPI.jl would fix that.
The ROCm version of the GTL is called
How is libgtl related to libmpi? The latter requires something (symbols?) from the former? If so, does libmpi dynamically link to libgtl (i.e. what's the output of
@giordano my assumption is that they use dlsym to see if the library is preloaded/linked into the binary. Right now I just want to shout at Cray and have everyone use OpenMPI.
Yea, there are no symbols that libmpi needs from
What @vchuravy says makes sense. I can't find any documentation on this (other than: if you see this error, recompile with the Cray compiler wrappers). I'm following this up at NERSC, to see if Cray would be willing to change the behavior of Cray MPICH. We should still work on vendor flags in the meantime.
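The dynamic-linking question above could be checked directly with `ldd` (the libmpi path below is an assumption for a Cray PE install; on a machine without it, the check just reports nothing found):

```shell
# Hypothetical check: if libgtl were a hard dependency of libmpi, it
# would appear among the NEEDED entries printed by ldd. Path assumed.
libmpi="/opt/cray/pe/mpich/default/lib/libmpi_cray.so"
deps=$(ldd "$libmpi" 2>/dev/null | grep -i gtl)
echo "${deps:-no gtl among NEEDED entries}"
```

An empty result would support the dlsym theory: libgtl is discovered at runtime rather than declared as a link-time dependency.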
I don't understand: wouldn't you always want the preloads for the GPU nodes, and no preloads on non-GPU?
You mean CPU? Mainly for sanity: I don't know what GTL will do on a system without GPUs ... This is also more general: NERSC has a history of systems with different kinds of nodes. @simonbyrne I like your approach of keeping it general. In general, different kinds of nodes might keep libraries in different places, etc. NERSC has been using Slurm and the module system to give users a way to deploy their codes on different hardware (e.g. Cori GPU). This also isn't unique to NERSC.
Are you able to join the JuliaHPC meeting on Tuesday? It might be easier to discuss there.
@vchuravy in 20 years we'll have compatible ABIs https://www.mpich.org/abi/ |
Sadly no. I can do an impromptu call at 9am PT tomorrow (Monday) |
Quick update: I just confirmed that no preloads are necessary when setting
This doesn't get us off the hook completely though, as we still need to preload GTL for GPU-aware MPI. The nice thing is that we don't strictly need to not preload GTL. I am going to work on vendor flags regardless, as they might still be useful for automatically adding vendor preloads (e.g. picking the "right" GTL for AMD vs Nvidia).
What if we were to load the GTL if
Possible alternative to #716
What is
Right, so that's the spirit behind #717 -- that would solve part of the problem, but it makes us vulnerable to env var names changing. Also it doesn't help with deciding between
@vchuravy Here you go:
So now the question is what is exported on Frontier/Crusher. My worry is that the symbols overlap and it would only be legal to preload one of them.
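The overlap worry could be checked by comparing the exported symbols of the two GTL flavors. A hedged sketch (the install path is an assumption; on a system without the Cray PE this simply reports zero overlap):

```shell
# Hypothetical: list the dynamic symbols each GTL flavor defines and
# count the names they both export; a non-empty intersection would
# make preloading the "wrong" one unsafe.
gtl_dir="/opt/cray/pe/mpich/default/gtl/lib"   # path is an assumption
for lib in libmpi_gtl_cuda.so libmpi_gtl_hsa.so; do
    nm -D --defined-only "$gtl_dir/$lib" 2>/dev/null |
        awk '{print $NF}' | sort -u > "$lib.syms"
done
overlap=$(comm -12 libmpi_gtl_cuda.so.syms libmpi_gtl_hsa.so.syms | wc -l | tr -d '[:space:]')
echo "overlapping symbols: $overlap"
```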
Ah! Looks like cleaning up the formatting solved the docs-build problem.
Can someone familiar with CI comment on what I should do about the failing tests? Right now I don't understand how, or if, my changes triggered these regressions.
@simonbyrne any chance you can merge this? |
To try to keep backward compatibility, we should only update the `_format` key if it uses the new features. Otherwise, we can keep it at `"1.0"`.
Only bump format for where the new version is needed
only require v1.1 if vendor is input
LGTM!
Ok, so I've cleaned things up a bit. I moved all the preload logic to
@simonbyrne @vchuravy feel free to merge.
LGTM. Just need to add the docstring to the docs:
@simonbyrne Docstring added |
Can you bump the patch version of MPIPreferences? |
This looks good -- @simonbyrne do you also want to bump the MPI.jl patch version? |
Yay!!! GPU-aware MPI breaks more ABIs. Here's an example of what happens without loading GTL before a `libmpi` that needs it:

(sorta makes sense I guess 🤨 ... vendors don't want to have to compile two different libmpis ... just insert a libgtl whenever GPUs are around ... yes, that's wayyyy better 😛 )
In the case of some system MPI libraries, GPU-aware MPI is implemented as another library -- bearing the fancy name of GPU Transport Layer (GTL). For example, on Perlmutter it's called `libmpi_gtl_cuda.so`. Often it's important that this library is loaded before libmpi. These changes do the following:

- `MPIPreferences` has an option `gtl_names`, which -- if not `nothing` -- is a list of possible names for the GTL library
- `MPI` will dlopen `libgtl` before `libmpi` (if not `nothing`)

I have tested this on Perlmutter. Will test on Crusher next. Also I don't know if I accidentally broke MPITrampoline, which I will check asap.
This PR represents a tradeoff. Clearly there is no standard way that GTL is defined. So I avoided creating a default search strategy. One could be tempted to look for Cray systems and then "just load GTL". This would cause problems on our CPU nodes, which have the GTL libraries installed (we want to have a single SW image for all nodes), but don't support it (what GPUs? this is a CPU node!).
This PR allows us (the helpful sysadmins) to provide two different `Preferences.toml` files, one for each type of node. It does come at the cost of users potentially having to manage different `LocalPreferences.toml` files if they have MPI in their `LocalPreferences`.
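Concretely, the two files might differ only in the GTL entry. A hedged sketch (key names beyond `gtl_names` follow MPIPreferences' system-binary settings; the library names are illustrative):

```toml
# GPU-node LocalPreferences.toml (illustrative)
[MPIPreferences]
_format = "1.1"
binary = "system"
libmpi = "libmpi_cray"
gtl_names = ["libmpi_gtl_cuda"]

# The CPU-node file would be identical, but with no gtl_names entry,
# so nothing is preloaded on nodes without GPUs.
```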