CUDA + MPI_TYPE_INDEXED + MPI_SEND/RECV super slow #12202
Comments
The current approach of packing/unpacking device memory issues one cudaMemcpy per item in the indexed datatype. The performance will be terrible. This is something I have wanted to look at for a long time, but I can't give you a time estimate for a potential fix. Meanwhile, for device memory I would suggest packing the data yourself (with a device function if possible).
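For illustration, a minimal sketch of such manual device-side packing (the kernel name, launch configuration, and buffer names are assumptions, not anything Open MPI provides): one kernel launch gathers the indexed elements into a contiguous device buffer, which can then be sent as a plain MPI_DOUBLE array with a CUDA-aware build.

#include <cuda_runtime.h>
#include <mpi.h>

/* Hypothetical packing kernel: one thread per element instead of one
 * cudaMemcpy per element of the indexed datatype.                    */
__global__ void pack_indexed(const double *src, const int *idx,
                             double *packed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        packed[i] = src[idx[i]];
}

/* Pack on the device, then send the contiguous buffer (CUDA-aware MPI). */
static void pack_and_send(const double *d_field, const int *d_idx,
                          double *d_sendbuf, int n, int dest, int tag)
{
    pack_indexed<<<(n + 255) / 256, 256>>>(d_field, d_idx, d_sendbuf, n);
    cudaDeviceSynchronize();
    MPI_Send(d_sendbuf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}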
Thanks, appreciate it. Let me know if I can help. As I wrote, something like https://docs.nvidia.com/cuda/cusparse/#cusparsegather and scatter would do the trick. Is including gdr_copy worth the effort, or is it only useful in combination with InfiniBand? Affinity handling by mpirun is also not so optimal (just my opinion); each rank should auto-bind close to the same CUDA device (or rank % cuda_device_count). I don't like the manual discovery method suggested by the FAQ... but that's another topic.
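As a concrete example of that linearization (a sketch only; buffer names, counts, and the surrounding function are placeholders, error checking omitted), cusparseGather() from the generic cuSPARSE API can pack field[idx[i]] into a contiguous device buffer before the send:

#include <cuda_runtime.h>
#include <cusparse.h>
#include <mpi.h>
#include <stdint.h>

/* Sketch: gather d_field[d_idx[i]] into the contiguous device buffer d_buf,
 * then send that buffer with a CUDA-aware MPI build.
 *   d_field: full field on the device, n_field doubles
 *   d_idx:   halo indices on the device, n_halo int32 entries
 *   d_buf:   contiguous send buffer on the device, n_halo doubles          */
static void send_halo(cusparseHandle_t h, double *d_field, int64_t n_field,
                      int *d_idx, double *d_buf, int64_t n_halo,
                      int dest, int tag, MPI_Comm comm)
{
    cusparseDnVecDescr_t full;
    cusparseSpVecDescr_t packed;
    cusparseCreateDnVec(&full, n_field, d_field, CUDA_R_64F);
    cusparseCreateSpVec(&packed, n_field, n_halo, d_idx, d_buf,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);

    cusparseGather(h, full, packed);   /* d_buf[i] = d_field[d_idx[i]] */
    cudaDeviceSynchronize();           /* make sure the gather finished */
    MPI_Send(d_buf, (int)n_halo, MPI_DOUBLE, dest, tag, comm);

    cusparseDestroySpVec(packed);
    cusparseDestroyDnVec(full);
}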
FWIW: some of us are working on that now.
I would suggest that any fixes be applied to v5.0.x and later, and probably not be back-ported to v4.1.x.
But then you would need to mark MPI_TYPE_INDEXED as an experimental feature and not expose it to users without warnings. The way it is implemented right now is worse than not implementing it at all (forcing the user to find alternatives). I was lucky to come from the other direction (packing with device functions, then a linear transfer), so I knew it would work rather fast; but a user who approaches this by using MPI_TYPE_INDEXED first, assuming OMPI always works optimally, would find it rather frustrating and discouraging.
Background information
We developed a CFD code some years ago for arbitrary 3D shapes and parallelized it via domain decomposition. To exchange halo/ghost cell data we used gather/scatter loops to combine the data into a linear buffer before sending. That worked very well, surprisingly so even for GPU-to-GPU, despite the detour of copying that buffer into host memory first, which freed MPI from having to know or care about device memory. Now that multi-GPU environments are becoming standard, we wanted to see if we could improve parallelism by transferring device-to-device with CUDA-aware Open MPI.
To make the code even simpler we switched to MPI_TYPE_INDEXED and sent/received directly with that type instead of first creating a linear buffer. This works very well in a classic HPC environment (about 5% faster than the previous method), but simply applying the same logic to device memory has a tremendous impact on speed (a factor of 500 slower). The detour of linearizing first in device memory via cusparseGather() and cusparseScatter() yielded at least the same speed as before (with/without the detour to host memory).
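For context, a minimal sketch of the pattern in question (the index arrays, counts, and the neighbor rank are illustrative placeholders): an indexed datatype describing the halo cells is committed and then used directly in the send/receive on the field array. On host memory this is fast; with a device pointer it is the case that is currently dramatically slower.

#include <mpi.h>

/* Sketch of the halo exchange described above (names/counts illustrative). */
static void exchange_halo(double *field, int neighbor,
                          int n_send, const int *send_blocklens, const int *send_displs,
                          int n_recv, const int *recv_blocklens, const int *recv_displs)
{
    MPI_Datatype send_type, recv_type;
    MPI_Type_indexed(n_send, send_blocklens, send_displs, MPI_DOUBLE, &send_type);
    MPI_Type_indexed(n_recv, recv_blocklens, recv_displs, MPI_DOUBLE, &recv_type);
    MPI_Type_commit(&send_type);
    MPI_Type_commit(&recv_type);

    /* 'field' may be host memory (fast) or device memory with a CUDA-aware
     * build -- the slow case reported here.                                */
    MPI_Request req;
    MPI_Isend(field, 1, send_type, neighbor, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(field, 1, recv_type, neighbor, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Type_free(&send_type);
    MPI_Type_free(&recv_type);
}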
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Tested with OMPI 4.1 and the latest 5.1 branch; same outcome.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled on the system with the latest UCX + CUDA 12.3.1.
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
+088c88499514c3c55aee0c433d6f4fb46b0c3587 3rd-party/openpmix (v1.1.3-3973-g088c8849)
bc5de986e13c14db575141714a8b5e0089f4d0bf 3rd-party/prrte (psrvr-v2.0.0rc1-4741-gbc5de986e1)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
Please describe the system on which you are running
Details of the problem
Thanks a lot for your help!