-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, #6902) #6904
Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, #6902) #6904
Conversation
…trilinos#6902) Turns out that the tests in the "cuda-aware-mpi" builds on 'vortex' were not actually setting TPETRA_ASSUME_CUDA_AWARE_MPI=1. This commit fixes that and it also documents for users how to run the test suite with and without CUDA-aware MPI in Tpetra.
FYI: I manually merged this commit to 'atdm-nightly-manual-updates' in commit dc4093f so this will run tomorrow morning. But it would still be nice to get a review of the documentation before this gets merged (for which I will create new commits to fix any problems). |
There is actually a typo in the
It should have been setting I am going to try adding the fix to this PR |
I see the correct output from the
|
Yea, I noticed that when I was testing with |
@jjellio, that is after you fixed it, right? |
Yes, I pushed to this branch... I think, but it doesn't seem to be showing up in your branch. edit: e.g.,
|
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: bartlettroscoe |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
Yea, I don't know how to push a change to your PR. If you can spell out how to do it, I can push the typo correction. It would seem I need to clone your entire fork |
@jjellio: assuming Ross gives you permissions to push to his for you should be able to do: git remote add ross [email protected]:bartlettroscoe/Trilinos.git
git fetch ross
git checkout cdofa-100-tril-6902-cuda-aware-mpi
git cherry-pick YOUR-COMMIT-SHA
git push ross cdofa-100-tril-6902-cuda-aware-mpi |
@jjellio, I created this PR and checked the "Allow maintainers to edit branch" option. Therefore, if are a member of the Trilinos GitHub organization, you should be able to push to this branch. An alternative set of git commands are: git remote add bartlettroscoe [email protected]:bartlettroscoe/Trilinos.git
git fetch bartlettroscoe
git checkout --track bartlettroscoe/cdofa-100-tril-6902-cuda-aware-mpi
<edit any file you want>
git commit -a
git push # Will push to tracking bartlettroscoe/cdofa-100-tril-6902-cuda-aware-mpi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Ross! :-)
There is a small typo in the README.
cmake/std/atdm/README.md
Outdated
$ ctest -j4 | ||
``` | ||
|
||
By set the mode for CUDA-aware MPI, set the env var: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"To set the mode for..."
cmake/std/atdm/README.md
Outdated
$ ctest -j4 | ||
``` | ||
|
||
before running `ctest`. Otherwise, if `TPETRA_ASSUME_CUDA_AWARE_MPI` is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we hyperlink to the documentation for TPETRA_ASSUME_CUDA_AWARE_MPI
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@e10harvey, I don't think there is documentation for TPETRA_ASSUME_CUDA_AWARE_MPI
. I think our README file is it!
@@ -186,8 +186,6 @@ if [[ "$ATDM_CONFIG_COMPILER" == "CUDA-10.1.243_"* ]]; then | |||
export KOKKOS_NUM_DEVICES=4 | |||
|
|||
# CTEST Settings | |||
# TPETRA_ASSUME_CUDA_AWARE_MPI is used by cmake/std/atdm/ats2/trilinos_jsrun | |||
export TPETRA_ASSUME_CUDA_AWARE_MPI=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't local-driver.sh run after environment.sh? Why is this export of TPETRA_ASSUME_CUDA_AWARE_MPI
not picked up in local-driver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@e10harvey, the ctets-s-driver.sh script sources the load-env.sh script again and overwrites this. We need to just not touch the TPETRA_ASSUME_CUDA_AWARE_MPI
var in the atdm/environment.sh script.
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
It could help to look at what E.g., |
…os_jsrun (CDOFA-100) This ensures that the trilinos_jsrun script's logic for setting the default for TPETRA_ASSUME_CUDA_AWARE_MPI=0 if TPETRA_ASSUME_CUDA_AWARE_MPI is unset actually works. That is important for developers who just want to manually run the test suite.
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: bartlettroscoe |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
… mention trilinos_jsrun (CDOFA-100, trilinos#6902)
0228511
to
85614cf
Compare
@e10harvey, @jjellio, and/or @kddevin, can you please review the updated documentation in the atdm/READM.md file? I made a few changes to hopefully make it more clear what is going on and I now mention the existence of |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
FYI: In commit 506fac1 I made the non-CUDA-aware MPI build start with unset
which posted to:
We can see the tests in the
Now that looks correct. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: bartlettroscoe |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks, Ross!
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ e10harvey ]! |
Status Flag 'Pull Request AutoTester' - AutoMerge IS ENABLED, but the Label AT: AUTOMERGE is not set. Either set Label AT: AUTOMERGE or manually merge the PR... |
Okay, I will let this merge now ... |
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged |
Merge on Pull Request# 6904: IS A SUCCESS - Pull Request successfully merged |
```bash | ||
$ ctest -j4 | ||
``` | ||
|
||
The MPI test exectuables are run by a wrapper script `trilinos_jsrun` which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also like to know how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration. If one wants to reproduce a failing test, where should one look in CDash to get the value used for that test? The environment variable setting is not archived in the CMake configuration output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also like to know how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration.
It is printed out by trilinos_jsrun
before it runs jsrun
. Therefore, that information is on CDash in the detailed test output. For example, if you if you look at the output for the test TpetraCore_Behavior_Default_MPI_4
here you will see:
BEFORE: jsrun '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
If you compare that to the CUDA-aware running of that same test here you see:
BEFORE: jsrun '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=1; jsrun '-E LD_PRELOAD=/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2019.06.24/lib/pami_451/libpami.so' '-M -gpu' '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
Hopefully that is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; that information should be in the documentation. Trilinos developers are not accustomed to looking for that information, and it is specialized to the ATS-2 builds.
It could make sense to document that the different machines may report 'cuda-awareness' differently based on the node you are on.... UGH! At the above point in time, if you configured with the rolling release module on login, you could get 'not-cuda-aware', and if you configured with a specific dated version OR on a compute, they would report 'cuda-aware'. This motivated (in part) why testing is orchestrated the way it is. Now, with ATS2 testing, we are ensuring that regardless of what CMake discovered we will attempt to exercise both code paths. I'm still not sure that downstream users will want to use I point out the prior paragraph because the wrapper depends on the ENV, and I don't think many users are aware that Tpetra's cuda-awareness is a runtime option. (they are accustom to CMake setting a default and just using whatever) The danger if using the It may also be worth documenting that the wrapper does something special for A goal in how I wrote the wrapper was that it always shows the user a single command line they could copy/paste. (BEFORE/AFTER) - It would make sense to drive that home - If a developer wants to replicate the wrapper, they can copy/paste one of those lines and they should get similar behavior.**** **** CAVEAT: With the |
Turns out that the tests in the "cuda-aware-mpi" builds on 'vortex' were not actually setting TPETRA_ASSUME_CUDA_AWARE_MPI=1 (see CDOFA-100). This commit fixes that and it also documents for users how to run the test suite with and without CUDA-aware MPI in Tpetra (#6902).
How was this tested?
On 'vortex' I ran:
That posted to CDash:
Evidence that
TPETRA_ASSUME_CUDA_AWARE_MPI
is being set correctly now is these builds can be seen with the testTpetraCore_Behavior_Default_MPI_4
in the buildTrilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt-exp
here showing:and in the build
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi
here showing: