Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, #6902) #6904

Merged
3 changes: 3 additions & 0 deletions cmake/ctest/drivers/atdm/ats2/local-driver.sh
@@ -16,6 +16,9 @@ source $WORKSPACE/Trilinos/cmake/std/atdm/load-env.sh $JOB_NAME

set -x

# Allow default setting for TPETRA_ASSUME_CUDA_AWARE_MPI=0 in trilinos_jsrun
unset TPETRA_ASSUME_CUDA_AWARE_MPI

atdm_run_script_on_compute_node \
$WORKSPACE/Trilinos/cmake/ctest/drivers/atdm/ctest-s-driver.sh \
$PWD/ctest-s-driver.out \
41 changes: 34 additions & 7 deletions cmake/std/atdm/README.md
@@ -1037,12 +1037,12 @@ $ bsub -x -Is -n 20 \

### ATS-2

Once logged on a suppported ATS-2 system like 'vortex' (SRN), one can either
build and configure on the login node or the compute node. Make sure to setup
SSH keys as described in `/opt/VORTEX_INTRO` before trying to build on a
compute node. For example to configure, build and run the tests for the
default `cuda-debug` build for `Kokkos` (after cloning Trilinos on the
`develop` branch), do:
Once logged on a supported ATS-2 system like 'vortex' (SRN), one can either
build and configure on a login node or a compute node. Make sure to setup SSH
keys as described in `/opt/VORTEX_INTRO` before trying to build on a compute
node. For example, to configure, build and run the tests for the default
`cuda-debug` build for `Kokkos` (after cloning Trilinos on the `develop`
branch), do:

```bash
$ cd <some_build_dir>/
@@ -1058,7 +1058,8 @@ $ cmake -GNinja \
$ make NP=20
```

You may run the above commands from an interactive bsub session as well:
You may run the above commands from an interactive bsub session as well using:

```bash
$ bsub -J <YOUR_JOB_NAME> -W 4:00 -Is bash
```
@@ -1067,10 +1068,36 @@ CTest runs everything using the `jsrun` command. You must run jsrun from a
compute node which can be acquired using the above bsub command.

Once you're on a compute node, you can run ctest. For example:

```bash
$ ctest -j4
```

The MPI test executables are run by a wrapper script `trilinos_jsrun` which
Contributor

I would also like to know how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration. If one wants to reproduce a failing test, where should one look in CDash to get the value used for that test? The environment variable setting is not archived in the CMake configuration output.

Member Author

> I would also like to know how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration.

It is printed out by trilinos_jsrun before it runs jsrun. Therefore, that information is on CDash in the detailed test output. For example, if you look at the output for the test TpetraCore_Behavior_Default_MPI_4 here you will see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

If you compare that to the CUDA-aware running of that same test here you see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=1; jsrun  '-E LD_PRELOAD=/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2019.06.24/lib/pami_451/libpami.so' '-M -gpu' '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

Hopefully that is clear.
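
For anyone reproducing a failure, the check described above can be scripted against saved detailed test output. The following is a minimal sketch and not part of Trilinos: the function name `cuda_aware_mpi_from_output` is hypothetical, the executable path is shortened to the placeholder `<test-exe>`, and the sketch assumes the `AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=<value>;` line format shown in the output above.

```bash
# Hypothetical helper (not a Trilinos script): read test output on stdin and
# print the TPETRA_ASSUME_CUDA_AWARE_MPI value from the first "AFTER:" line.
cuda_aware_mpi_from_output() {
  sed -n 's/^AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=\([01]\);.*/\1/p' | head -n 1
}

# Sample text modeled on the output above; '<test-exe>' is a placeholder for
# the full path to the test executable.
sample_output="BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '<test-exe>'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun  '-p' '4' '--rs_per_socket' '4' '<test-exe>'"

printf '%s\n' "$sample_output" | cuda_aware_mpi_from_output
```

If the output contains no matching `AFTER:` line, the helper prints nothing, which is a hint that the test was not launched through `trilinos_jsrun`.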

Contributor

Right; that information should be in the documentation. Trilinos developers are not accustomed to looking for that information, and it is specialized to the ATS-2 builds.

calls the `jsrun` command which modifies the input arguments to accommodate
the MPI test suite in Trilinos (see the implementation of the script
`trilinos_jsrun` for details). By default, the script `trilinos_jsrun` will
set `export TPETRA_ASSUME_CUDA_AWARE_MPI=0` if `TPETRA_ASSUME_CUDA_AWARE_MPI`
is unset in the environment. Therefore, by default, the tests are run without
CUDA-aware MPI on this system.
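
As an illustration of the default described above, the rule can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name `resolve_cuda_aware_mpi` is hypothetical, and the authoritative logic is whatever `trilinos_jsrun` itself implements.

```bash
# Hypothetical sketch of the defaulting rule; see trilinos_jsrun for the
# actual implementation.
resolve_cuda_aware_mpi() {
  case "${TPETRA_ASSUME_CUDA_AWARE_MPI:-}" in
    0|1)
      # An explicit 0 or 1 from the caller is kept as-is.
      ;;
    *)
      # Unset (or any other value) falls back to running without
      # CUDA-aware MPI.
      echo "WARNING: TPETRA_ASSUME_CUDA_AWARE_MPI is not 0 or 1, defaulting to 0" >&2
      export TPETRA_ASSUME_CUDA_AWARE_MPI=0
      ;;
  esac
}

unset TPETRA_ASSUME_CUDA_AWARE_MPI
resolve_cuda_aware_mpi
echo "TPETRA_ASSUME_CUDA_AWARE_MPI=$TPETRA_ASSUME_CUDA_AWARE_MPI"
```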

To explicitly **disable CUDA-aware MPI** when running the test suite, set the
environment variable:

```bash
$ export TPETRA_ASSUME_CUDA_AWARE_MPI=0
$ ctest -j4
```

and to explicitly **enable CUDA-aware MPI** when running the test suite set:

```bash
$ export TPETRA_ASSUME_CUDA_AWARE_MPI=1
$ ctest -j4
```

before running `ctest`.

**NOTES:**
- Do NOT do `module purge` before loading the environment. Simply start off with
a clean default environment on vortex.
2 changes: 0 additions & 2 deletions cmake/std/atdm/ats2/environment.sh
@@ -186,8 +186,6 @@ if [[ "$ATDM_CONFIG_COMPILER" == "CUDA-10.1.243_"* ]]; then
export KOKKOS_NUM_DEVICES=4

# CTEST Settings
# TPETRA_ASSUME_CUDA_AWARE_MPI is used by cmake/std/atdm/ats2/trilinos_jsrun
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
Contributor

Doesn't local-driver.sh run after environment.sh? Why is this export of TPETRA_ASSUME_CUDA_AWARE_MPI not picked up in local-driver?

Member Author

@e10harvey, the ctest-s-driver.sh script sources the load-env.sh script again and overwrites this. We need to just not touch the TPETRA_ASSUME_CUDA_AWARE_MPI var in the atdm/environment.sh script.
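
The overwrite described in this reply can be reproduced in isolation. The sketch below is a simplified model rather than the actual driver scripts: the temporary file stands in for an environment script that unconditionally exports the variable, and `inner_driver` is a hypothetical stand-in for a driver that sources the environment a second time.

```bash
# Simplified model of the overwrite; the file and function names are
# hypothetical and do not correspond to real Trilinos scripts.
env_script="$(mktemp)"

# Stand-in for an environment script that unconditionally exports the variable.
echo 'export TPETRA_ASSUME_CUDA_AWARE_MPI=0' > "$env_script"

inner_driver() {
  # Stand-in for a driver that sources the environment again before testing.
  . "$env_script"
  echo "inner driver sees TPETRA_ASSUME_CUDA_AWARE_MPI=$TPETRA_ASSUME_CUDA_AWARE_MPI"
}

# The outer driver loads the environment, then requests CUDA-aware MPI.
. "$env_script"
export TPETRA_ASSUME_CUDA_AWARE_MPI=1

# The second sourcing inside the inner driver discards the value set above.
inner_result="$(inner_driver)"
echo "$inner_result"

rm -f "$env_script"
```

In this model the inner driver reports `0` even though the caller exported `1`, which is why leaving the variable out of the environment script lets the caller's setting survive.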

# Trilinos_CTEST_RUN_CUDA_AWARE_MPI is used by cmake/ctest/driver/atdm/ats2/local-driver.sh
export Trilinos_CTEST_RUN_CUDA_AWARE_MPI=1

2 changes: 1 addition & 1 deletion cmake/std/atdm/ats2/trilinos_jsrun
@@ -72,7 +72,7 @@ function evaluate_jsrun_command {
# Check if TPETRA_ASSUME_CUDA_AWARE_MPI is set and default to 0 if unset.
if [[ "$TPETRA_ASSUME_CUDA_AWARE_MPI" != "0" ]] && [[ "$TPETRA_ASSUME_CUDA_AWARE_MPI" != "1" ]]; then
echo "WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0"
export TPETRA_ASSUME_CUDA_AWARE=0
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
fi

# Parse input arguments and modify them for jsrun
1 change: 0 additions & 1 deletion cmake/std/atdm/utils/unset_atdm_config_vars_environment.sh
@@ -9,7 +9,6 @@ unset OMP_PLACES
unset OMPI_CC
unset OMPI_CXX
unset OMPI_FC
unset TPETRA_ASSUME_CUDA_AWARE_MPI
unset Trilinos_CTEST_RUN_CUDA_AWARE_MPI
unset ATDM_CONFIG_ENABLE_SPARC_SETTINGS
unset ATDM_CONFIG_USE_NINJA