
ATDM: ATS-2 testing documentation needs environment variable info #6902

Closed
kddevin opened this issue Feb 25, 2020 · 11 comments
Assignees
Labels
  • ATDM Config - Issues that are specific to the ATDM configuration settings
  • ATDM DevOps - Issues that will be worked by the Coordinated ATDM DevOps teams
  • ATDM Env Issue - Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos
  • client: ATDM - Any issue primarily impacting the ATDM project
  • CLOSED_DUE_TO_INACTIVITY - Issue or PR has been closed by the GitHub Actions bot due to inactivity.
  • impacting: documentation - The issue is primarily about documentation vs. a problem with the build or tests
  • MARKED_FOR_CLOSURE - Issue or PR is marked for auto-closure by the GitHub Actions bot.
  • type: enhancement - Issue is an enhancement, not a bug

Comments

@kddevin
Contributor

kddevin commented Feb 25, 2020

Documentation

ATS-2 testing documentation has no info on use of environment variables such as TPETRA_ASSUME_CUDA_AWARE_MPI in testing.
https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ats-2

For users to reproduce errors seen on the dashboard, they need to know where to look in CDash for environment variable settings and how to set them the same way on the platform. The location of these settings in CDash is not obvious.

This documentation is needed because ATS-2 testing now uses a single build with different environment variable settings for multiple tests. Previously, all environment information could be gleaned from the build configuration, as currently described in the documentation link above.

@kddevin kddevin added impacting: documentation The issue is primarily about documentation vs. a problem with the build or tests ATDM Env Issue Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos PA: Framework Issues that fall under the Trilinos Framework Product Area labels Feb 25, 2020
@bartlettroscoe bartlettroscoe added ATDM Config Issues that are specific to the ATDM configuration settings ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams client: ATDM Any issue primarily impacting the ATDM project type: enhancement Issue is an enhancement, not a bug and removed PA: Framework Issues that fall under the Trilinos Framework Product Area labels Feb 25, 2020
@bartlettroscoe bartlettroscoe changed the title Framework: ATS-2 testing documentation needs environment variable info ATDM: ATS-2 testing documentation needs environment variable info Feb 25, 2020
@bartlettroscoe
Member

@kddevin, I will look into adding some documentation for this to the atdm/README.md file. However, I just noticed that TPETRA_ASSUME_CUDA_AWARE_MPI=1 may not actually be getting set in the "cuda-aware-mpi" builds on 'vortex'. Looking at the test TpetraCore_Behavior_Default_MPI_4 in the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi shown here, I see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

That is not good. We will have to debug why the driver that sets export TPETRA_ASSUME_CUDA_AWARE_MPI=1 is not propagating it down to the test level (see CDOFA-100). We likely need to debug and fix that before we update the documentation, lest we mislead the user.

Also, an FYI: the Trilinos Framework team does not currently support the ATDM Trilinos build configurations. Perhaps that will change in the future, but at this point there is no connection between the ATDM Trilinos testing efforts and the Trilinos Framework efforts. (Therefore, I removed the mention of the Trilinos Framework team, removed the PA: Framework label, and added the ATDM DevOps label.)

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 26, 2020
…trilinos#6902)

Turns out that the tests in the "cuda-aware-mpi" builds on 'vortex' were not
actually setting TPETRA_ASSUME_CUDA_AWARE_MPI=1.  This commit fixes that and
it also documents for users how to run the test suite with and without
CUDA-aware MPI in Tpetra.
@bartlettroscoe
Member

@kddevin, the updated documentation is in PR #6904 and you can see this in the rendered document:

Please review that PR and let me know if that is acceptable or not.

bartlettroscoe added a commit that referenced this issue Feb 26, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 26, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 26, 2020
trilinos-autotester added a commit that referenced this issue Feb 26, 2020
…6902-cuda-aware-mpi

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, #6902)
PR Author: bartlettroscoe
@kddevin
Contributor Author

kddevin commented Feb 26, 2020

Thanks for the PR #6904.
I would also like documentation on how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration. If one wants to reproduce a failing test, where should one look in CDash to get the value used for that test? The environment variable setting is not archived in the CMake configuration output.

@bartlettroscoe
Member

I would also like documentation on how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration. If one wants to reproduce a failing test, where should one look in CDash to get the value used for that test? The environment variable setting is not archived in the CMake configuration output.

It is printed out by trilinos_jsrun before it runs jsrun. Therefore, that information is on CDash in the detailed test output. For example, if you look at the output for the test TpetraCore_Behavior_Default_MPI_4 here, you will see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

If you compare that to the CUDA-aware run of that same test here, you see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=1; jsrun  '-E LD_PRELOAD=/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2019.06.24/lib/pami_451/libpami.so' '-M -gpu' '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

How much of that info needs to be stated in that README.md file? Can we just say:

The value of TPETRA_ASSUME_CUDA_AWARE_MPI used is printed by the trilinos_jsrun script to STDOUT before it calls jsrun and therefore is also archived locally by CTest and uploaded to CDash along with the rest of the test STDOUT.

?
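To make the mechanism concrete, here is a minimal sketch of what such a wrapper does (hypothetical; the real trilinos_jsrun script is more involved, and the function name and the JSRUN override variable below are made up for illustration). It defaults the env var, prints BEFORE/AFTER lines mirroring the ones quoted above, and then runs the command, so the value ends up in the STDOUT that CTest archives and uploads to CDash:

```shell
# Hypothetical sketch of a jsrun wrapper (NOT the real trilinos_jsrun).
# JSRUN is overridable here only so the sketch can be exercised without
# an actual jsrun on the machine.
trilinos_jsrun_sketch() {
  JSRUN="${JSRUN:-jsrun}"
  # Assumed default: 0 (do not assume CUDA-aware MPI) when the caller
  # did not set the variable.
  TPETRA_ASSUME_CUDA_AWARE_MPI="${TPETRA_ASSUME_CUDA_AWARE_MPI:-0}"
  export TPETRA_ASSUME_CUDA_AWARE_MPI
  # Print the command before and after modification so the setting is
  # archived in the test output.
  echo "BEFORE: ${JSRUN} $*"
  echo "AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=${TPETRA_ASSUME_CUDA_AWARE_MPI}; ${JSRUN} $*"
  "${JSRUN}" "$@"
}
```

A user searching the test output on CDash for the string `AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI` would then find the exact value used for that test run.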

@kddevin
Contributor Author

kddevin commented Feb 26, 2020

I like showing the examples, as you did in this GitHub issue, so that people know what they are looking for.

@bartlettroscoe
Member

I like showing the examples, as you did in this github issue, so that people know what they are looking for.

@kddevin, okay, I will post a new PR and let you review it.

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Feb 27, 2020
…s:develop' (b6d9daf).

* trilinos-develop:
  ATDM: ats2: Update documentation for TPETRA_ASSUME_CUDA_AWARE_MPI and mention trilinos_jsrun (CDOFA-100, trilinos#6902)
  Allow for default setting of TPETRA_ASSUME_CUDA_AWARE_MPI=0 in trilinos_jsrun (CDOFA-100)
  Fix typo in trilinos_jsrun default value for Tpetra cuda-aware
  MueLu: Use Amesos2 parameter list option for matrices with non-contiguous Maps.
  Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, trilinos#6902)
  Framework: Enhance failure messaging in PR test driver
  Framework: Fixing bug in PR test script
  Framework: Fixing PR test driver python script tests
@bartlettroscoe
Member

@jjellio, responding to your comment from the merged PR #6904 below (since this is the issue about documentation) ...

It could make sense to document that the different machines may report 'cuda-awareness' differently based on the node you are on.... UGH!
See,
#5179 (comment)

At the above point in time, if you configured with the rolling release module on a login node, you could get 'not-cuda-aware', and if you configured with a specific dated version OR on a compute node, it would report 'cuda-aware'.

This motivated (in part) why testing is orchestrated the way it is. Now, with ATS2 testing, we are ensuring that regardless of what CMake discovered we will attempt to exercise both code paths.

I'm still not sure that downstream users will want to use trilinos_jsrun, but I suppose we could help them use ATS2 easier by adding 'cuda-aware' to the load_env.sh stuff. E.g., load_env.sh gnu-volta100-release-debug-cuda-aware, and this would set Tpetra's cuda aware variable.

I can see some advantages to having the build name keyword cuda-aware-mpi trigger setting the env var TPETRA_ASSUME_CUDA_AWARE_MPI=1 (and having it be TPETRA_ASSUME_CUDA_AWARE_MPI=0 by default). That would solve @kddevin's documentation issue since this would be part of the build-name keywords. @kddevin, would you be happy with that? However, this would change the behavior of some of our existing CUDA builds that may already be running with CUDA-aware MPI, in that it would make them run without CUDA-aware MPI unless we explicitly added the keyword cuda-aware-mpi to the build name. So is that what we would want to do?

I point out the prior paragraph because the wrapper depends on the ENV, and I don't think many users are aware that Tpetra's cuda-awareness is a runtime option. (They are accustomed to CMake setting a default and just using whatever.)

Right. I could not find any documentation for the env var TPETRA_ASSUME_CUDA_AWARE_MPI at all. It would be good if someone could add it to the Tpetra Doxygen documentation:

The only documentation for this env var should not be in the ATDM Trilinos configuration system, scoped just to ATS-2. That is not good.

The danger if using the load_env approach, is that if they use jsrun directly, they would need to ensure that -M -gpu is passed.

It may also be worth documenting that the wrapper does something special for NP=1. Specifically, it disables Spectrum's Cuda hooks, because single process executables may not call MPI_Init

Is there any disadvantage to turning off the CUDA hooks for a single-rank MPI job? I can't think of any for well-behaved code. (Software running on one MPI rank should not be calling any MPI communication routines.)

A goal in how I wrote the wrapper was that it always shows the user a single command line they could copy/paste (BEFORE/AFTER). It would make sense to drive that home: if a developer wants to replicate the wrapper, they can copy/paste one of those lines and they should get similar behavior.****

**** CAVEAT: With the LD_PRELOAD stuff still in there, it is possible they could try to reproduce running a unit test that allocates before MPI_Init, and it would crash with missing PAMI symbols, whereas the wrapper is currently preloading the PAMI shared library, which means they will see different behavior... (so we need to nuke that LD_PRELOAD stuff if developers want to use this)
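The NP=1 special case mentioned above could be sketched roughly like this (hypothetical; the function name is made up, and the `--smpiargs="-disable_gpu_hooks"` flag is an assumption based on Spectrum MPI's jsrun options for single-rank executables that may never call MPI_Init; the real wrapper's logic may differ):

```shell
# Sketch: choose jsrun arguments based on the number of ranks.
# For NP=1, disable Spectrum MPI's CUDA hooks (assumed flag), since a
# single-process executable may never call MPI_Init.
jsrun_args_for_np() {
  np="$1"
  if [ "$np" -eq 1 ]; then
    echo "--smpiargs=-disable_gpu_hooks -p 1"
  else
    echo "-p $np"
  fi
}
```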

@jjellio, would you be willing to try to post a new PR to extend the documentation in the section:

and negotiate with @kddevin on what she would consider to be adequate documentation for this system?

@bartlettroscoe
Member

I can see some advantages to having the build name keyword cuda-aware-mpi trigger setting the env var TPETRA_ASSUME_CUDA_AWARE_MPI=1 (and having it be TPETRA_ASSUME_CUDA_AWARE_MPI=0 by default). That would solve @kddevin's documentation issue since this would be part of the build-name keywords. However, this would change the behavior of some of our existing CUDA builds that may already be running with CUDA-aware MPI, in that it would make them run without CUDA-aware MPI unless we explicitly added the keyword cuda-aware-mpi to the build name. So is that what we would want to do?

@jjellio and @kddevin , it occurred to me that we could define the ATDM Trilinos build name keywords no-cuda-aware-mpi and cuda-aware-mpi that would set export TPETRA_ASSUME_CUDA_AWARE_MPI=0 and export TPETRA_ASSUME_CUDA_AWARE_MPI=1, respectively, but if neither of those keywords existed in the build name, then the Tpetra configure-time default would be used. That way, we would not change existing behavior but we could clearly document when we made a CUDA build use or not use CUDA-aware MPI. Then we would have the 'ats2' builds on 'vortex' called:

  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_no-cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_no-cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi

That could not be more clear. It would also simplify the logic in our ctest -S driver scripts as well and in our instructions on the README.md page.
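A rough sketch of the proposed keyword logic (illustrative only; the function name is made up and this is not the actual atdm/ env-loading code): 'no-cuda-aware-mpi' forces 0, 'cuda-aware-mpi' forces 1, and neither keyword leaves the variable unset so the Tpetra configure-time default wins. Note that the 'no-cuda-aware-mpi' pattern must be matched first, since any build name containing it also contains 'cuda-aware-mpi':

```shell
# Hypothetical sketch of the proposed build-name keyword handling.
set_cuda_aware_mpi_from_build_name() {
  build_name="$1"
  case "$build_name" in
    # Check the 'no-' variant first: it is a superstring match hazard.
    *no-cuda-aware-mpi*) export TPETRA_ASSUME_CUDA_AWARE_MPI=0 ;;
    *cuda-aware-mpi*)    export TPETRA_ASSUME_CUDA_AWARE_MPI=1 ;;
    # No keyword: leave unset so Tpetra's configure-time default is used.
    *)                   unset TPETRA_ASSUME_CUDA_AWARE_MPI ;;
  esac
}
```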

@kddevin, I think that would solve the CUDA-aware MPI documentation problem, no?

@kddevin
Contributor Author

kddevin commented Feb 27, 2020

Sure, the naming convention plus the BEFORE/AFTER examples you have above would be great. Thanks.

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Dec 11, 2021
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jan 12, 2022