
ATDM: ATS-2 testing documentation needs environment variable info #6902

Closed
kddevin opened this issue Feb 25, 2020 · 11 comments
Assignees
Labels
  • ATDM Config - Issues that are specific to the ATDM configuration settings
  • ATDM DevOps - Issues that will be worked by the Coordinated ATDM DevOps teams
  • ATDM Env Issue - Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos
  • client: ATDM - Any issue primarily impacting the ATDM project
  • CLOSED_DUE_TO_INACTIVITY - Issue or PR has been closed by the GitHub Actions bot due to inactivity.
  • impacting: documentation - The issue is primarily about documentation vs. a problem with the build or tests
  • MARKED_FOR_CLOSURE - Issue or PR is marked for auto-closure by the GitHub Actions bot.
  • type: enhancement - Issue is an enhancement, not a bug

Comments

@kddevin
Contributor

kddevin commented Feb 25, 2020

Documentation

ATS-2 testing documentation has no info on use of environment variables such as TPETRA_ASSUME_CUDA_AWARE_MPI in testing.
https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#ats-2

For users to reproduce errors seen on the dashboard, they need to know where to look in CDash for environment variable settings and how to set them the same way on the platform. The location of these settings in CDash is not obvious.

This documentation is needed because ATS-2 testing now uses a single build with different environment variable settings for multiple tests. Previously, all environment information could be gleaned from the build configuration, as currently described in the documentation link above.

@kddevin kddevin added impacting: documentation The issue is primarily about documentation vs. a problem with the build or tests ATDM Env Issue Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos PA: Framework Issues that fall under the Trilinos Framework Product Area labels Feb 25, 2020
@bartlettroscoe bartlettroscoe added ATDM Config Issues that are specific to the ATDM configuration settings ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams client: ATDM Any issue primarily impacting the ATDM project type: enhancement Issue is an enhancement, not a bug and removed PA: Framework Issues that fall under the Trilinos Framework Product Area labels Feb 25, 2020
@bartlettroscoe bartlettroscoe changed the title Framework: ATS-2 testing documentation needs environment variable info ATDM: ATS-2 testing documentation needs environment variable info Feb 25, 2020
@bartlettroscoe
Member

@kddevin, I will look into adding some documentation for this to the atdm/README.md file. However, I just noticed that TPETRA_ASSUME_CUDA_AWARE_MPI=1 may not actually be getting set in the "cuda-aware-mpi" builds on 'vortex'. Looking at the test TpetraCore_Behavior_Default_MPI_4 in the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi shown here, I see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

That is not good. We will have to debug why the driver that sets export TPETRA_ASSUME_CUDA_AWARE_MPI=1 is not propagating it down to the test level (see CDOFA-100). We likely need to debug and fix that before we update the documentation, lest we mislead the user.

Also, an FYI: the Trilinos Framework team does not currently support the ATDM Trilinos build configurations. Perhaps that will change in the future, but at this point there is no connection between the ATDM Trilinos testing efforts and the Trilinos Framework efforts. (Therefore, I removed the mention of the Trilinos Framework team, removed the PA: Framework label, and added the ATDM DevOps label.)

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 26, 2020
…trilinos#6902)

Turns out that the tests in the "cuda-aware-mpi" builds on 'vortex' were not
actually setting TPETRA_ASSUME_CUDA_AWARE_MPI=1.  This commit fixes that and
it also documents for users how to run the test suite with and without
CUDA-aware MPI in Tpetra.
@bartlettroscoe
Member

@kddevin, the updated documentation is in PR #6904 and you can see this in the rendered document:

Please review that PR and let me know if that is acceptable or not.

bartlettroscoe added a commit that referenced this issue Feb 26, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 26, 2020
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 26, 2020
trilinos-autotester added a commit that referenced this issue Feb 26, 2020
…6902-cuda-aware-mpi

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, #6902)
PR Author: bartlettroscoe
@kddevin
Contributor Author

kddevin commented Feb 26, 2020

Thanks for the PR #6904.
I would also like documentation on how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration. If one wants to reproduce a failing test, where should one look in CDash to get the value used for that test? The environment variable setting is not archived in the CMake configuration output.

@bartlettroscoe
Member

I would also like documentation on how one would check which value of TPETRA_ASSUME_CUDA_AWARE_MPI was used in a particular test configuration. If one wants to reproduce a failing test, where should one look in CDash to get the value used for that test? The environment variable setting is not archived in the CMake configuration output.

It is printed out by trilinos_jsrun before it runs jsrun. Therefore, that information is on CDash in the detailed test output. For example, if you look at the output for the test TpetraCore_Behavior_Default_MPI_4 here, you will see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

If you compare that to the CUDA-aware run of that same test here, you see:

BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=1; jsrun  '-E LD_PRELOAD=/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2019.06.24/lib/pami_451/libpami.so' '-M -gpu' '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/tpetra/core/test/Behavior/TpetraCore_Behavior_Default.exe'

How much of that info needs to be stated in that README.md file? Can we just say:

The value of TPETRA_ASSUME_CUDA_AWARE_MPI used is printed by the trilinos_jsrun script to STDOUT before it calls jsrun and therefore is also archived locally by CTest and uploaded to CDash along with the rest of the test STDOUT.

?
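To make the mechanism concrete, here is a minimal sketch of what such a wrapper does (hypothetical; the real trilinos_jsrun script is more involved, and the function name and the JSRUN override variable below are made up for illustration). It defaults the env var, prints BEFORE/AFTER lines mirroring the ones quoted above, and then runs the command, so the value ends up in the STDOUT that CTest archives and uploads to CDash:

```shell
# Hypothetical sketch of a jsrun wrapper (NOT the real trilinos_jsrun).
# JSRUN is overridable here only so the sketch can be exercised without
# an actual jsrun on the machine.
trilinos_jsrun_sketch() {
  JSRUN="${JSRUN:-jsrun}"
  # Assumed default: 0 (do not assume CUDA-aware MPI) when the caller
  # did not set the variable.
  TPETRA_ASSUME_CUDA_AWARE_MPI="${TPETRA_ASSUME_CUDA_AWARE_MPI:-0}"
  export TPETRA_ASSUME_CUDA_AWARE_MPI
  # Print the command before and after modification so the setting is
  # archived in the test output.
  echo "BEFORE: ${JSRUN} $*"
  echo "AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=${TPETRA_ASSUME_CUDA_AWARE_MPI}; ${JSRUN} $*"
  "${JSRUN}" "$@"
}
```

A user searching the test output on CDash for the string `AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI` would then find the exact value used for that test run.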

@kddevin
Contributor Author

kddevin commented Feb 26, 2020

I like showing the examples, as you did in this GitHub issue, so that people know what they are looking for.

@bartlettroscoe
Member

I like showing the examples, as you did in this github issue, so that people know what they are looking for.

@kddevin, okay, I will post a new PR and let you review it.

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Feb 27, 2020
…s:develop' (b6d9daf).

* trilinos-develop:
  ATDM: ats2: Update documentation for TPETRA_ASSUME_CUDA_AWARE_MPI and mention trilinos_jsrun (CDOFA-100, trilinos#6902)
  Allow for default setting of TPETRA_ASSUME_CUDA_AWARE_MPI=0 in trilinos_jsrun (CDOFA-100)
  Fix typo in trilinos_jsrun default value for Tpetra cuda-aware
  MueLu: Use Amesos2 parameter list option for matrices with non-contiguous Maps.
  Fix up and document handling of CUDA-aware MPI with Tpetra (CDOFA-100, trilinos#6902)
  Framework: Enhance failure messaging in PR test driver
  Framework: Fixing bug in PR test script
  Framework: Fixing PR test driver python script tests
@bartlettroscoe
Member

@jjellio, responding to your comment from the merged PR #6904 below (since this is the issue about documentation) ...

It could make sense to document that the different machines may report 'cuda-awareness' differently based on the node you are on.... UGH!
See,
#5179 (comment)

At the above point in time, if you configured with the rolling release module on a login node, you could get 'not-cuda-aware', and if you configured with a specific dated version OR on a compute node, it would report 'cuda-aware'.

This motivated (in part) why testing is orchestrated the way it is. Now, with ATS2 testing, we are ensuring that regardless of what CMake discovered we will attempt to exercise both code paths.

I'm still not sure that downstream users will want to use trilinos_jsrun, but I suppose we could help them use ATS2 easier by adding 'cuda-aware' to the load_env.sh stuff. E.g., load_env.sh gnu-volta100-release-debug-cuda-aware, and this would set Tpetra's cuda aware variable.

I can see some advantages to having the build name keyword cuda-aware-mpi trigger setting the env var TPETRA_ASSUME_CUDA_AWARE_MPI=1 (and having it be TPETRA_ASSUME_CUDA_AWARE_MPI=0 by default). That would solve @kddevin's documentation issue since this would be part of the build-name keywords. @kddevin, would you be happy with that? However, this would change the behavior of some of our existing CUDA builds that may already be running with CUDA-aware MPI, in that it would make them run without CUDA-aware MPI unless we explicitly added the keyword cuda-aware-mpi to the build name. So is that what we would want to do?

I point out the prior paragraph because the wrapper depends on the ENV, and I don't think many users are aware that Tpetra's cuda-awareness is a runtime option. (They are accustomed to CMake setting a default and just using whatever.)

Right. I could not find any documentation for the env var TPETRA_ASSUME_CUDA_AWARE_MPI at all. It would be good if someone could add it to the Tpetra Doxygen documentation:

The only documentation for this env var should not be in the ATDM Trilinos configuration system, scoped just to ATS-2. That is not good.

The danger if using the load_env approach, is that if they use jsrun directly, they would need to ensure that -M -gpu is passed.

It may also be worth documenting that the wrapper does something special for NP=1. Specifically, it disables Spectrum's Cuda hooks, because single process executables may not call MPI_Init

Is there any disadvantage to turning off the CUDA hooks for a single-rank MPI job? I can't think of any for well-behaved code. (Software running on one MPI rank should not be calling any MPI communication routines.)

A goal in how I wrote the wrapper was that it always shows the user a single command line they could copy/paste (BEFORE/AFTER). It would make sense to drive that home: if a developer wants to replicate the wrapper, they can copy/paste one of those lines and they should get similar behavior.****

**** CAVEAT: With the LD_PRELOAD stuff still in there, it is possible they could try to reproduce running a unit test that allocates before MPI_Init, and it would crash with missing PAMI symbols, whereas the wrapper is currently preloading the PAMI shared library, which means they will see different behavior... (so we need to nuke that LD_PRELOAD stuff if developers want to use this)
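The NP=1 special case mentioned above could be sketched roughly like this (hypothetical; the function name is made up, and the `--smpiargs="-disable_gpu_hooks"` flag is an assumption based on Spectrum MPI's jsrun options for single-rank executables that may never call MPI_Init; the real wrapper's logic may differ):

```shell
# Sketch: choose jsrun arguments based on the number of ranks.
# For NP=1, disable Spectrum MPI's CUDA hooks (assumed flag), since a
# single-process executable may never call MPI_Init.
jsrun_args_for_np() {
  np="$1"
  if [ "$np" -eq 1 ]; then
    echo "--smpiargs=-disable_gpu_hooks -p 1"
  else
    echo "-p $np"
  fi
}
```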

@jjellio, would you be willing to try to post a new PR to extend the documentation in the section:

and negotiate with @kddevin on what she would consider to be adequate documentation for this system?

@bartlettroscoe
Member

I can see some advantages to having the build name keyword cuda-aware-mpi trigger setting the env var TPETRA_ASSUME_CUDA_AWARE_MPI=1 (and having it be TPETRA_ASSUME_CUDA_AWARE_MPI=0 by default). That would solve @kddevin's documentation issue since this would be part of the build-name keywords. However, this would change the behavior of some of our existing CUDA builds that may already be running with CUDA-aware MPI, in that it would make them run without CUDA-aware MPI unless we explicitly added the keyword cuda-aware-mpi to the build name. So is that what we would want to do?

@jjellio and @kddevin , it occurred to me that we could define the ATDM Trilinos build name keywords no-cuda-aware-mpi and cuda-aware-mpi that would set export TPETRA_ASSUME_CUDA_AWARE_MPI=0 and export TPETRA_ASSUME_CUDA_AWARE_MPI=1, respectively, but if neither of those keywords existed in the build name, then the Tpetra configure-time default would be used. That way, we would not change existing behavior but we could clearly document when we made a CUDA build use or not use CUDA-aware MPI. Then we would have the 'ats2' builds on 'vortex' called:

  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_no-cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_no-cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi

That could not be more clear. It would also simplify the logic in our ctest -S driver scripts as well and in our instructions on the README.md page.
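A rough sketch of the proposed keyword logic (illustrative only; the function name is made up and this is not the actual atdm/ env-loading code): 'no-cuda-aware-mpi' forces 0, 'cuda-aware-mpi' forces 1, and neither keyword leaves the variable unset so the Tpetra configure-time default wins. Note that the 'no-cuda-aware-mpi' pattern must be matched first, since any build name containing it also contains 'cuda-aware-mpi':

```shell
# Hypothetical sketch of the proposed build-name keyword handling.
set_cuda_aware_mpi_from_build_name() {
  build_name="$1"
  case "$build_name" in
    # Check the 'no-' variant first: it is a superstring match hazard.
    *no-cuda-aware-mpi*) export TPETRA_ASSUME_CUDA_AWARE_MPI=0 ;;
    *cuda-aware-mpi*)    export TPETRA_ASSUME_CUDA_AWARE_MPI=1 ;;
    # No keyword: leave unset so Tpetra's configure-time default is used.
    *)                   unset TPETRA_ASSUME_CUDA_AWARE_MPI ;;
  esac
}
```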

@kddevin, I think that would solve the CUDA-aware MPI documentation problem, no?

@kddevin
Contributor Author

kddevin commented Feb 27, 2020

Sure, the naming convention plus the BEFORE/AFTER examples you have above would be great. Thanks.

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Dec 11, 2021
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jan 12, 2022