ATDM: Address several 'ats2' issues (#7402, #7406, #7122, #2422) #7427
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Pull Request Author: bartlettroscoe
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED
Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.
Pull Request Auto Testing has FAILED
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Console Output (last 100 lines): Trilinos_pullrequest_gcc_4.8.4 # 6809, Trilinos_pullrequest_intel_17.0.1 # 6621, Trilinos_pullrequest_gcc_4.9.3_SERIAL # 5046, Trilinos_pullrequest_gcc_7.2.0 # 4893, Trilinos_pullrequest_gcc_8.3.0 # 1083, Trilinos_pullrequest_cuda_9.2 # 4382, Trilinos_pullrequest_clang_9.0.0 # 757, Trilinos_pullrequest_python_2 # 2517, Trilinos_pullrequest_python_3 # 2528
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Pull Request Author: bartlettroscoe
… atdv-351-atdm-ats2-refactor (trilinos#7406)
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED
Pull Request Auto Testing has PASSED
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Status Flag 'Pre-Merge Inspection' - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...
The convention is to make all letters upper-case in ATDM_CONFIG_COMPILER.
* Added a new function get_sparc_dev_module_name to get the name of the sparc-dev module from the ATDM_CONFIG_COMPILER argument. (This will make it easier to update envs and add new envs.)
* Remove any explicit mention of compiler or CUDA version at all from the ats2/environment.sh file. (There is just generic logic for "GNU" and "CUDA".)
* Error out right away if user tries to select xl compiler. (Why wait?)
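As a rough illustration of the commit message above, the keyword-based lookup might look something like the following sketch. Only the function name get_sparc_dev_module_name and the "GNU"/"CUDA"/xl handling come from the commit message; the module names returned here are placeholders and are not the real sparc-dev module names on 'vortex'.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the module names below are placeholders, NOT the real
# sparc-dev module names. Only the function name comes from the commit message.
get_sparc_dev_module_name() {
  # Upper-case the argument, following the ATDM_CONFIG_COMPILER convention.
  local compiler
  compiler=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$compiler" in
    XL*)
      # Error out right away for the xl compiler.
      echo "ERROR: the xl compiler is not supported" >&2
      return 1
      ;;
    CUDA*)
      echo "sparc-dev/example-cuda"   # placeholder module name
      ;;
    GNU*)
      echo "sparc-dev/example-gnu"    # placeholder module name
      ;;
    *)
      echo "ERROR: unrecognized compiler '$1'" >&2
      return 1
      ;;
  esac
}
```

Because the match is on generic keywords only, adding a new compiler or CUDA version would not require touching this logic, which seems to be the point of the refactoring.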
See comment in file.
I added this print of the jsrun return code to see if we can create a CDash queryTests.php filter to filter out the 255 failure (see trilinos#7211)
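A minimal sketch of the idea in this commit message, assuming a generic wrapper rather than the actual 'trilinos_jsrun' script: run a command, capture its output in a file, and then print the return code in a fixed format that a CDash filter could match. The helper name run_and_report and the "output file:" lines are hypothetical; only the "jsrun return value: <code>" string appears in the notes for this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real trilinos_jsrun): runs any command, captures
# its output in a temp file, then reports whether the file is empty and what
# the return code was.
run_and_report() {
  local out_file rc=0
  out_file=$(mktemp)
  # Capture stdout and stderr; remember the return code without aborting.
  "$@" > "$out_file" 2>&1 || rc=$?
  cat "$out_file"
  if [ -s "$out_file" ]; then
    echo "output file: non-empty"
  else
    echo "output file: empty"
  fi
  # Fixed-format line that a test-output filter can search for.
  echo "jsrun return value: $rc"
  rm -f "$out_file"
  return "$rc"
}
```

For example, `run_and_report sh -c 'exit 255'` prints an "output file: empty" line followed by "jsrun return value: 255", which is the signature of the failures discussed in #7122.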
0c05b60 to 0037354
@e10harvey, I cleaned up the commits and I pushed a commit that I had forgotten to push. This should now be ready to merge.
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Pull Request Author: bartlettroscoe
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED
Pull Request Auto Testing has PASSED
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Status Flag 'Pre-Merge Inspection' - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...
@e10harvey, I approved the PRs #7402 and #7406 and noted that GitHub will show them as merged automatically when this PR gets merged to 'develop'. I also manually merged this branch to 'atdm-nightly-manual-updates' in commit b62dc12 so we will see this run in automated testing in the 'ats2' builds tomorrow on 'vortex'. Please review at your leisure. Any issues you find will need to be addressed with new commits (not rebasing or amending any existing commits) since this branch has already been merged to the 'atdm-nightly-manual-updates' branch.
Starting manually now. The driver script names changed.
@e10harvey, thanks for catching that. I forgot about that detail. (I was working on 'ats1' so much with the new single driver that does not use individual Jenkins jobs for each build.) When I can find some time later this FY, I will write up a new CMake/CTest-based driver that will run the builds so we can have just a single Jenkins (or cron or GitLab CI) driver on each system.
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...
👍
Note that we are still seeing mass jsrun failures with #7406.
@bartlettroscoe: Is it possible to get the ctest test name in trilinos_jsrun for https://github.com/trilinos/Trilinos/pull/7427/files#diff-458dd950017f41243302957e239371fcR58? The ctest test name does not exist in the env.
# Purge then load StdEnv to get back to a fresh env in case previous other
# modules were loaded.
module purge --silent
module load StdEnv
This could be removed, StdEnv is sticky.
Actually not. Running 'module purge --silent' blows past sticky and removes all the modules! Give it a try for yourself and see if you get the same behavior.
Why do you say that? What is the evidence for that? Yes, we do still see some mass test failures, but they are due to the "RM connection failures" described in #6861 and the related empty output with return code 255 failures described in #7122. They are not due to problems with restart related to #7406. The only 'ats2' build that showed these mass failures was:
which showed:
All of the 2154 failing tests for that build are shown in this query. If you filter out the known randomly failing tests (those matching the regex) in this query, we get 10 failing tests, which show:
Those tests are consistently failing in many builds and I will be creating Trilinos GitHub issues for them soon. (As soon as I can write a script to automate the generation of Trilinos GitHub issues.) Does this make sense?
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Pull Request Author: bartlettroscoe
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED
Pull Request Auto Testing has PASSED
Build Information, Test Names: Trilinos_pullrequest_gcc_4.8.4, Trilinos_pullrequest_intel_17.0.1, Trilinos_pullrequest_gcc_4.9.3_SERIAL, Trilinos_pullrequest_gcc_7.2.0, Trilinos_pullrequest_gcc_8.3.0, Trilinos_pullrequest_cuda_9.2, Trilinos_pullrequest_clang_9.0.0, Trilinos_pullrequest_python_2, Trilinos_pullrequest_python_3
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ e10harvey ]!
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged
Merge on Pull Request# 7427: IS A SUCCESS - Pull Request successfully merged
This PR brings in the contributions for the existing 'ats2'-related PRs #7402 and #7406.
I then added a commit to address: #7122 and added a commit to switch over to the new CTest GPU allocation method first merged in PR #6840, motivated in #2422 and described in more detail in:
and I did some manual testing with that. (The CTest GPU allocation method works fantastically on 'ats2' 'vortex', which is not what James E. predicted.)
I then ran a bunch of partial builds submitting to CDash and analyzed the data (I am just waiting for the last CUDA 'dbg' build to finish but that is all). This is all looking really good! (Or as good as it can look for ATS-2 given the mess that jsrun is.)
See full details in ATDV-351 (and the detailed notes file linked to from there). I have also attached some detailed notes on testing below.
The list of tasks completed to create the PR is also given below.
Tasks list
Create new branch 'rab-github/atdv-351-atdm-ats2-refactor' from branch 'e10harvey/atdv-351-atdm-ats2' [Done]
Add unit tests for ats2/custom_builds.sh [Done]
Refactor ats2/custom_builds.sh to use keyword matching [Done]
Upper case 'rolling' in ATDM_CONFIG_COMPILER [Done]
Refactor ats2/environment.sh to just load one sparc-dev module [Done]
Merge in branch 'e10harvey/swat_trilinos_jsrun_bug' to try out 'trilinos_jsrun' fix [Done]
Enable and run the Kokkos, Teuchos, and Tpetra test suites for cuda-opt build [Done]
Run test cuda-opt build with Kokkos,Teuchos,Tpetra with ctest-s-local-test-driver.sh [Done]
Run test cuda-opt build with all packages with ctest-s-local-test-driver.sh [Done]
Experiment with running the TpetraCore_gemm test suite with different levels of 'ctest -j' ... Terrible speedup [Done]
Try out patched CMake 3.17.2 and ctest GPU allocation for 'ats2' on 'vortex':
Test out 'module purge --silent; module load StdEnv' and see if all configures and builds pass for Teuchos, Kokkos, and Tpetra [Done]
Add code to 'trilinos_jsrun' to print the 'jsrun' return code and if the *.out file contains any output [Done]
Run all supported builds with Kokkos,Teuchos,Tpetra with ctest-s-local-test-driver.sh [Done]
Run all supported builds for all packages with ctest-s-local-test-driver.sh ... Waiting for the last -exp builds to finish ...
Look over recent commits to ATDM/ats2: 2019.06.24->rolling (ATDV-351) #7402 and see how to incorporate into updated branch 'rab-github/atdv-351-atdm-ats2-refactor' ... IN PROGRESS ...
Development and testing details
(5/22/2020)
Testing on 'vortex':
The test results were:
Now to run with Tpetra as well:
That submitted to:
Just one failing test:
and it failed in a way we have seen before in other 'ats2' builds on 'vortex' as recently as today, as shown in this query:
Now let's run a full Trilinos build:
That posted to:
and showed results:
The only failing tests were Adelus tests and the test TrilinosATDMConfigTests_ats2_custom_builds_unit_tests which I have since fixed.
Now to run some experiments with the TpetraCore_gemm tests:
Wow, that is terrible anti-speedup.
(5/24/2020)
Installing a patched version of CMake 3.17.2 on 'vortex' as the 'rabartl' account:
Now to update the 'ats2' env to use this new version of CMake and set up the hooks to use the new CTest GPU allocation method.
Now to run Kokkos, Teuchos, and Tpetra test suites again:
That posted to:
The two failing tests shown here were:
Those are the same KokkosCore tests that I have been seeing failing on 'waterman' and 'ride' since switching to the ctest GPU allocation approach.
Now to test running the TpetraCore_gemm_ tests again with different {{ctest -j}} levels:
Okay, so going from {{ctest -j4}} to {{ctest -j8}} is no different for these tests because there are just 4 of them. But it does prove that the ctest GPU allocation algorithm allocates GPUs breadth-first. Let's try running the entire Kokkos, Teuchos, and Tpetra test suite and see what happens (starting on an allocation on 'vortex59'):
The most expensive tests were:
So we see at least one of the TpetraCore_gemm test times falling off a cliff:
But note that the total test time of 6m40.661s is almost half of what it was when using {{ctest -j4}} shown [here|https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos&parentid=5477075] which took 11m 33s. But the decrease in wall-clock time is likely not worth the risk of additional timeouts.
But this shows that 'jsrun' was NOT doing the right thing before with multiple jsrun commands on the same node. It was NOT spreading out the work to the various GPUs. So we are going to switch over to use the updated GPU allocation settings.
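For context, CTest's resource allocation feature passes each test its assigned resources through environment variables such as CTEST_RESOURCE_GROUP_0_GPUS, whose value has the form "id:<id>,slots:<n>" (with multiple entries separated by semicolons). The following is only a sketch of how a wrapper script could read the assigned GPU id; it assumes the resource type is named "gpus" in the resource spec file, and the helper name ctest_assigned_gpu_id is hypothetical. How the ATDM 'ats2' scripts actually consume these variables is not shown in these notes.

```shell
#!/usr/bin/env bash
# Sketch: extract the GPU id that CTest assigned to the first resource group.
# Assumes the resource type is called "gpus", so CTest sets
# CTEST_RESOURCE_GROUP_0_GPUS in the form "id:<id>,slots:<n>".
# The function name is hypothetical.
ctest_assigned_gpu_id() {
  local spec="${CTEST_RESOURCE_GROUP_0_GPUS:-}"
  # Fail if the test is not running under CTest resource allocation.
  [ -n "$spec" ] || return 1
  local first="${spec%%;*}"      # keep only the first "id:..,slots:.." entry
  local id_part="${first%%,*}"   # "id:<id>"
  echo "${id_part#id:}"          # strip the "id:" prefix
}
```

A wrapper could then, for instance, pass the returned id on to the launcher or export it as CUDA_VISIBLE_DEVICES so that concurrent tests land on different GPUs; that usage is an assumption for illustration, not something taken from this PR.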
Now to try out adding:
in the commit:
and tested with:
That posted to CDash:
So that seemed to work.
(5/25/2020)
Now I need to add the jsrun return code to the output and see how that goes ... Done in commit:
Testing this on 'vortex':
This posted to:
and shows the same two tests that fail KokkosCore_UnitTest_CudaInterOpInit_MPI_1 and KokkosCore_UnitTest_CudaInterOpStreams_MPI_1.
One of the failed tests shown [here|https://testing-dev.sandia.gov/cdash/testDetails.php?test=5169&build=5479794] shows:
Now to run all of the builds for just Kokkos,Teuchos,Tpetra:
That submitted to CDash:
What is interesting is that one of the builds, Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt-exp, showed 313 mass test failures. If we look at the filtered failing tests in those builds shown here, we see 6 total failing tests over those 6 builds:
We can see that two of these tests for the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi-exp showed "jsrun return value: 255" as shown in this query, showing two tests:
Adding "jsrun return value: 255" to the filter in:
we filter out those two failing tests and just see the 4 failing tests:
So it looks like this is going to work perfectly.
Now, starting the run of the full set of builds:
(5/26/2020)
As of 11:45 AM EDT, only the following builds have completed or are still running:
These are submitting to CDash shown at:
What is interesting is that we are seeing mass test failures in the builds:
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi-exp
If we look at the filtered failing tests for this new round of builds this time at:
we see 10 total filtered failing tests:
Some of these we expect, some we don't. If we filter for "jsrun return value: 255" failures in the query:
we see 5 matching tests:
If we add "jsrun return value: 255" to the full filter:
we see the 5 failing tests:
Now those are the tests I expect to see failing.
I will let the rest of the builds finish running. Otherwise, I think this branch is ready to merge and get testing.
But first, I want to look over Evan's recent patches to the PR:
They just made some small refactorings that are already handled in my updated branch:
How was this tested?
See above details and more details in ATDV-351 (and the detailed notes file linked to from there).