Change TriBITS/Trilinos TPLs to use find_package(CUDAToolkit) to fix builds with downstream customers using find_package(CUDAToolkit) (#10954) #11093
NOTE: This requires the matching change to the TriBITS file tribits/core/std_tpls/FindTPLCUDA.cmake.
Origin repo remote tracking branch: 'github/master'
Origin repo remote repo URL: 'github = [email protected]:TriBITSPub/TriBITS.git'
Git describe: Vera4.0-RC1-start-1292-g6d3bb5b3
At commit:
commit 23dc20b901ab55943b71e51f5e64a244ad186b5a
Author: Roscoe A. Bartlett <[email protected]>
Date: Thu Sep 29 11:46:47 2022 -0600
Summary: Change to use find_package(CUDAToolkit) (trilinos#10954)
FYI: I will post my detailed testing results above shortly ... Update: I posted the detailed testing results with complete reproduction info above.
@rppawlo, can you please re-approve? I accidentally had the target as 'master' instead of 'develop'. I fixed that and now it requires re-approval.
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Test names: Trilinos_PR_gcc-8.3.0, Trilinos_PR_gcc-7.2.0-serial, Trilinos_PR_gcc-7.2.0-debug, Trilinos_PR_intel-17.0.1, Trilinos_PR_clang-10.0.0, Trilinos_PR_cuda-10.1.243, Trilinos_PR_python3, Trilinos_PR_cuda-11.4.2-uvm-off
Pull Request Author: bartlettroscoe
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED
Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.
Pull Request Auto Testing has FAILED
Test names: Trilinos_PR_gcc-8.3.0, Trilinos_PR_gcc-7.2.0-serial, Trilinos_PR_gcc-7.2.0-debug, Trilinos_PR_intel-17.0.1, Trilinos_PR_clang-10.0.0, Trilinos_PR_cuda-10.1.243, Trilinos_PR_python3, Trilinos_PR_cuda-11.4.2-uvm-off
…to 11002-enable-disable-tests-examples (trilinos#11093)
I have to merge branch '10954-nalu-wind-fix-cuda' into this branch which has the updated snapshot of TriBITS that contains the updated FindTPLCUDA.cmake module but does not have the matching changes to FindTPLCUBLAS.cmake and FindTPLCUSPARSE.cmake from PR trilinos#11093 to make it work.
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Test names: Trilinos_PR_gcc-8.3.0, Trilinos_PR_gcc-7.2.0-serial, Trilinos_PR_gcc-7.2.0-debug, Trilinos_PR_intel-17.0.1, Trilinos_PR_clang-10.0.0, Trilinos_PR_cuda-10.1.243, Trilinos_PR_python3, Trilinos_PR_cuda-11.4.2-uvm-off
Pull Request Author: bartlettroscoe
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED
Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.
Pull Request Auto Testing has FAILED
Test names: Trilinos_PR_gcc-8.3.0, Trilinos_PR_gcc-7.2.0-serial, Trilinos_PR_gcc-7.2.0-debug, Trilinos_PR_intel-17.0.1, Trilinos_PR_clang-10.0.0, Trilinos_PR_cuda-10.1.243, Trilinos_PR_python3, Trilinos_PR_cuda-11.4.2-uvm-off
Wow, incredible, the last PR testing iteration above failed again due to mass random failures (as shown here), which were caused by the known system errors on ATS-2 'vortex':
as shown in these queries:
I will put |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING
Test names: Trilinos_PR_gcc-8.3.0, Trilinos_PR_gcc-7.2.0-serial, Trilinos_PR_gcc-7.2.0-debug, Trilinos_PR_intel-17.0.1, Trilinos_PR_clang-10.0.0, Trilinos_PR_cuda-10.1.243, Trilinos_PR_python3, Trilinos_PR_cuda-11.4.2-uvm-off
Pull Request Author: bartlettroscoe
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED
Pull Request Auto Testing has PASSED
Test names: Trilinos_PR_gcc-8.3.0, Trilinos_PR_gcc-7.2.0-serial, Trilinos_PR_gcc-7.2.0-debug, Trilinos_PR_intel-17.0.1, Trilinos_PR_clang-10.0.0, Trilinos_PR_cuda-10.1.243, Trilinos_PR_python3, Trilinos_PR_cuda-11.4.2-uvm-off
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ rppawlo tasmith4 ]!
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged
Merge on Pull Request# 11093: IS A SUCCESS - Pull Request successfully merged
CC: @trilinos/teuchos
Description
The main thing this PR does is change the TriBITS and Trilinos CUDA-related TPLs from using find_package(CUDA) (or just raw find operations) to using find_package(CUDAToolkit) to avoid imported target namespace clashes with downstream CMake projects that call find_package(CUDAToolkit) (see #10954 and especially #10954 (comment)).

This PR also includes the updated snapshot of TriBITS 'master', which also includes the TriBITS PRs and commits made directly on 'master'.
Of those other changes, the biggest is the renaming of TriBITS packages and TPLs, see the details in:
Instructions for reviewers
cmake/tribits/core/std_tpls/FindTPLCUDA.cmake.
Testing
I tested offline the Trilinos PR builds for 'cuda-11' on 'ascicgpu16' and the 'cuda-10' build on 'vortex'. I also tested against the Spack manager build of Nalu-wind and verified that it fixes the build problem reported in #10954.
For Trilinos PR testing, the repo state was:
Trilinos PR builds repo state:
Trilinos PR rhel7 'cuda-11.4.2-uvm-off' build on 'ascicgpu16'
To test this I ran:
which submitted to CDash here showing:
Log file run_trilinos_pr_builds.rhel7-cuda.all.out
Trilinos PR rhel7 'ats2' build on 'vortex'
With that same repo state on 'vortex', I first built on the login node with:
Log file run_trilinos_pr_builds.ats2.all.build.out.txt
Then I ran the tests on the launch node with:
with log file run_trilinos_pr_builds.ats2.all.test.out.txt
There were three failing tests:
As I found before, those fail because you can't install from the compute node with this setup. But when I re-ran them (as described below)
Analysis of failing TrilinosInstallTests tests on 'ats2' build
The 3 failing tests were:
As I found before, those fail because you can't install from the compute node with this setup.
To make those pass:
Wow, the test TrilinosInstallTests_simpleBuildAgainstTrilinos failed! The LastTest.log file showed:
Let's try running that test from the login node:
This time, only the test fails.
Trying to run the test from the launch node
The test fails due to an insufficient CUDA version exception:
The configuration of the project simpleBuildAgainstTrilinos on the login node showed:
We did not see that warning when configuring Trilinos on the login node.
I wonder what is going on here and if this is going to impact Trilinos customers or not on this system.
I think we will just have to wait and see what happens. But if Trilinos can configure successfully without this warning, it must be possible for other CMake projects to successfully call find_package(CUDAToolkit) as well.
Therefore, I am going to assume this is not going to be a problem for application codes. And if it is, we will work through the issues.
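For reference, the kind of downstream usage discussed above can be sketched as a small standalone CMake project. This is only an illustration under stated assumptions: the project name `DownstreamCudaSketch` and the target name `downstream_cuda_deps` are hypothetical, and nothing here is taken from simpleBuildAgainstTrilinos, Nalu-Wind, or the TriBITS FindTPLCUDA.cmake module. It only uses the standard FindCUDAToolkit module that ships with CMake 3.17 and later.

```cmake
# Hypothetical downstream CMakeLists.txt (a sketch, not code from this PR).
cmake_minimum_required(VERSION 3.17)  # FindCUDAToolkit was added in CMake 3.17
project(DownstreamCudaSketch LANGUAGES CXX)

# Locate the CUDA toolkit. This defines imported targets in the CUDA::
# namespace (for example CUDA::cudart, CUDA::cublas, CUDA::cusparse).
find_package(CUDAToolkit REQUIRED)
message(STATUS "Found CUDAToolkit ${CUDAToolkit_VERSION} in ${CUDAToolkit_BIN_DIR}")

# Collect the CUDA libraries on an interface target (hypothetical name) so
# that other targets in the project can link against them in one place.
add_library(downstream_cuda_deps INTERFACE)
target_link_libraries(downstream_cuda_deps INTERFACE
  CUDA::cudart CUDA::cublas CUDA::cusparse)

# A real application would typically follow this with
# find_package(Trilinos REQUIRED) and link its own targets against Trilinos.
```

Configuring a project like this on both the login node and the launch node would be one way to check whether find_package(CUDAToolkit) behaves the same for application codes as it does for Trilinos itself on a given system.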
Testing against nalu-wind with Spack Manager on 'ascicgpu16'
Leveraging the directories I used to reproduce the problem in #10954 (comment), which already have everything built (except nalu-wind), I used the updated Trilinos branch from this PR and got a passing build for nalu-wind and a partially passing set of unit tests (which was said to be okay in #10954 (comment)). Details are given below.
Details for testing against nalu-wind with spack manager
I have completed testing of these TriBITS and Trilinos changes on the cuda-11 build on 'ascicgpu16' and for the cuda-10 build on 'vortex' (see above). Now to test against Nalu-Wind.
The Trilinos repo is in a bit of a mess due to Spack patching and writing a bunch of non-ignored files:
First, I add ignores for these files to the local Trilinos .git/info/exclude file.
To update safely I do:
And with that, the repo versions I am using are:
Now to test the builds of the updated installs of Trilinos and Nalu-Wind:
Now to test it:
As described in #10954 (comment), this segfault is expected.
Therefore, the updated Trilinos branch '10954-nalu-wind-fix-cuda' would seem to be validated against Nalu-Wind. [PASSED]