-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move Frontier build to gfortran and latest rocm #2969
Move Frontier build to gfortran and latest rocm #2969
Conversation
I used this branch, before the last merge with
I |
A two-node run of
|
Am I correct that the current s/w stack can't be run from this branch? |
Yes, as written, it replaces the current stack. Shall I modify it so that the previous stack can still be selected? It could key off of the compiler choice. The change to |
Tagging @rljacob @ndkeen @jgfouca @PeterCaldwell @AaronDonahue. Trey, I don't know the right answers to your questions in terms of what everyone wants. But in terms of technical steps:
|
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
I marked this PR as WIP so that the autotester doesn't run on it until it's ready. |
I've isolated at least one nondeterministic diff to P3 (small-kernels version). I'll first try the deopt that worked for the (differently expressed) issue with P3 using ROCm 5.7. I'll also look at the monolithic version of P3. If nothing works, then I'll have to do some manual work to isolate the problem further. |
@trey-ornl the P3 deoptimization is working. Thus, in your performance testing, if you haven't started already, I suggest adding this: diff --git a/cime_config/machines/cmake_macros/craygnu-hipcc.cmake b/cime_config/machines/cmake_macros/craygnu-hipcc.
cmake
index 8322c5d3a9..6cb79c0146 100644
--- a/cime_config/machines/cmake_macros/craygnu-hipcc.cmake
+++ b/cime_config/machines/cmake_macros/craygnu-hipcc.cmake
@@ -5,7 +5,7 @@ set(SCC "cc")
set(SCXX "hipcc")
set(SFC "ftn")
-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU")
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2")
if (COMP_NAME STREQUAL gptl)
string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDA
Y")
endif()
diff --git a/components/eamxx/src/physics/p3/disp/p3_main_impl_part2_disp.cpp b/components/eamxx/src/physics/p3/disp/
p3_main_impl_part2_disp.cpp
index 2b619d54bf..28445a35c3 100644
--- a/components/eamxx/src/physics/p3/disp/p3_main_impl_part2_disp.cpp
+++ b/components/eamxx/src/physics/p3/disp/p3_main_impl_part2_disp.cpp
@@ -9,7 +9,9 @@ namespace p3 {
* Implementation of p3 main function. Clients should NOT #include
* this file, #include p3_functions.hpp instead.
*/
-
+#ifdef SCREAM_SYSTEM_WORKAROUND_P3_PART2
+# pragma clang optimize off
+#endif
template <>
void Functions<Real,DefaultDevice>
::p3_main_part2_disp(
@@ -130,7 +132,9 @@ void Functions<Real,DefaultDevice>
if (!hydrometeorsPresent(i)) return;
});
}
-
+#ifdef SCREAM_SYSTEM_WORKAROUND_P3_PART2
+# pragma clang optimize on
+#endif
} // namespace p3
} // namespace scream I'll edit this comment with more information as I gather it. First, with the above, I got 11 passes, 0 fails, of ERS_P8x1_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnu-hipcc.scream-internal_diagnostics_level. All final bfbhash lines are identical. Without the fix, this test failed every time. Second, I tried monolithic-kernel P3 and get
The core file with gdb doesn't give me a stack trace, but the hasher output is enough to know that the crash is in P3, as we would expect given this one change. Third, I tried to deopt the corresponding "part2" section of code. It either doesn't solve the problem or the pragma doesn't work within a kernel. If I put the pragma outside the kernel invocation (deoptimizing the entire monolithic P3), then the test runs. I'm at 5 passes, 0 fails, all final bfbhash lines identical. The lines are also identical to the small-kernels-P3 case, although that isn't necessary. |
@trey-ornl you should key off the compiler choice but we want to deprecate all machine entries with "scream" in the title. Maybe now would be a good time to do that. |
Also I think the E3SM convention for this compiler option name is "craygnugpu" . We have not put the vendor-specific GPU language in the names. Not sure if we should. |
Why would the compiler name not be just gnugpu? |
"craygnugpu" says to me that the Cray wrappers will use the Gnu compilers to offload to GPUs. That is not the case here. The intended message for the name to convey is this: "Use Cray wrappers with Gnu compilers for the host, and use the AMD Hip compiler for the GPU (currently named 'hipcc')." Maybe "craygnuamdpgu" or "craygnu-amdgpu"? |
The name "gnugpu" says to me that the Gnu compilers are used directly ("g++", "gfortran"), and that they are used to compile for the GPUs. Neither is true in this case. |
@trey-ornl good point about the GPU compile. I guess craygnuamdpgu is what we should use. Yes "craygnu" implies cray wrappers while "gnu" is just calling gfortran directly. Ignoring ones with "scream", we currently have:
So we'd be adding "craygnuamdgpu_frontier.cmake" |
Ignore my comment about editing the "frontier" entry. We'll do that later. Go ahead and add to frontier-scream-gpu |
@ambrad, are the tests in |
I don't use those tests for nondeterminism analysis and have not run them in years. I consider a single-node ne30 ERS test to be the most useful test configuration. |
Why did you merge master in to this feature branch? We discourage that unless you have a specific reason. |
Newbie mistake. How do you test your branch with all the latest changes without merging them in first? Is the expectation that we only test our changes relative to where the branch started? Please let me know if there is TFM that I should R. |
To answer your question, I think CI does a merge to master and tests that. You could also do a merge to master in your local copy, test, then reset. |
Rebase ( |
@trey-ornl , would ou mind manually testing on |
Just built and ran an ne30 performance test from @ndkeen yesterday on Frontier with these changes merged into master. New build (compiler |
I did a test of the decadal run with the new Timing results from 5-day decadal run on 2048 nodes of Frontier.
Previous version of new
Current new
We will likely want to update the modules again once Frontier has a fixed version of Libfabric. |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Looks good. Visual inspection shows that the current build (crayclang-scream) is available again after the most recent commits to this PR. Trey, have you checked that crayclang-scream in this PR reproduces crayclang-scream on master? One way to check would be to diff the |
Status Flag 'Pre-Test Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED by label AT: PRE-TEST INSPECTED! Autotester is Removing Label; this inspection will remain valid until a new commit to source branch is performed. |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
Using Repos:
Pull Request Author: trey-ornl |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: SCREAM_PullRequest_Autotester_Weaver
Jenkins Parameters
Build InformationTest Name: SCREAM_PullRequest_Autotester_Mappy
Jenkins Parameters
SCREAM_PullRequest_Autotester_Weaver # 6089 PASSED (click to see last 100 lines of console output)
SCREAM_PullRequest_Autotester_Mappy # 5859 FAILED (click to see last 100 lines of console output)
|
@ambrad I confirmed with the decadal run on Frontier that |
Since Andrew approved, and the fail reported is unrelated to this PR as expected, I have set this PR to automerge and skipped testing. Other frontier work depends on this PR, so hopefully this will let the PR sooner rather than later. Thanks @trey-ornl! 🎉 |
I'll go ahead and manually merge now that we have approval and confirmation that everything works as expected on frontier. Thanks for putting this in @trey-ornl ! |
Move Frontier build to gfortran and latest rocm
Added new
cime_config/machines/cmake_macros/craygnu-hipcc.cmake
file, which uses the Cray compiler wrappers with Gnu compilers, along withhipcc
for AMD GPUs.Changed Adios2 path to an appropriate OLCF build.
Deleted all Crusher files. Crusher no longer exists.