taylorGreenVortex_p3 is unstable on Trilinos develop #858
Comments
I'll take a look on our ascicgpu systems to see if it happens there too.
I ran with cuda-11/gcc7.2, Trilinos commit 474975f, and nalu-wind commit 176f3c3 without seeing this issue.
The taylorGreenVortex_p3 momentum solve differs from the other tests in that it uses Belos's block GMRES. That would be my main suspicion; I'd try switching to pseudoblock GMRES or BiCGStab instead. Is UVM off in the Eagle Trilinos configuration?
474975f still blows up for me. I was on 5a2c077 when I first tested. Looking at the Trilinos CMakeCache.txt, it appears that UVM is on for all the relevant variables: I did build with CUDA 10.2.89.
I switched to cuda 10.2.89 and was able to reproduce the issue locally,
I'm not sure exactly what's going wrong, but I'll take a look at it Monday.
I do not see this issue on Summit running CUDA 10.2.89 and gcc 7.4.0 with Trilinos develop 1487be4, which is up to date as of 5/3/21.
Seems like the issue is with the multivector
FYI @kddevin
@rcknaus Can you post a cmake configure recipe and the module environment for the passing/not passing builds?
@jhux2 See ascic-build-env.tar.gz on ascicgpu24. With cuda-10, trilinos/Trilinos@cbeb75a passed and trilinos/Trilinos@474975f failed; 474975f passed with cuda-11.
Looking at it a bit more, rather than being a cuda-10 vs cuda-11 thing,
@rcknaus Thanks for your leg work. I ran the Tpetra unit test suite on ascicgpu031. If @rcknaus How did you narrow it down to
I can look at TpetraCore_CrsMatrix_Bug8794_MPI_4; Kyungjoo was having some problems with that test as well. That test exercises dense matrix rows more so than other Tpetra tests. I wouldn't expect doExport to care about the density, but perhaps it does. Does the Nalu test have dense rows?
@jhux2 nothing special. The residual computation for the gradient is wrong in the This is the version with
@rcknaus Ok, this is good data to have. I'll see if I can craft a stand-alone test from it.
@jhux2 The failing Tpetra test does appear to be giving different results after doImport depending on TPETRA_ASSUME_CUDA_AWARE_MPI. I am tracking it today.
@rcknaus I pushed a fix in trilinos/Trilinos#9117
@kddevin I tested trilinos/Trilinos#9117 with nalu-wind and confirmed that
This appears to be resolved as the test seems to be stable in our nightly builds. Thanks for everyone's help tracking this down.
The taylorGreenVortex_p3 regression test is blowing up on GPUs (Eagle) with Trilinos develop. This test uses all Trilinos/Tpetra algorithms, and I think it is matrix-free. I've tried two versions of Trilinos, develop and master, both current as of 4/30/2021. Master works fine. On develop, the first iteration seems OK; however, the second shows instability. Ultimately the norms grow out of control.
1/1 Equation System Iteration
dpdx 6 5.22243e-07 0.00112119 1
pressure 9 2.23513e-05 0.0597388 1
dpdx 7 9.86341e-08 0.000835196 0.744917
1/2 myLowMach
velocity 200 0.0495843 0.00127094 1
pressure 12 0.000334666 1.45767 24.4007
dpdx 9 1.02156e-06 0.0131696 11.746
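For anyone triaging a similar blow-up, here is a minimal sketch that recomputes the last column of the log excerpt above. It assumes (this is not stated in the issue) that the five columns are equation name, linear iterations, linear residual, nonlinear residual, and the nonlinear residual scaled by that equation's first reported value. The helper names and the growth threshold of 10 are hypothetical, chosen only for illustration.

```python
# Log excerpt copied verbatim from the issue body.
# ASSUMPTION: columns are name, linear iterations, linear residual,
# nonlinear residual, scaled nonlinear residual.
LOG = """\
1/1 Equation System Iteration
dpdx 6 5.22243e-07 0.00112119 1
pressure 9 2.23513e-05 0.0597388 1
dpdx 7 9.86341e-08 0.000835196 0.744917
1/2 myLowMach
velocity 200 0.0495843 0.00127094 1
pressure 12 0.000334666 1.45767 24.4007
dpdx 9 1.02156e-06 0.0131696 11.746
"""


def scaled_residuals(log_text):
    """Return (name, nonlinear, reported_scaled, recomputed_scaled) per row."""
    first_seen = {}  # first nonlinear residual reported for each equation
    rows = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip iteration headers such as "1/2 myLowMach"
        try:
            name = parts[0]
            int(parts[1])    # linear iteration count
            float(parts[2])  # linear residual
            nonlinear = float(parts[3])
            reported = float(parts[4])
        except ValueError:
            continue  # not a residual row
        reference = first_seen.setdefault(name, nonlinear)
        rows.append((name, nonlinear, reported, nonlinear / reference))
    return rows


def growing_equations(rows, factor=10.0):
    """Names whose recomputed scaled residual exceeds `factor` (hypothetical threshold)."""
    flagged = []
    for name, _, _, recomputed in rows:
        if recomputed > factor and name not in flagged:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    rows = scaled_residuals(LOG)
    for name, nonlinear, reported, recomputed in rows:
        print(f"{name:10s} nonlinear={nonlinear:.6g} "
              f"reported={reported:.6g} recomputed={recomputed:.6g}")
    print("growing:", growing_equations(rows))
```

Under that column assumption, the recomputed ratios agree with the logged values, which is consistent with the pressure and dpdx residuals in the second iteration being more than an order of magnitude above their first-iteration values.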