Several MueLu and Panzer CUDA debug test failures on Power systems showing 'Concurrent modification of host and device views in DualView' starting 2020-02-20 #6882
Comments
Then it is likely the changes in 4ef6a8e from @ndellingwood.
The changes in the sha referenced simply replaced
The message
We can always back out my changes to see if that helps.
@jhux2 I looked at the commit associated with your SHA; I don't see how those changes could have triggered this either...
😕
Then can someone please triage and debug the failures? The new commits pulled on testing day 2020-02-20, when these failures started, are shown here. There were not that many commits or files changed.
I'm at a meeting this week and am not going to be able to do this quickly. Sent with GitHawk
Is this still an issue?
@jhux2, please click the above link to results on CDash under "Current Status on CDash". That should hopefully make the status pretty clear. I have tried to make this as easy as I can for developers to see the status of the associated tests on CDash. Also note that @rmmilewi will be working on automated updates of GitHub issues like this with results from CDash in #3887.
I can reproduce this error. However, if I check out and build Trilinos from the previous night (when the tests were passing on the dashboard), the same failure occurs. Did something perhaps change in the vortex environment or modules? For the record, here's the stack for one of the tests:
@jhux2, between what dates?
@jhux2, I can't find any changes in the 'vortex' system between 2020-02-19 and 2020-02-20, but there is (at least what appears to be) a bug in the Kokkos CMake rebuild logic where the Kokkos libraries were not being rebuilt correctly after the Kokkos 2.99 update on 2020-02-03 (see #6855), and I had to force a build from scratch in commit 8aa9287, which was pulled on the testing day 2020-02-20:
That resulted in correctly built Kokkos libraries on 2020-02-20 (from a source-code update that actually occurred on 2020-02-03). Could it be that the updated Kokkos (once the libraries were rebuilt correctly) is asserting a DualView error in its debug-mode checking that was already present in this MueLu code?
Note that the optimized builds do not show any of these test failures, so it is the debug-mode checking that is catching this. Does this make sense, @ndellingwood? (See the above comment.)
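For context, the assertion being tripped is a debug-only check inside Kokkos::DualView: marking one side modified while the other side is already marked modified, without a sync in between. A minimal sketch of that pattern (not the actual MueLu code) would look like this:

```c++
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // A DualView pairs a host and a device allocation plus "modified" flags.
    Kokkos::DualView<double*> dv("dv", 10);

    // Mark the host side modified and write to it ...
    dv.modify_host();
    dv.h_view(0) = 1.0;

    // ... then mark the device side modified WITHOUT calling dv.sync_device()
    // first.  In a debug build this second modify_* call aborts with
    // "Concurrent modification of host and device views in DualView";
    // optimized builds skip the check, which is why only the _dbg builds
    // report these failures.
    dv.modify_device();
  }
  Kokkos::finalize();
  return 0;
}
```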
@bartlettroscoe I think this makes sense, and it was similar to previous cases of these types of errors that popped up after the 2.9.99 merge. @jhux2 in past cases where I debugged these issues I used the kokkos-tools kernel logger to help find the kernel where the failure was triggered (I didn't have access to a machine that allowed ssh -X for MPI debugging); hopefully there are only one or two culprits responsible for the majority of the errors. I'm OOO tomorrow but can help chase these down when I get back to the office if they're elusive.
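The kernel-logger workflow mentioned above boils down to loading the kokkos-tools logger at run time and reading off the last kernel reported before the abort. The sketch below illustrates that workflow; the tool path, region name, and kernel label are placeholders, not taken from this issue:

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // Before running the failing test, point Kokkos at the kernel logger
  // built from kokkos-tools (the path below is illustrative):
  //   export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/kp_kernel_logger.so
  // The logger prints each kernel as it begins and ends, so the last kernel
  // reported before the DualView abort is the likely culprit.
  Kokkos::initialize(argc, argv);
  {
    // Named regions and kernel labels make the logger output easier to scan.
    Kokkos::Profiling::pushRegion("suspect setup phase");
    Kokkos::parallel_for(
        "labeled_kernel", 100, KOKKOS_LAMBDA(const int i) { (void)i; });
    Kokkos::Profiling::popRegion();
  }
  Kokkos::finalize();
  return 0;
}
```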
@bartlettroscoe Looks like we fixed this. Can you confirm?
@cgcgcg, what does CDash show through the link in the section "Current Status on CDash" above?
Looks like all the tests that still fail are failing due to system errors.
As shown in this query, if you filter out the random failures you can see there have been no failures of these tests since testing day 2020-04-22. I guess that was due to the fix in PR #7227, merged on 2020-04-22?
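For reference, this class of DualView error is normally fixed by restoring the sync/modify discipline around the offending view. The sketch below shows only that general pattern; it is not claimed to be what PR #7227 actually changed:

```c++
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// General repair pattern: sync a DualView to the memory space you are about
// to touch, and only then mark that space as modified.
void fill_on_host(Kokkos::DualView<double*>& dv) {
  dv.sync_host();    // bring the host copy up to date if the device copy is newer
  dv.modify_host();  // declare that the host copy is about to change
  for (size_t i = 0; i < dv.extent(0); ++i) dv.h_view(i) = double(i);
}

void scale_on_device(Kokkos::DualView<double*>& dv) {
  dv.sync_device();    // copies host -> device and clears the host "modified" flag
  auto d = dv.d_view;  // capture only the device view in the lambda
  const int n = static_cast<int>(dv.extent(0));
  Kokkos::parallel_for(
      "scale", n, KOKKOS_LAMBDA(const int i) { d(i) *= 2.0; });
  dv.modify_device();
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::DualView<double*> dv("dv", 8);
    fill_on_host(dv);
    scale_on_device(dv);  // no "Concurrent modification" abort in debug builds
  }
  Kokkos::finalize();
  return 0;
}
```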
Yep, it looks like we got it :-)
CC: @trilinos/kokkos-kernels, @trilinos/muelu, @trilinos/panzer, @srajama1 (Trilinos Linear Solvers Product Lead), @mperego (Trilinos Discretizations Product Lead)
Next Action Status
Description
As shown in this query the tests:
MueLu_Maxwell3D-Tpetra_MPI_4
MueLu_UnitTestsIntrepid2Tpetra_MPI_1
MueLu_UnitTestsIntrepid2Tpetra_MPI_4
MueLu_UnitTestsTpetra_MPI_1
MueLu_UnitTestsTpetra_MPI_4
MueLu_VarDofDriver_MPI_1
MueLu_VarDofDriver_MPI_2
PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1
PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4
in the Power/GPU CUDA builds:
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
started failing on testing day 2020-02-20.
Clicking on "Show Matching Output" in the upper right of the above query shows the failures:
The new commits that were pulled on testing day 2020-02-20 are shown, for example, here. Looking over that set of commits, the only candidates that seem like they could have triggered this are e70d170 from @jhux2:
and 4ef6a8e from @ndellingwood:
but there are also some "research" commits from @lucbv to MueLu (which one would hope would not impact tests in Panzer).
Current Status on CDash
Steps to Reproduce
One should be able to reproduce this failure on the machines 'ride', 'white', 'waterman', or 'vortex' as described in:
More specifically, the commands given for the system 'ride' on the machines 'white' (SON) or 'ride' (SRN) are provided at:
The exact commands to reproduce the failures for the build
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
, for example, on 'white' or 'ride' should be: