Several MueLu and Panzer CUDA debug test failures on Power systems showing 'Concurrent modification of host and device views in DualView' starting 2020-02-20 #6882

Closed
bartlettroscoe opened this issue Feb 21, 2020 · 23 comments
Labels
  • ATDM Sev: Blocker (Problems that make Trilinos unfit to be adopted by one or more ATDM APPs)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • PA: Discretizations (Issues that fall under the Trilinos Discretizations Product Area)
  • PA: Linear Solvers (Issues that fall under the Trilinos Linear Solvers Product Area)
  • pkg: MueLu
  • pkg: Panzer
  • type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

CC: @trilinos/kokkos-kernels, @trilinos/muelu, @trilinos/panzer, @srajama1 (Trilinos Linear Solvers Product Lead), @mperego (Trilinos Discretizations Product Lead)

Next Action Status

Description

As shown in this query the tests:

  • MueLu_Maxwell3D-Tpetra_MPI_4
  • MueLu_UnitTestsIntrepid2Tpetra_MPI_1
  • MueLu_UnitTestsIntrepid2Tpetra_MPI_4
  • MueLu_UnitTestsTpetra_MPI_1
  • MueLu_UnitTestsTpetra_MPI_4
  • MueLu_VarDofDriver_MPI_1
  • MueLu_VarDofDriver_MPI_2
  • PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1
  • PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4

in the Power/GPU CUDA builds:

  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug

started failing on testing day 2020-02-20.

Clicking on "Show Matching Output" on the upper right on the above query shows the failures:

Kokkos::DualView::modify ERROR: Concurrent modification of host and device views in DualView "MV::DualView"

The new commits that were pulled on testing day 2020-02-20 are shown, for example, here. Looking over that set of commits, the only candidates that seem like they could have triggered this are e70d170 from @jhux2:

e70d17089e:  Ifpack2: MueLu: fix clang 7.0.1 warnings
Author: Jonathan Hu <[email protected]>
Date:   Wed Feb 19 18:31:31 2020 -0700

M	packages/ifpack2/src/Ifpack2_BandedContainer_decl.hpp
M	packages/muelu/test/convergence/Convergence.cpp

and 4ef6a8e from @ndellingwood:

4ef6a8ea52:  trsv: workaround intel/19.0.4 interal compiler error
Author: Nathan Ellingwood <[email protected]>
Date:   Tue Feb 18 17:45:31 2020 -0700

M	packages/kokkos-kernels/src/sparse/impl/KokkosSparse_trsv_impl.hpp

but there are some "research" commits from @lucbv to MueLu (that one would hope would not impact tests in Panzer).

Current Status on CDash

Steps to Reproduce

One should be able to reproduce this failure on the machines 'ride', 'white', 'waterman', or 'vortex' as described in:

More specifically, the commands given for the system 'ride' on the machines 'white' (SON) or 'ride' (SRN) are provided at:

The exact commands to reproduce the failures for the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug, for example, on 'white' or 'ride' should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
    Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON  -DTrilinos_ENABLE_Panzer=ON \
 $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j4
@bartlettroscoe added the labels type: bug, pkg: MueLu, pkg: Panzer, client: ATDM, ATDM Sev: Blocker, PA: Linear Solvers, and PA: Discretizations on Feb 21, 2020
@bartlettroscoe
Member Author

@bathmatt, @jmgate, @rppawlo, this might impact EMPIRE, not sure.

@jhux2
Member

jhux2 commented Feb 21, 2020

Looking over that set of commits the only candidates that seems like it could have triggered this are e70d170 from @jhux2:

I don't see how these changes could be causing the reported failures.

@bartlettroscoe
Member Author

I don't see how these changes could be causing the reported failures.

Then it is likely the changes in 4ef6a8e from @ndellingwood.

@ndellingwood
Contributor

Then it is likely the changes in 4ef6a8e from @ndellingwood.

The changes in the sha referenced simply replaced /= operators for an intel/19 compiler issue (e.g. a /= b; replaced by a = a / b;) and would have no connection with these failures.
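For readers unfamiliar with this kind of compiler workaround, the rewrite has roughly the following shape. This is a hypothetical sketch with made-up helper names (`divide_compound`, `divide_expanded`), not the actual KokkosSparse trsv code; it only illustrates that the two spellings compute the same value and touch no DualView state.

```cpp
#include <cassert>

// Hypothetical helpers (not the KokkosSparse trsv code) showing the two
// spellings of the same division.
inline double divide_compound(double a, double b) {
  a /= b;  // original form, using the compound assignment operator
  return a;
}

inline double divide_expanded(double a, double b) {
  a = a / b;  // workaround form, spelled out as division plus assignment
  return a;
}
```

For built-in floating-point types both functions perform the same single division, so a change of this kind should not alter any results.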

@ndellingwood
Contributor

The message Concurrent modification of host and device views in DualView means that someone made a change where modify calls are made on device and host without a sync or clearing of flags in between.
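To make that rule concrete, here is a standalone toy model of the modify/sync bookkeeping. It is a sketch under stated assumptions, not the Kokkos source: `ToyDualView` and all of its members are hypothetical names, the real class tracks more state, and the real check aborts the program rather than throwing (an exception is used here only so the failure can be observed).

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical toy model of DualView-style bookkeeping (not Kokkos code).
// Each side has a "modified" flag; a sync copies data and clears a flag.
struct ToyDualView {
  double host_value = 0.0;
  double device_value = 0.0;  // stands in for data living in device memory
  bool host_modified = false;
  bool device_modified = false;

  // Mark the host copy as changed. Fails if the device copy already has
  // changes that were never synced back: the "concurrent modification".
  void modify_host() {
    if (device_modified)
      throw std::logic_error("Concurrent modification of host and device views");
    host_modified = true;
  }

  // Mirror image of modify_host() for the device copy.
  void modify_device() {
    if (host_modified)
      throw std::logic_error("Concurrent modification of host and device views");
    device_modified = true;
  }

  // Bring the device copy up to date with pending host changes.
  void sync_device() {
    if (host_modified) {
      device_value = host_value;
      host_modified = false;
    }
  }

  // Bring the host copy up to date with pending device changes.
  void sync_host() {
    if (device_modified) {
      host_value = device_value;
      device_modified = false;
    }
  }

  // Discard the bookkeeping without copying anything.
  void clear_sync_state() { host_modified = device_modified = false; }
};

// Correct pattern: modify host, sync to device, then modify device.
inline bool correct_sequence_succeeds() {
  ToyDualView dv;
  dv.modify_host();
  dv.host_value = 1.0;
  dv.sync_device();
  dv.modify_device();
  dv.device_value = 2.0;
  dv.sync_host();
  return dv.host_value == 2.0;
}

// Buggy pattern: modify host and then device with no sync in between.
inline bool buggy_sequence_is_caught() {
  ToyDualView dv;
  dv.modify_host();
  try {
    dv.modify_device();
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

In this model the only ways to get from a host modification to a device modification are a sync or an explicit clearing of the flags, which is the invariant the error message is reporting a violation of.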

@jhux2
Member

jhux2 commented Feb 21, 2020

We can always back out my changes to see if that helps.

@ndellingwood
Contributor

We can always back out my changes to see if that helps.

@jhux2 I looked at the commit associated with your SHA, don't see how those changes could have triggered this either...

@jhux2
Member

jhux2 commented Feb 21, 2020

😕

@bartlettroscoe
Member Author

The message Concurrent modification of host and device views in DualView means that someone made a change where modify calls are made on device and host without a sync or clearing of flags in between.

Then can someone please triage and debug the failures? The new commits pulled on testing day 2020-02-20, when these failures started, are shown here. There were not that many commits or files changed.

@jhux2
Member

jhux2 commented Feb 24, 2020

I’m at a meeting this week and am not going to be able to do this quickly.


@jhux2
Member

jhux2 commented Feb 27, 2020

Is this still an issue?

@bartlettroscoe
Member Author

Is this still an issue?

@jhux2, please click the above link to results on CDash under "Current Status on CDash". That should hopefully make the status pretty clear. I have tried to make it as easy as I can for developers to see the status of the associated tests on CDash.

Also note that @rmmilewi will be working on automated updates of GitHub issues like this with results from CDash in #3887.

@jhux2
Member

jhux2 commented Feb 29, 2020

I can reproduce this error. However, if I checkout/build Trilinos for the previous night (when the tests were passing on the dashboard) the same failure occurs. Did something perhaps change in the vortex environment or modules?

For the record, here's the stack for one of the tests:

Kokkos::Impl::host_abort@/vscratch1/jhu/tmp/Trilinos/packages/kokkos/core/src/impl/Kokkos_Error.cpp:63
Kokkos::abort@/vscratch1/jhu/tmp/Trilinos/packages/kokkos/core/src/impl/Kokkos_Error.hpp:168
Kokkos::DualView<double**,@/vscratch1/jhu/tmp/Trilinos/packages/kokkos/containers/src/Kokkos_DualView.hpp:608
Tpetra::MultiVector<double,@/vscratch1/jhu/tmp/Trilinos/packages/tpetra/core/src/Tpetra_MultiVector_decl.hpp:1460
Tpetra::CrsMatrix<double,@/vscratch1/jhu/tmp/Trilinos/packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:3923
Xpetra::TpetraCrsMatrix<double,@/vscratch1/jhu/tmp/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_TpetraCrsMatrix_def.hpp:367
Xpetra::CrsMatrixWrap<double,@/vscratch1/jhu/tmp/Trilinos/packages/xpetra/sup/Matrix/Xpetra_CrsMatrixWrap_def.hpp:305
Xpetra::MatrixUtils<double,@/vscratch1/jhu/tmp/Trilinos/packages/xpetra/sup/Utils/Xpetra_MatrixUtils.hpp:523
MueLu::RAPFactory<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/Misc/MueLu_RAPFactory_def.hpp:223
MueLu::TwoLevelFactoryBase::CallBuild@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_TwoLevelFactoryBase.hpp:151
MueLu::Level::Get<Teuchos::RCP<Xpetra::Operator<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/Interface/../MueCentral/MueLu_Level.hpp:204
MueLu::TopRAPFactory<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/MueCentral/MueLu_TopRAPFactory_def.hpp:76
MueLu::Hierarchy<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/MueCentral/MueLu_Hierarchy_def.hpp:419
MueLu::HierarchyManager<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/Interface/MueLu_HierarchyManager.hpp:241
MueLu::ParameterListInterpreter<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/src/Interface/MueLu_ParameterListInterpreter_def.hpp:2269
MueLu::CreateXpetraPreconditioner<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/adapters/xpetra/MueLu_CreateXpetraPreconditioner.hpp:97
main_<double,@/vscratch1/jhu/tmp/Trilinos/packages/muelu/test/vardofpernode/VarDofDriver.cpp:392
main@/vscratch1/jhu/tmp/Trilinos/packages/muelu/test/vardofpernode/VarDofDriver.cpp:499

@bartlettroscoe
Member Author

Did something perhaps change in the vortex environment or modules?

@jhux2, between what dates?

@bartlettroscoe
Member Author

@jhux2, I can't find any changes in the 'vortex' system between 2020-02-19 and 2020-02-20. However, there is (at least what appears to be) a bug in the Kokkos CMake rebuild: the Kokkos libraries were not being rebuilt correctly after the Kokkos 2.9.99 update on 2020-02-03 (see #6855). I had to force a build from scratch in commit 8aa9287, which was pulled on testing day 2020-02-20:

8aa92870d6:  Do a rebuild of all of the ATDM Trilinos nightly builds (SPAR-727, #6855)
Author: Roscoe A. Bartlett <[email protected]>
Date:   Wed Feb 19 19:28:14 2020 -0700

M	cmake/ctest/drivers/atdm/TrilinosCTestDriverCore.atdm.cmake

That resulted in correctly built Kokkos libraries on 2020-02-20 (from a source code update that actually occurred on 2020-02-03).

Could it be that the updated Kokkos (once the libraries were rebuilt correctly) is asserting a dual view error in debug-mode checking that was already there in this MueLu code?

@bartlettroscoe
Member Author

Note that the optimized builds do not show any of these test failures, so it is the debug-mode checking that is catching this. Does this make sense, @ndellingwood? (See the comment above.)
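The debug-versus-optimized behavior can be illustrated with a small standalone model. This is a sketch only: `ToyFlags` and the `DebugChecks` template parameter are hypothetical stand-ins for a check that is compiled in only for debug builds, and none of this is Kokkos code.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical toy model (not Kokkos code): the concurrent-modification check
// is active only when DebugChecks is true, mimicking a debug-only assertion.
template <bool DebugChecks>
struct ToyFlags {
  bool host_modified = false;
  bool device_modified = false;

  void modify_host() {
    host_modified = true;
    check();
  }

  void modify_device() {
    device_modified = true;
    check();
  }

 private:
  // In the "optimized" instantiation this never throws, so the misuse
  // goes unnoticed even though both flags end up set.
  void check() const {
    if (DebugChecks && host_modified && device_modified)
      throw std::logic_error("Concurrent modification of host and device views");
  }
};

// Runs the same buggy call sequence and reports whether the check fired.
template <bool DebugChecks>
bool buggy_sequence_throws() {
  ToyFlags<DebugChecks> flags;
  try {
    flags.modify_host();
    flags.modify_device();  // no sync in between
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

With `DebugChecks` set to false the same buggy sequence runs to completion, which is consistent with the optimized builds staying green while the debug builds fail on a misuse that was present all along.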

@ndellingwood
Contributor

Could it be that the updated Kokkos (once the libraries were rebuilt correctly) is asserting a dual view error in debug-mode checking that was already there in this MueLu code

@bartlettroscoe I think this makes sense, and was similar to previous cases of these types of errors that popped up after the 2.9.99 merge.

@jhux2, in past cases where I debugged these issues, I used the kokkos-tools kernel logger to help find the kernel where the failure was triggered (I didn't have access to a machine that allowed ssh -X for MPI debugging). Hopefully there are only one or two culprits causing the majority of the errors. I'm OOO tomorrow but can help chase these down when I get back to the office if they prove elusive.
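For anyone who has not used that approach, the idea behind a kernel logger can be sketched as a tiny callback library. This is a minimal illustration, not the kokkos-tools source: the `extern "C"` hook names and signatures follow the Kokkos profiling interface as I understand it, while the `toy_logger_*` names and the bookkeeping variables are hypothetical. In Kokkos releases of that era such a library was typically selected through the `KOKKOS_PROFILE_LIBRARY` environment variable; treat that as an assumption to check against your Kokkos version.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical bookkeeping for this sketch (not part of any Kokkos API).
static std::uint64_t toy_logger_next_id = 0;
static std::string toy_logger_last_name;  // most recently started kernel

// Accessor used only to inspect the sketch's state.
inline const char* toy_logger_last_kernel() {
  return toy_logger_last_name.c_str();
}

// Called once when the runtime loads the tool library.
extern "C" void kokkosp_init_library(const int load_seq,
                                     const std::uint64_t interface_ver,
                                     const std::uint32_t /*dev_info_count*/,
                                     void* /*device_info*/) {
  std::printf("toy-logger: init (sequence %d, interface %llu)\n", load_seq,
              static_cast<unsigned long long>(interface_ver));
}

// Called before each parallel_for launch; the tool assigns the kernel id.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const std::uint32_t dev_id,
                                           std::uint64_t* kernel_id) {
  *kernel_id = toy_logger_next_id++;
  toy_logger_last_name = name;
  std::printf("toy-logger: begin parallel_for \"%s\" (device %u, id %llu)\n",
              name, static_cast<unsigned>(dev_id),
              static_cast<unsigned long long>(*kernel_id));
  std::fflush(stdout);  // flush so the line survives a later abort
}

// Called after the kernel with the given id has finished.
extern "C" void kokkosp_end_parallel_for(const std::uint64_t kernel_id) {
  std::printf("toy-logger: end parallel_for (id %llu)\n",
              static_cast<unsigned long long>(kernel_id));
  std::fflush(stdout);
}

// Called once at shutdown.
extern "C" void kokkosp_finalize_library() {
  std::printf("toy-logger: finalize\n");
}
```

Because every line is flushed as it is written, the last kernel names in the log before the abort message narrow down where in the run the offending `modify` call happens, without needing an interactive debugger under MPI.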

@cgcgcg
Contributor

cgcgcg commented Apr 23, 2020

@bartlettroscoe Looks like we fixed this. Can you confirm?

@bartlettroscoe
Member Author

@bartlettroscoe Looks like we fixed this. Can you confirm?

@cgcgcg, what does CDash show through the link above in the section "Current Status on CDash"?

@cgcgcg
Contributor

cgcgcg commented Apr 23, 2020

Looks like all the tests that fail are system errors.

@cgcgcg cgcgcg closed this as completed Apr 23, 2020
@bartlettroscoe
Member Author

As shown in this query, if you filter out the random failures, you can see there have been no failures of these tests since testing day 2020-04-22. I guess that was due to the fix in PR #7227, merged on 2020-04-22?

@cgcgcg
Contributor

cgcgcg commented Apr 24, 2020

Yep, it looks like we got it :-)

@bartlettroscoe
Member Author

bartlettroscoe commented Jun 6, 2020

@rmmilewi

FYI: Below is a sample of the comment that will be added automatically as part of the work in #3887. (Hopefully the first version will be deployed soon.)


Test results for issue #6882 as of YYYY-MM-DD

Tests with issue trackers Passed: twip=27
Tests with issue trackers Missing: twim=9

Detailed test results:

Tests with issue trackers Passed: twip=27

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_Maxwell3D-Tpetra_MPI_4 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_Maxwell3D-Tpetra_MPI_4 | Passed | Completed | 27 | 0 | 27 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_UnitTestsIntrepid2Tpetra_MPI_1 | Passed | Completed | 10 | 0 | 10 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_UnitTestsIntrepid2Tpetra_MPI_1 | Passed | Completed | 10 | 0 | 10 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_UnitTestsIntrepid2Tpetra_MPI_1 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_UnitTestsIntrepid2Tpetra_MPI_1 | Passed | Completed | 27 | 0 | 27 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_UnitTestsIntrepid2Tpetra_MPI_4 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_UnitTestsIntrepid2Tpetra_MPI_4 | Passed | Completed | 27 | 0 | 27 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_UnitTestsTpetra_MPI_1 | Passed | Completed | 10 | 0 | 10 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_UnitTestsTpetra_MPI_1 | Passed | Completed | 9 | 1 | 9 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_UnitTestsTpetra_MPI_1 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_UnitTestsTpetra_MPI_1 | Passed | Completed | 27 | 0 | 27 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_UnitTestsTpetra_MPI_4 | Passed | Completed (Completed) | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_UnitTestsTpetra_MPI_4 | Passed | Completed (Completed) | 27 | 0 | 27 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_VarDofDriver_MPI_1 | Passed | Completed | 10 | 0 | 10 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_VarDofDriver_MPI_1 | Passed | Completed | 10 | 0 | 10 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_VarDofDriver_MPI_1 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_VarDofDriver_MPI_1 | Passed | Completed | 27 | 0 | 27 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_VarDofDriver_MPI_2 | Passed | Completed | 4 | 2 | 8 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | MueLu_VarDofDriver_MPI_2 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | MueLu_VarDofDriver_MPI_2 | Passed | Completed | 27 | 0 | 27 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1 | Passed | Completed | 10 | 0 | 10 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1 | Passed | Completed | 9 | 1 | 9 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_1 | Passed | Completed | 27 | 0 | 27 | #6882 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4 | Passed | Completed | 25 | 0 | 25 | #6882 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4 | Passed | Completed | 27 | 0 | 27 | #6882 |

Tests with issue trackers Missing: twim=9

| Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|---|---|---|---|---|---|---|---|---|
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_Maxwell3D-Tpetra_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 3 | 7 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_Maxwell3D-Tpetra_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 4 | 6 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_UnitTestsIntrepid2Tpetra_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 3 | 7 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_UnitTestsIntrepid2Tpetra_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 4 | 6 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | MueLu_UnitTestsTpetra_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 3 | 7 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_UnitTestsTpetra_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 4 | 6 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | MueLu_VarDofDriver_MPI_2 | Missing / Failed | Completed (Failed) | 0 | 4 | 6 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 3 | 7 | #6882 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_MPI_4 | Missing / Failed | Completed (Failed) | 0 | 4 | 6 | #6882 |
