
Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting before 2019-05-15 #5179

Closed
fryeguy52 opened this issue May 15, 2019 · 132 comments
Labels
  • ATDM Sev: Blocker - Problems that make Trilinos unfit to be adopted by one or more ATDM APPs
  • client: ATDM - Any issue primarily impacting the ATDM project
  • impacting: tests - The defect (bug) is primarily a test failure (vs. a build failure)
  • PA: Data Services - Issues that fall under the Trilinos Data Services Product Area
  • PA: Discretizations - Issues that fall under the Trilinos Discretizations Product Area
  • pkg: Amesos2, pkg: Panzer, pkg: Tpetra, pkg: Xpetra
  • stage: in review - Primary work is completed and now is just waiting for human review and/or test feedback
  • type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@fryeguy52
Contributor

fryeguy52 commented May 15, 2019

Bug Report

CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solver Data Services), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

Since PR #5346 (merged 6/7/2019) fixed a file read/write race in the test, there has been only one Panzer test failure on any ATDM Trilinos platform as of 6/11/2019 that looks to be related. Also, on 6/11/2019 @bathmatt reported that EMPIRE is not failing in a similar way in his recent tests. Next: Watch results over the next few weeks to see if more random failures like this occur ...

Description

As shown in this query the tests:

  • PanzerAdaptersSTK_MixedPoissonExample
  • PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-1

are failing in the build:

  • Trilinos-atdm-waterman-cuda-9.2-opt

Additionally the test:

  • PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-2

is failing in a different build on the same machine:

  • Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
New commits on 2019-05-14:
*** Base Git Repo: Trilinos
7b6d69a:  Merge remote-tracking branch 'origin/develop' into atdm-nightly
Author: Roscoe A. Bartlett <[email protected]>
Date:   Mon May 13 21:05:15 2019 -0600

085e9d8:  Merge Pull Request #5138 from trilinos/Trilinos/zoltan_fix5106
Author: trilinos-autotester <[email protected]>
Date:   Mon May 13 18:57:23 2019 -0600

238800a:  Merge pull request #5163 from kyungjoo-kim/fix-5148
Author: kyungjoo-kim <[email protected]>
Date:   Mon May 13 15:05:45 2019 -0600

7b827c7:  Tpetra: resolution to #5161 (#5162)
Author: Tim Fuller <[email protected]>
Date:   Mon May 13 14:57:31 2019 -0600

D	packages/tpetra/core/src/Tpetra_Experimental_BlockMultiVector.cpp

925a0a7:  Ifpack2 - fix for #5148
Author: Kyungjoo Kim <[email protected]>
Date:   Mon May 13 11:52:27 2019 -0600

M	packages/ifpack2/src/Ifpack2_BlockTriDiContainer_impl.hpp

a847648:  Made even bigger.
Author: K. Devine <[email protected]>
Date:   Wed May 8 10:48:35 2019 -0600

M	packages/zoltan/src/driver/dr_main.c
M	packages/zoltan/src/driver/dr_mainCPP.cpp

9051d9f:  zoltan:  minor change to fix #5106
Author: K. Devine <[email protected]>
Date:   Wed May 8 10:38:59 2019 -0600

M	packages/zoltan/src/driver/dr_main.c
M	packages/zoltan/src/driver/dr_mainCPP.cpp

Current Status on CDash

Results for the current testing day

Steps to Reproduce

One should be able to reproduce this failure on waterman as described in:

More specifically, the commands given for waterman are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-waterman-cuda-9.2-opt
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Panzer=ON \
 $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -n 20 ctest -j20
@fryeguy52 fryeguy52 added type: bug The primary issue is a bug in Trilinos code or tests pkg: Panzer client: ATDM Any issue primarily impacting the ATDM project ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs PA: Discretizations Issues that fall under the Trilinos Discretizations Product Area labels May 15, 2019
@rppawlo
Contributor

rppawlo commented May 15, 2019

Panzer has not changed recently. According to EMPIRE testing (thanks @jmgate!), these are the candidate commits that could have caused the new Panzer failure.

* b3a8dc8 ([email protected]) Mon May 13 21:28:33 2019 -0600
| Merge pull request #5167 from kyungjoo-kim/ifpack2-develop
| Ifpack2 develop
* 085e9d8 ([email protected]) Mon May 13 18:57:23 2019 -0600
| Merge Pull Request #5138 from trilinos/Trilinos/zoltan_fix5106
| PR Title: zoltan: minor change to fix #5106
| PR Author: kddevin
* 238800a ([email protected]) Mon May 13 15:05:45 2019 -0600
| Merge pull request #5163 from kyungjoo-kim/fix-5148
| Ifpack2 - fix for #5148
* 7b827c7 ([email protected]) Mon May 13 14:57:31 2019 -0600
  Tpetra: resolution to #5161 (#5162)

@bartlettroscoe
Member

@kyungjoo-kim, @tjfulle, do you know if your recent commits listed above might have caused this?

@kyungjoo-kim
Contributor

@rppawlo @bartlettroscoe No, my commits are only intended for ifpack2 blocktridicontainer. The commits are not related to Panzer.

@bartlettroscoe
Member

Does Panzer use Ifpack2?

@kyungjoo-kim
Contributor

Panzer may use other ifpack2 components but it does not use blocktridicontainer solver. The solver I am working on is only used by SPARC.

@mhoemmen
Contributor

@rppawlo and I talked about this over e-mail. The issue is that Trilinos does not yet work correctly when deprecated Tpetra code is disabled (Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF). See e.g., the following issues:

@trilinos/tpetra is working on fixing these. The work-around for now is to not disable deprecated code.
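As a sketch, the work-around above could be made explicit in a CMake cache fragment (a hypothetical fragment; the option name is taken from the comment above, and passing `-DTpetra_ENABLE_DEPRECATED_CODE:BOOL=ON` on the configure line would have the same effect):

```cmake
# Keep deprecated Tpetra code enabled until the tracked issues are fixed.
# (Hypothetical cache fragment illustrating the work-around.)
set(Tpetra_ENABLE_DEPRECATED_CODE ON CACHE BOOL
    "Work around failures seen when deprecated Tpetra code is disabled")
```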

@bathmatt
Contributor

@mhoemmen correct me if I'm wrong, but these failures don't have Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF set.

@rppawlo
Contributor

rppawlo commented May 15, 2019

The deprecated code is enabled, as @bathmatt mentioned, so these errors from the tests are something different. One test shows:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

While the two other failures show:

terminate called after throwing an instance of 'std::runtime_error'
  what():  /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3958:

Throw number = 1

Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)

Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 1) When converting column indices from global to local, we encountered 462 indices that do not live in the column Map on this process.  That's too many to print.

[waterman2:05756] *** Process received signal ***

@mhoemmen - are there any changes to Tpetra in the last 2 days that might trigger this?

@mhoemmen
Contributor

I don't think so, but it's possible. @trilinos/tpetra

For debugging, to see if this is a Panzer issue, we could adjust that print threshold temporarily.

@mhoemmen
Contributor

Try also setting the environment variable TPETRA_DEBUG=1. In the worst case, we could also set TPETRA_VERBOSE=CrsGraph.
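A sketch of how those variables would be set before re-running a failing test (the `ctest` invocation at the end is illustrative; substitute the actual failing test's name):

```shell
# Enable Tpetra's extra run-time debug checking; both variables are read
# from the environment at run time.
export TPETRA_DEBUG=1
# export TPETRA_VERBOSE=CrsGraph   # heavier: per-class verbose CrsGraph logging

echo "TPETRA_DEBUG=${TPETRA_DEBUG}"

# Then re-run the failing test, e.g. (illustrative):
# ctest -R PanzerAdaptersSTK_MixedPoissonExample --output-on-failure
```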

@bathmatt
Contributor

I'm now seeing the second error Roger mentioned in EMPIRE with the EMPlasma Trilinos. So this isn't a new bug; it looks like an older bug that is starting to pop up more often.

@bathmatt
Contributor

My statement might be incorrect; I wiped everything clean and it looks like it isn't popping up anymore.

@rppawlo
Contributor

rppawlo commented May 20, 2019

After rebuilding from scratch, this looks like the test parallel level is too high and the CUDA card is running out of memory with multiple tests hitting the same card. In the steps to reproduce, the tests are run with "ctest -j20". I could not reproduce the errors running the tests manually or when the ctest parallel level was reduced. I think we run the other CUDA machines at -j8. Maybe we need to do that here also?

@bartlettroscoe bartlettroscoe changed the title Panzer: PanzerAdaptersSTK_MixedPoisson tests failing in ATDM builds on waterman Panzer: PanzerAdaptersSTK_MixedPoisson tests failing in ATDM opt builds on waterman May 20, 2019
@bartlettroscoe
Member

@rppawlo, looking at the Jenkins driver at:

it shows:

ATDM_CONFIG_CTEST_PARALLEL_LEVEL=8

Therefore, it is running them with ctest -j8.

But that may be too much for some of these Panzer tests?

@rppawlo
Contributor

rppawlo commented May 20, 2019

But that may be too much for some of these Panzer tests?

I think that is ok. The instructions at the top of this ticket have -j20, so I assumed that is what the tests were running. With -j20 I see a bunch of failures. With -j8 nothing fails. Do the atdm build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.

@bartlettroscoe
Member

@rppawlo asked:

Do the atdm build scripts wipe the build directory? Some of the reported failures went away for both Matt and I with a clean build.

By default all of the nightly ATDM Trilinos builds build from scratch each day. We can look on Jenkins to see if that is the case to be sure. For example, at:

it shows:

09:30:08 Cleaning out binary directory '/home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/BUILD' ...

and does not show any errors so I would assume that it is blowing away the directories.

@fryeguy52 fryeguy52 changed the title Panzer: PanzerAdaptersSTK_MixedPoisson tests failing in ATDM opt builds on waterman Panzer: PanzerAdaptersSTK_MixedPoisson tests failing in ATDM cuda 9.2 builds May 28, 2019
@fryeguy52
Contributor Author

It looks like these are also failing on non-waterman builds. There are 74 failing PanzerAdaptersSTK_* tests across 4 builds between 2019-05-01 and 2019-05-29, shown here

Note that the above link filters out builds on white and ride, because we have seen a lot of failures on those machines recently, but these tests may be failing there too: failures on white/ride in the last 2 weeks

current 2 week history of failing PanzerAdaptersSTK* tests

@rppawlo
Contributor

rppawlo commented May 29, 2019

All failures are in CUDA builds using the Tpetra deprecated dynamic profile. I've tracked the multiblock test failure to a separate issue and will push a fix shortly.

The majority of the random errors look to be in the fillComplete on the CrsMatrix. I have not had good luck reproducing in raw panzer tests. EMPIRE is also seeing similar failures and @bathmatt was able to get the following stack trace:

#0  0x0000000017f868bc in std::enable_if<Kokkos::is_view<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > >::value&&Kokkos::is_view<Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >::value, void>::type Tpetra::Distributor::doPosts<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >(Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, Teuchos::ArrayView<unsigned long const> const&, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Teuchos::ArrayView<unsigned long const> const&) ()
#1  0x0000000017f87420 in std::enable_if<Kokkos::is_view<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > >::value&&Kokkos::is_view<Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >::value, void>::type Tpetra::Distributor::doPostsAndWaits<Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> >(Kokkos::View<long long const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, Teuchos::ArrayView<unsigned long const> const&, Kokkos::View<long long*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Teuchos::ArrayView<unsigned long const> const&) ()
#2  0x0000000017f8ce58 in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransferNew(Tpetra::SrcDistObject const&, Tpetra::CombineMode, unsigned long, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, void, void> const&, Tpetra::Distributor&, Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::ReverseOption, bool, bool) ()
#3  0x0000000017f73bc0 in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doTransfer(Tpetra::SrcDistObject const&, Tpetra::Details::Transfer<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const&, char const*, Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::ReverseOption, Tpetra::CombineMode, bool) ()
#4  0x0000000017f6fd1c in Tpetra::DistObject<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::doExport(Tpetra::SrcDistObject const&, Tpetra::Export<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const&, Tpetra::CombineMode, bool) ()
#5  0x0000000016f3ff34 in Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::globalAssemble() ()
#6  0x0000000016f40d90 in Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const> const&, Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> > const> const&, Teuchos::RCP<Teuchos::ParameterList> const&) ()
#7  0x00000000130b5aa4 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::buildTpetraGraph(int, int) const ()
#8  0x00000000130cf5d0 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::getGraph(int, int) const ()
#9  0x00000000130ba304 in panzer::BlockedTpetraLinearObjFactory<panzer::Traits, double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::getTpetraMatrix(int, int) const ()
#10 0x0000000012fe0430 in panzer::L2Projection<int, long long>::buildMassMatrix(bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > > const*) ()

The failures occur in different ways - maybe a race condition? Sometimes we see a raw seg fault and sometimes we get the following two different errors reported from tpetra:

terminate called after throwing an instance of 'std::runtime_error'
  what():  /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:3958:

Throw number = 1

Throw test that evaluated to true: (makeIndicesLocalResult.first != 0)

Tpetra::CrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::fillComplete: (Process 3) When converting column indices from global to local, we encountered 72 indices that does not live in the column Map on this process.
(Process 3) Here are the bad global indices, listed by local row: 
(Process 3)  Local row 262 (global row 558): [550,551,560,561,666,667,668,669,670]
(Process 3)  Local row 263 (global row 559): [550,551,560,561,666,667,668,669,670]
(Process 3)  Local row 264 (global row 570): [562,563,572,573,686,687,688,689,690]
(Process 3)  Local row 265 (global row 571): [562,563,572,573,686,687,688,689,690]
(Process 3)  Local row 266 (global row 582): [574,575,584,585,706,707,708,709,710]
(Process 3)  Local row 267 (global row 583): [574,575,584,585,706,707,708,709,710]
(Process 3)  Local row 270 (global row 606): [598,599,608,609,746,747,748,749,750]
(Process 3)  Local row 271 (global row 607): [598,599,608,609,746,747,748,749,750]
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 3!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 0!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ascicgpu14 and rank 1!
terminate called after throwing an instance of 'std::runtime_error'
  what():  View bounds error of view MV::DualView ( -1 < 297 , 0 < 1 )
Traceback functionality not available

[ascicgpu14:80428] *** Process received signal ***
[ascicgpu14:80428] Signal: Aborted (6)
[ascicgpu14:80428] Signal code:  (-6)
[ascicgpu14:80428] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f0ea620a5d0]
[ascicgpu14:80428] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f0ea55b7207]
[ascicgpu14:80428] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f0ea55b88f8]
[ascicgpu14:80428] [ 3] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125)[0x7f0ea5efa695]
[ascicgpu14:80428] [ 4] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f316)[0x7f0ea5ef8316]
[ascicgpu14:80428] [ 5] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f361)[0x7f0ea5ef8361]
[ascicgpu14:80428] [ 6] /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/lib64/libstdc++.so.6(+0x8f614)[0x7f0ea5ef8614]
[ascicgpu14:80428] [ 7] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/kokkos/core/src/libkokkoscore.so.12(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x369)[0x7f0ea7e8c809]
[ascicgpu14:80428] [ 8] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE22buildTaggedMultiVectorERKNS1_18ElementBlockAccessE+0xb7b)[0x7f0ee9a45e0b]
[ascicgpu14:80428] [ 9] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE19buildGlobalUnknownsERKN7Teuchos3RCPIKNS_12FieldPatternEEE+0x2ac)[0x7f0ee9a4838c]
[ascicgpu14:80428] [10] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/dof-mgr/src/libpanzer-dof-mgr.so.12(_ZN6panzer10DOFManagerIixE19buildGlobalUnknownsEv+0x245)[0x7f0ee9a4b235]
[ascicgpu14:80428] [11] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/disc-fe/src/libpanzer-disc-fe.so.12(_ZNK6panzer17DOFManagerFactoryIixE24buildUniqueGlobalIndexerINS_10DOFManagerIixEEEEN7Teuchos3RCPINS_19UniqueGlobalIndexerIixEEEERKNS6_IKNS5_13OpaqueWrapperIP19ompi_communicator_tEEEERKSt6vectorINS6_INS_12PhysicsBlockEEESaISK_EERKNS6_INS_11ConnManagerEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x642)[0x7f0eead9dd22]
[ascicgpu14:80428] [12] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/disc-fe/src/libpanzer-disc-fe.so.12(_ZNK6panzer17DOFManagerFactoryIixE24buildUniqueGlobalIndexerERKN7Teuchos3RCPIKNS2_13OpaqueWrapperIP19ompi_communicator_tEEEERKSt6vectorINS3_INS_12PhysicsBlockEEESaISE_EERKNS3_INS_11ConnManagerEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x9)[0x7f0eead9e449]
[ascicgpu14:80428] [13] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/CurlLaplacianExample/PanzerAdaptersSTK_CurlLaplacianExample.exe[0x47aae0]
[ascicgpu14:80428] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0ea55a33d5]
[ascicgpu14:80428] [15] /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/CurlLaplacianExample/PanzerAdaptersSTK_CurlLaplacianExample.exe[0x47c768]
[ascicgpu14:80428] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 80428 on node ascicgpu14 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

--------------------------------------------------------------------------------

TEST_2: Return code = 134
TEST_2: Pass criteria = Match REGEX {ALL PASSED: Tpetra} [FAILED]
TEST_2: Result = FAILED

================================================================================

@mhoemmen and @tjfulle - have there been any changes recently to tpetra that might cause these kinds of errors?

@mhoemmen
Contributor

Tpetra changes to that code haven't landed yet. Try a debug build, and set the environment variable TPETRA_DEBUG to 1. If that doesn't help, set the environment variable TPETRA_VERBOSE to 1.

@bathmatt
Contributor

I'm trying that now. I have a test that fails in an opt build 3 out of 4 runs; in debug, however, it hasn't failed yet. Maybe this flag will show something.

@bathmatt
Contributor

I have the log files for two failed runs. They are different; nothing is reported, but it looks like a CUDA memory error. Should I rerun with verbose?

output2.txt
output.txt

@mhoemmen
Contributor

@jjellio wrote:

The gain from host pinned probably doesn't outweigh the complexity of using it.

I made buffer_memory_space = CudaHostPinnedSpace build correctly a while back in Tpetra. Kyungjoo's solver can use it too. It's not hard for Trilinos to maintain, but it's annoying that CUDA-aware MPI can and does work on other systems but not here.

@jjellio
Contributor

jjellio commented Aug 21, 2019

Let's take this conversation to:
https://gitlab-ex.sandia.gov/jhu/TrilinosSolverPerformance/issues/42#note_62537

I don't want to distract from this problem. The link above has info on how to run with CUDA on our machines, as well as on LLNL ones, with CUDA enabled or without.

@bathmatt
Contributor

@nmhamster @jjellio @jhux2 this gets my run past the MPI error; I ran all the way to the end with this flag. Not sure what it does. We don't have CUDA-direct MPI enabled.

@nmhamster
Contributor

@bathmatt - I think it changes the memory regions where MPI/network has visibility. So this fixes your issue completely based on what I think you have above?

@bathmatt
Contributor

bathmatt commented Aug 22, 2019 via email

mhoemmen pushed a commit that referenced this issue Aug 27, 2019
atrocities to make CUDA error in github issue #5179 reproducible.

now even worse
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Sep 4, 2019
@trilinos/tpetra

This is part of trilinos#5179 debugging.
mhoemmen added a commit that referenced this issue Sep 5, 2019
Tpetra: More permanent fixes for FixedHashTable issue in #5179
@bartlettroscoe
Member

FYI: Looking at this query there were 35 failures of Panzer tests beginning with PanzerAdaptersSTK_MixedPoissonExample from 8/1/2019 through 10/2/2019. But once you filter out known random system errors (system flakiness):

  • Regex: srun: error: Slurm job .* has expired
  • Regex: cudaGetDeviceCount.*cudaErrorUnknown
  • Regex: cudaGetDeviceCount.*cudaErrorInsufficientDriver
  • String: "Open MPI was unable to obtain the username in order to create a path"

in this query, you see there are zero other failures.

Therefore, I don't know that we are seeing these Panzer tests failing anymore, and I think we can remove tracking of these Panzer tests.

@rppawlo
Contributor

rppawlo commented Oct 3, 2019

We probably should leave them in. There is an underlying issue and it can affect panzer. The team has a good idea of what it is, but fixing may take a bit still.

@bartlettroscoe bartlettroscoe changed the title Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting ??? Dec 11, 2019
@bartlettroscoe
Member

FYI: PR #6425 disabled these Panzer tests for the build:

  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug

@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Dec 12, 2019
@bartlettroscoe bartlettroscoe changed the title Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting ??? Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting before 2019-05-15 Jan 3, 2020
@bartlettroscoe
Member

@trilinos/tpetra, @trilinos/xpetra, @trilinos/amesos2, @trilinos/muelu, @trilinos/panzer, @bathmatt,

FYI: According to this query, in the last 90 days there was just one failing test that showed View bounds error in the output which was:

Site: sems-rhel7
Build Name: Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
Test Name: TpetraCore_MatrixMatrix_UnitTests_MPI_4
Status: Failed (Completed (Failed)), Time: 15s 260ms

whose output showed:

3. Tpetra_MatMat_double_int_longlong_Kokkos_Compat_KokkosCudaWrapperNode_threaded_add_sorted_UnitTest ... :0: : block: [13,0,0], thread: [0,44,0] Assertion `View bounds error of view C colinds` failed.
:0: : block: [13,0,0], thread: [0,49,0] Assertion `View bounds error of view C colinds` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorAssert): device-side assert triggered /scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available

[ascicgpu14:33900] *** Process received signal ***
[ascicgpu14:33900] Signal: Aborted (6)
[ascicgpu14:33900] Signal code:  (-6)

That is just one test failure in the last 3 months.

Has this issue finally been resolved? Can we close this Issue?

If not, why are we not seeing any more of these failures in the last 2.5 months? Is this just dumb luck? (I wish we had kept clear statistics for how frequent these failures were before so we had something to compare against.)

@kddevin
Contributor

kddevin commented Jan 15, 2020

@bartlettroscoe I think this is the elusive so-called "UVM error." It has not been resolved. I'm going with dumb luck for the recent successes, or maybe some sort of system upgrade. It is interesting that the error has not appeared for 2.5 months; thanks for tracking it.

@bartlettroscoe
Member

FYI: Just heard from @crtrott yesterday that a reproducer was created for this with help from Stan Moore with EMPIRE that allowed the defect to be discovered and fixed. A modified debug-enabled EMPIRE executable was able to trigger it about 50% of the time after several invocations (don't remember the actual number of invocations).

This turned out to be a defect in Kokkos. The fix was just merged to Trilinos 'develop' in PR #6627 and was merged to the Kokkos release and 'develop' branches. Stan verified that the problem was fixed with EMPIRE.

To make a long story short, a race condition existed between different threads on the GPU computing a scan, which would (in very rare cases) produce the wrong result. The fix was to put in a fence to avoid the race. The reason this was triggering errors like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  View bounds error of view MV::DualView ( -1 < 297 , 0 < 1 )

was that a bad integer index was being computed, which was invalid for creating the view.

So it would appear that the native Trilinos test suite was in fact demonstrating the defect! So it would seem that if someone had run one of these failing Trilinos tests hundreds or thousands of times, then they might have been able to reproduce the failures.

The reason why we saw less of these failures in the last 90 days is unclear. Perhaps other code was changed that lessened how often this was triggered?

So it would appear that this was NOT a UVM bug!!!!!

@bartlettroscoe
Member

I will put this issue in review for now and then we can close after I discuss this with a few people.

@bartlettroscoe bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Jan 22, 2020
@bartlettroscoe
Member

FYI: As shown in comments above, this random failure likely occurred not just in Power9+GPU builds but also in x86+CUDA builds, because we saw these view bounds errors in tests in the build Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug, for example.

@bartlettroscoe
Member

@bathmatt reports this is fixed so we can close. Yea!
