Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting before 2019-05-15 #5179
Comments
Panzer has not changed recently. According to EMPIRE testing (thanks @jmgate!), these are the candidate commits that could have caused the new Panzer failure.
@kyungjoo-kim, @tjfulle, do you know if your recent commits listed above might have caused this?
@rppawlo @bartlettroscoe No, my commits are only intended for the Ifpack2 BlockTriDiContainer. The commits are not related to Panzer.
Does Panzer use Ifpack2?
Panzer may use other Ifpack2 components, but it does not use the BlockTriDiContainer solver. The solver I am working on is only used by SPARC.
@rppawlo and I talked about this over e-mail. The issue is that Trilinos does not yet work correctly when deprecated Tpetra code is disabled; @trilinos/tpetra is working on fixing these issues. The work-around for now is not to disable deprecated code.
@mhoemmen correct me if I'm wrong, but these failures don't have Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF set.
The deprecated code is enabled as @bathmatt mentioned. So the errors from the tests are different. One test shows:
While the two other failures show:
@mhoemmen - are there any changes to Tpetra in the last 2 days that might trigger this?
I don't think so, but it's possible. @trilinos/tpetra For debugging, to see if this is a Panzer issue, we could adjust that print threshold temporarily.
Try also setting the environment variable
I'm seeing the second error Roger mentioned in EMPIRE now with the EMPlasma Trilinos. So this isn't a new bug; it looks like it is an older bug that is starting to pop up more often.
My statement might be incorrect: I wiped everything clean and it looks like it isn't popping up anymore.
After rebuilding from scratch, this looks like the parallel test level is too high and the CUDA card is running out of memory with multiple tests hitting the same card. In the steps to reproduce, the tests are run with "cmake -j20". I could not reproduce the errors running the tests manually or when the cmake parallel level was reduced. I think we run the other CUDA machines at -j8. Maybe we need to do that here also?
@rppawlo, looking at the Jenkins driver at: it shows:
Therefore, it is running them with that setting. But that may be too much for some of these Panzer tests?
I think that is ok. The instructions at the top of this ticket have -j20, so I assumed that is what the tests were running. With -j20 I see a bunch of failures; with -j8 nothing fails. Do the ATDM build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.
@rppawlo asked:
By default, all of the nightly ATDM Trilinos builds build from scratch each day. We can look on Jenkins to be sure that is the case. For example, at: it shows:
and does not show any errors, so I would assume that it is blowing away the directories.
It looks like these are also failing on non-waterman builds. There are 74 failing PanzerAdaptersSTK_* tests across 4 builds between 2019-05-01 and 2019-05-29, shown here. Note that the above link filters out builds on white and ride because we have seen a lot of failures on those machines recently, but these tests may be failing there too: failures on white/ride in the last 2 weeks
All failures are in CUDA builds using the Tpetra deprecated dynamic profile. I've tracked the multiblock test failure to a separate issue and will push a fix shortly. The majority of the random errors look to be in fillComplete on the CrsMatrix. I have not had good luck reproducing them in raw Panzer tests. EMPIRE is also seeing similar failures, and @bathmatt was able to get the following stack trace:
The failures occur in different ways - maybe a race condition? Sometimes we see a raw seg fault, and sometimes we get the following two different errors reported from Tpetra:
@mhoemmen and @tjfulle - have there been any changes recently to Tpetra that might cause these kinds of errors?
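For context, here is a minimal, generic sketch of the Tpetra assembly pattern these tests exercise: insert entries into a CrsMatrix and then call fillComplete(), which is where the random errors were reported. This is standard tutorial-style usage with default template parameters; the sizes, the diagonal fill, and all names here are illustrative assumptions, not the actual Panzer code.

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_Map.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_Tuple.hpp>

int main(int argc, char* argv[]) {
  // ScopeGuard initializes and finalizes MPI and Kokkos.
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using map_type    = Tpetra::Map<>;
    using matrix_type = Tpetra::CrsMatrix<>;  // default Scalar/Ordinal/Node
    using GO          = map_type::global_ordinal_type;

    auto comm = Tpetra::getDefaultComm();
    const Tpetra::global_size_t numGlobalRows = 100;
    auto rowMap = Teuchos::rcp(new map_type(numGlobalRows, 0, comm));

    // Assemble a simple diagonal matrix: at most one entry per row.
    matrix_type A(rowMap, 1);
    for (GO gblRow = rowMap->getMinGlobalIndex();
         gblRow <= rowMap->getMaxGlobalIndex(); ++gblRow) {
      A.insertGlobalValues(gblRow,
                           Teuchos::tuple<GO>(gblRow),    // column index
                           Teuchos::tuple<double>(2.0));  // value
    }

    // fillComplete() builds the local (Kokkos) data structures; this is
    // the call the randomly failing tests were reported to be inside.
    A.fillComplete();
  }
  return 0;
}
```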
Tpetra changes to that code haven't landed yet. Try a debug build, and set the environment variable
I'm trying that now. I have a test that fails with optimized code in 3 out of 4 runs; however, in a debug build it hasn't failed yet. Maybe this flag will show something.
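As background on why a debug build helps here: with bounds checking enabled, Kokkos turns an out-of-range View access into an immediate, labeled abort instead of silent corruption. Below is a minimal sketch (not one of the failing tests); the option name Kokkos_ENABLE_DEBUG_BOUNDS_CHECK is the recent CMake spelling and may differ by Kokkos version.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> x("x", 10);

    // Pass any command-line argument to deliberately walk one element
    // past the end of x.  In a debug build with bounds checking enabled,
    // Kokkos aborts right away with a view bounds error that names the
    // view label and the bad index.  In an optimized build the same
    // access may appear to work, corrupt memory, or segfault somewhere
    // else entirely, which is part of why these failures were so hard
    // to reproduce.
    const int n = (argc > 1) ? 11 : 10;

    Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
      x(i) = static_cast<double>(i);  // i == 10 is out of bounds for x
    });
    Kokkos::fence();

    std::printf("done\n");
  }
  Kokkos::finalize();
  return 0;
}
```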
I have the log files from two failed runs; they are different. Nothing is reported, but it looks like a CUDA memory error. Should I rerun with verbose?
@jjellio wrote:
I made
Let's take this conversation to: I don't want to distract from this problem. The link above has info on how to run with CUDA on our machines, as well as LLNL ones, with CUDA enabled or without.
@nmhamster @jjellio @jhux2 this gets my run past the MPI error; I ran all the way to the end with this flag. Not sure what it does. We don't have CUDA direct MPI enabled.
@bathmatt - I think it changes the memory regions where MPI/network has visibility. So this fixes your issue completely, based on what I think you have above?
Si, sorry, didn’t follow that
atrocities to make CUDA error in github issue #5179 reproducible. now even worse
@trilinos/tpetra This is part of trilinos#5179 debugging.
Tpetra: More permanent fixes for FixedHashTable issue in #5179
FYI: Looking at this query there were 35 failures of Panzer tests beginning with
In this query, you see there are zero other failures. Therefore, I don't know that we are seeing these Panzer tests failing anymore, and I think we can remove tracking of these Panzer tests.
We should probably leave them in. There is an underlying issue and it can affect Panzer. The team has a good idea of what it is, but fixing it may take a bit still.
FYI: PR #6425 disabled these Panzer tests for the build:
@trilinos/tpetra, @trilinos/xpetra, @trilinos/amesos2, @trilinos/muelu, @trilinos/panzer, @bathmatt, FYI: According to this query, in the last 90 days there was just one failing test that showed
whose output showed:
That is just one test failure in the last 3 months. Has this issue finally been resolved? Can we close this Issue? If not, why are we not seeing any more of these failures in the last 2.5 months? Is this just dumb luck? (I wish we had kept clear statistics for how frequent these failures were before so we had something to compare against.)
@bartlettroscoe I think this is the elusive so-called "UVM error." It has not been resolved. I'm going with dumb luck for the recent successes, or maybe some sort of system upgrade. It is interesting that the error has not appeared for 2.5 months; thanks for tracking it.
FYI: I just heard from @crtrott yesterday that a reproducer was created for this, with help from Stan Moore using EMPIRE, which allowed the defect to be discovered and fixed. A modified debug-enabled EMPIRE executable was able to trigger it about 50% of the time after several invocations (I don't remember the actual number of invocations). This turned out to be a defect in Kokkos. The fix was just merged to Trilinos 'develop' in PR #6627 and was merged to the Kokkos release and 'develop' branches. Stan verified that the problem was fixed with EMPIRE. To make a long story short, a race condition existed between different threads on the GPU computing the scan result that would (in very rare cases) compute the wrong result. The fix was to put in a fence to avoid the race. The reason this was triggering errors like:
was that a bad integer index was being computed that was invalid for creating the view. So it would appear that the native Trilinos test suite was in fact demonstrating the defect! If someone had run one of these failing Trilinos tests hundreds or thousands of times, they might have been able to reproduce the failures. Why we saw fewer of these failures in the last 90 days is unclear; perhaps other code was changed that lessened how often this was triggered. So it would appear that this was NOT a UVM bug!
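To make the failure pattern concrete, here is a hypothetical sketch (not the actual Kokkos defect, which was internal to the CUDA parallel_scan implementation) of how a scan result that is read before the device has finished computing it can hand a garbage extent to a View constructor. The UVM memory space and all names and sizes here are illustrative assumptions.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
#if defined(KOKKOS_ENABLE_CUDA)
    // UVM memory is directly readable from the host, as in the ATDM CUDA builds.
    using mem_space = Kokkos::CudaUVMSpace;
    const int n = 1000;

    Kokkos::View<int*, mem_space> counts("counts", n);
    Kokkos::deep_copy(counts, 1);

    // Exclusive prefix sum: offsets(i) = sum of counts(0..i-1);
    // offsets(n) ends up holding the total, used below to size an allocation.
    Kokkos::View<int*, mem_space> offsets("offsets", n + 1);
    Kokkos::parallel_scan("build_offsets", n,
      KOKKOS_LAMBDA(const int i, int& update, const bool final_pass) {
        if (final_pass) offsets(i) = update;
        update += counts(i);
        if (final_pass && i == n - 1) offsets(n) = update;
      });

    // The scan launches asynchronously.  Without this fence, the host read
    // of offsets(n) below (through UVM) can observe a stale or partially
    // written value, and that bad extent then goes straight into a View
    // constructor: the same "bad integer index used to create the view"
    // failure described above.
    Kokkos::fence();

    const int total = offsets(n);  // host read through UVM
    Kokkos::View<double*, mem_space> values("values", total);
    std::printf("allocated %d entries\n", total);
#else
    std::printf("This sketch is only meaningful in a CUDA (UVM) build.\n");
#endif
  }
  Kokkos::finalize();
  return 0;
}
```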
I will put this issue in review for now and then we can close after I discuss this with a few people.
FYI: As shown in the comments above, this random failure likely occurred not just in Power9+GPU builds but also in x86+CUDA builds, because we saw these view bounds errors in tests in the builds
@bathmatt reports this is fixed so we can close. Yea!
Bug Report
CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solver Data Services), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Since PR #5346, which fixed a file read/write race in the test, was merged on 6/7/2019, there has been only one failing Panzer test on any ATDM Trilinos platform as of 6/11/2019 that looks to be related. Also, on 6/11/2019 @bathmatt reported that EMPIRE is not failing in a similar way in his recent tests. Next: Watch results over the next few weeks to see if more random failures like this occur ...
Description
As shown in this query the tests:
are failing in the build:
Additionally the test:
is failing in a different build on the same machine:
New commits on 2019-05-14
Current Status on CDash
Results for the current testing day
Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
More specifically, the commands given for waterman are provided at:
The exact commands to reproduce this issue should be: