SEACASExodus tests failing in new cuda 9.2 ATDM build on white/ride #3288

Closed
fryeguy52 opened this issue Aug 13, 2018 · 20 comments
Labels
ATDM Env Issue, client: ATDM, PA: Data Services, pkg: seacas, type: bug

Comments

@fryeguy52
Contributor

fryeguy52 commented Aug 13, 2018

CC: @trilinos/seacas , @kddevin (Trilinos Data Services Product Lead), @gsjaardema , @bartlettroscoe

Next Action Status

PR #3418 which switched from module netcdf-exo/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 to netcdf/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 on white/ride merged on 9/10/2018 and SEACAS tests fully passed on 9/11/2018.

Description

As shown in this query the tests:

  • SEACASExodus_exodus_unit_tests
  • SEACASExodus_exodus_unit_tests_nc5_env

are failing in the builds:

  • Trilinos-atdm-white-ride-cuda-9.2-opt
  • Trilinos-atdm-white-ride-cuda-9.2-debug

The output looks similar to what we were seeing in #2815

================================================================================

TEST_0

Running: "/bin/bash" "/home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-opt/SRC_AND_BUILD/BUILD/packages/seacas/libraries/exodus/test/testall" "netcdf5"

  Writing output to file "/home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-opt/SRC_AND_BUILD/BUILD/packages/seacas/libraries/exodus/test/exodus_unit_tests.out"

--------------------------------------------------------------------------------

testwt - single precision write test...
Exodus Library Warning/Error: [ex_put_name]
	ERROR: element block id 10 not found in file id 65536
testrd - single precision read test...

...

--------------------------------------------------------------------------------

TEST_1: Return code = 1
TEST_1: Pass criteria = Zero return code [FAILED]
TEST_1: Result = FAILED

================================================================================

OVERALL FINAL RESULT: TEST FAILED (SEACASExodus_exodus_unit_tests_nc5_env)

In that issue, SEACASExodus_exodus_unit_tests was also failing on mutrino, and we ended up disabling the test for that build.

Steps to Reproduce

One should be able to reproduce this failure on the machine white as described in:

More specifically, the commands given for the system white are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-opt

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_SEACAS=ON \
 $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16
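
To iterate faster, one could restrict the ctest run to the two failing tests instead of the whole suite. This is a minimal sketch, assuming the same build directory and queue as above; the variable name FAILING_TESTS_REGEX is made up for illustration and is not part of the ATDM scripts.

```shell
# Regex that matches only the two failing tests named in this issue.
# FAILING_TESTS_REGEX is an illustrative name, not part of the ATDM scripts.
FAILING_TESTS_REGEX='^SEACASExodus_exodus_unit_tests(_nc5_env)?$'

# On white/ride this would be passed to ctest's -R (regex filter) option, e.g.:
#   bsub -x -Is -q rhel7F -n 16 ctest -R "$FAILING_TESTS_REGEX" --output-on-failure
echo "ctest -R '$FAILING_TESTS_REGEX' --output-on-failure"
```

The anchors (^ and $) keep the filter from also selecting other tests whose names merely start with the same prefix.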
@fryeguy52 added the labels type: bug, pkg: seacas, client: ATDM on Aug 13, 2018
@bartlettroscoe changed the title from "SEACASExodus tests failing in cuda 9 ATDM builds" to "SEACASExodus tests failing in new cuda 9.2 ATDM build on white/ride" on Aug 13, 2018
@fryeguy52
Contributor Author

The same two tests are also failing in the waterman builds:

  • Trilinos-atdm-waterman-gnu-debug-openmp
  • Trilinos-atdm-waterman-gnu-opt-openmp
  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-waterman-cuda-9.2-debug

as shown here

@gsjaardema
Contributor

This is not related to #2815 and looks to be an actual execution failure (#2815 is a problem with extra output in the testing process, not an execution problem). I will look into this failure; it does not look like a false positive and could indicate a real issue at some level.

@gsjaardema
Contributor

Not sure where the "Steps to Reproduce" is generated, but the cmake step should have "-DTrilinos_ENABLE_SEACAS=ON" instead of "-DTrilinos_ENABLE_Seacas=ON" . (All uppercase SEACAS)

@gsjaardema
Contributor

The failure seems to be related to the NetCDF library loaded by the module netcdf-exo/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88. If I build SEACAS using my own NetCDF-4.6.1 version, or if I use the NetCDF library loaded by the module netcdf-exo/4.6.1/openmpi/2.1.2/ibm/xl/16.1.0/cuda/9.2.88 (even without recompiling, just changing LD_LIBRARY_PATH to point to the other libraries), everything works correctly.
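
One way to confirm which NetCDF library a test binary actually resolves after changing LD_LIBRARY_PATH is to inspect the ldd output. This is a rough sketch only; the helper name resolved_lib and the paths shown in the comments are placeholders, not anything from the Trilinos build.

```shell
# Print the path the dynamic loader would resolve for each shared library
# whose name contains the given substring (hypothetical helper).
resolved_lib() {
  ldd "$1" 2>/dev/null | awk -v pat="$2" 'index($1, pat) && $2 == "=>" { print $3 }'
}

# Example with placeholder paths: compare the result before and after
# prepending a different NetCDF build to the search path.
#   resolved_lib ./some_exodus_test libnetcdf
#   LD_LIBRARY_PATH=/path/to/other/netcdf/lib:$LD_LIBRARY_PATH \
#     resolved_lib ./some_exodus_test libnetcdf
```

If the two calls print different paths, the binary is picking up the alternate library without a rebuild, which is the situation described above.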

@nmhamster Would it be possible to rebuild the NetCDF library pointed to by the openmpi/3.1.0 module above? Note that as of 4.5.1, there is no need for "exodus-specific" modifications and you should be able to build with no modifications to the distributed source code.

@bartlettroscoe
Member

@gsjaardema, can we resolve this by just switching to the module netcdf-exo/4.6.1/openmpi/2.1.2/ibm/xl/16.1.0/cuda/9.2.88 with the current env? If that is the case, we can just update the env file for that platform. If we can't use netcdf-exo/4.6.1/openmpi/2.1.2/ibm/xl/16.1.0/cuda/9.2.88 with the other OpenMPI 3.1.0 libraries, then should we just try using the OpenMPI 2.1.2 env, like we just did on 'waterman' (see #3363)?

@bartlettroscoe
Member

@gsjaardema said:

Not sure where the "Steps to Reproduce" is generated, but the cmake step should have "-DTrilinos_ENABLE_SEACAS=ON" instead of "-DTrilinos_ENABLE_Seacas=ON" . (All uppercase SEACAS)

It was just a typo. These are hand-generated copy-and-paste from:

The hope is that these instructions are so simple that mistakes like that will be easy to spot.

@bartlettroscoe
Member

FYI: As shown here, these tests are no longer failing on 'waterman' after the switch back to OpenMPI 2.1.2. Perhaps that is what we should do for 'white'/'ride' as well?

@gsjaardema
Contributor

I am fine switching to the 2.1.2 version.

@nmhamster
Contributor

@gsjaardema / @bartlettroscoe - where we can, I'd like us to continue to push forward on OpenMPI 3.1.0. Despite some of the challenges, this is where we need to go moving forward for a variety of platforms. I realize that NetCDF 4.6.1 does not require the Exodus changes now but we had continued to perform them in order to match with pNetCDF installs. However, I realize that this may be causing some problems, so we have created a NetCDF 4.6.1 (without Exodus) for testing.

Can you try the following:

module load devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88
module swap netcdf-exo/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 netcdf/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88

This will use the standard devpack but change the NetCDF files over to the unmodified (non Exodus) variants. For now, I'd like to try this on White and Ride and if this works successfully then we can make this a standard change moving forward. Overall this is great news as we can reduce the non-standard code we are running (thanks @gsjaardema for the help).
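
For an environment script, the name of the unmodified module can be derived from the Exodus-patched one so the two stay in sync. This is only a sketch under the assumption that the two module names differ solely in the leading netcdf-exo versus netcdf component, as in the commands above; the variable names are illustrative.

```shell
# Derive the unmodified NetCDF module name from the Exodus-patched one by
# replacing the leading "netcdf-exo/" component with "netcdf/".
exo_mod="netcdf-exo/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88"
plain_mod="netcdf/${exo_mod#netcdf-exo/}"

# In an environment script the swap would then be:
#   module swap "$exo_mod" "$plain_mod"
echo "$plain_mod"
```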

@nmhamster
Contributor

@bartlettroscoe - just replying to your suggestion to use XL modules instead of GCC - in general we strongly recommend against doing so. This mixture of name mangling and function call dependencies makes this challenging.

@gsjaardema
Contributor

gsjaardema commented Aug 29, 2018

Note that as of parallel-netcdf (PNetCDF) 1.9.0, there are no exodus-specific modifications required for that library either. So the recommendations are PNetCDF >= 1.9.0 and NetCDF >= 4.6.1.

@gsjaardema
Contributor

The exodus warnings that triggered this issue are fixed by using the new build of netcdf without the exodus mods. Whether the library had or didn't have the mods should make no difference, so maybe the now passing tests are the result of a newer or better build?

However, I am now getting failures related to Kokkos::initialize as shown below:

TEST_0

Running: "/ascldap/users/projects/ppc64le-pwr8-nvidia/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88/bin/mpiexec" "-np" "1" "-map-by" "socket:PE=4" "/ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell" "/ascldap/users/gdsjaar/trilinos/packages/seacas/libraries/ioss/src/main/test/8-block.g" "8-block32.g"

--------------------------------------------------------------------------------

[ride6:114345] mca_base_component_repository_open: unable to open mca_coll_hcoll: libsharp_coll.so.2: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[1478,1],0] (PID 114345)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetDeviceCount( & m_cudaDevCount ) error( cudaErrorUnknown): unknown error ../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:203
Traceback functionality not available

[ride6:114345] *** Process received signal ***
[ride6:114345] Signal: Aborted (6)
[ride6:114345] Signal code:  (-6)
[ride6:114345] [ 0] [0x3fff97d30478]
[ride6:114345] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x3fff8e701f94]
[ride6:114345] [ 2] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x3fff8ea90774]
[ride6:114345] [ 3] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(+0xab504)[0x3fff8ea8b504]
[ride6:114345] [ 4] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x3fff8ea8b5c0]
[ride6:114345] [ 5] /home/projects/ppc64le/gcc/7.2.0/lib64/libstdc++.so.6(__cxa_throw+0x7c)[0x3fff8ea8ba6c]
[ride6:114345] [ 6] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc4)[0x108b0c9c]
[ride6:114345] [ 7] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl25cuda_internal_error_throwE9cudaErrorPKcS3_i+0x170)[0x108bd66c]
[ride6:114345] [ 8] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl23cuda_internal_safe_callE9cudaErrorPKcS3_i+0x60)[0x1053c940]
[ride6:114345] [ 9] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_00014792_00000000_6_Kokkos_Cuda_Impl_cpp1_ii_a8bc509719CudaInternalDevicesC1Ev+0x58)[0x108bd74c]
[ride6:114345] [10] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_00014792_00000000_6_Kokkos_Cuda_Impl_cpp1_ii_a8bc509719CudaInternalDevices9singletonEv+0x90)[0x108bd8a0]
[ride6:114345] [11] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl12CudaInternal10initializeEii+0x118)[0x108be10c]
[ride6:114345] [12] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Cuda10initializeENS0_12SelectDeviceEm+0x48)[0x108bfae4]
[ride6:114345] [13] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos4Impl70_GLOBAL__N__46_tmpxft_0001467a_00000000_6_Kokkos_Core_cpp1_ii_8f3cc89319initialize_internalERKNS_13InitArgumentsE+0x180)[0x108b14cc]
[ride6:114345] [14] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(_ZN6Kokkos10initializeERiPPc+0xf60)[0x108b2a64]
[ride6:114345] [15] /ascldap/users/gdsjaar/trilinos/build/packages/seacas/libraries/ioss/src/main/io_shell(main+0x9c)[0x1033041c]
[ride6:114345] [16] /lib64/libc.so.6(+0x25100)[0x3fff8e6e5100]
[ride6:114345] [17] /lib64/libc.so.6(__libc_start_main+0xc4)[0x3fff8e6e52f4]
[ride6:114345] *** End of error message ***
-------------------------------------------------------

So, some improvement, but now new issues arising...

@bartlettroscoe
Member

@gsjaardema, what is the next step here then? Do we need to get help from the Kokkos team about issues with Kokkos::initialize()?

@bartlettroscoe
Member

@nmhamster said to try:

module load devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88
module swap netcdf-exo/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 netcdf/4.6.1/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88

@gsjaardema, is this what you meant by "new NetCDF" above?

@gsjaardema
Contributor

@bartlettroscoe Yes.

@bartlettroscoe
Member

@fryeguy52, can we go ahead and update the file Trilinos/cmake/std/atdm/ride/environment.sh to update the modules as per above?

fryeguy52 added a commit to fryeguy52/Trilinos that referenced this issue Sep 10, 2018
issue: trilinos#3288

switch the netcdf module that is loaded for the builds on ride/white
to address some failing tests
@gsjaardema
Contributor

Note that my comment above about tests now failing in Kokkos::initialize is incorrect: I was running on the front-end node instead of on a compute node. When I run on a compute node via:

bsub -Is -n 32 bash
ctest -j8

Then all tests complete successfully when using the netcdf/4.6.1/... module instead of the netcdf-exo/4.6.1/... module
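
A small guard can catch this mistake before launching CUDA tests. This is a sketch, assuming that the front-end node exposes no usable GPU to nvidia-smi while compute nodes do; has_visible_gpu is an illustrative helper, not part of Trilinos or the ATDM scripts.

```shell
# Return 0 if at least one NVIDIA GPU is visible on this node, 1 otherwise.
# has_visible_gpu is an illustrative helper, not part of Trilinos.
has_visible_gpu() {
  command -v nvidia-smi >/dev/null 2>&1 || return 1
  # "nvidia-smi -L" prints one line per device, each starting with "GPU ".
  nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

if has_visible_gpu; then
  echo "GPU visible: CUDA tests can run on this node"
else
  echo "no GPU visible: run ctest inside a compute-node allocation instead"
fi
```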

@bartlettroscoe
Member

@gsjaardema, thanks for pointing that out. Indeed these tests and all SEACAS tests are now fully passing in these CUDA 9.2 builds on 'white'/'ride' as shown here and here (see the little -2 subscript and +2 superscript by the number of failing and passing tests for SEACAS which is shown explicitly, for example, here).

Looks like the problem is solved. We can now close this issue!

@bartlettroscoe
Member

All fixed. Closing as complete!

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Sep 28, 2018
trilinos#3290)

This new env also has the correct netcdf build for SEACAS (see trilinos#3288).
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 2, 2018
trilinos#3290)

This new env also has the correct netcdf build for SEACAS (see trilinos#3288).
@bartlettroscoe
Member

FYI: As part of installing a consistent GCC 7.2.0 + OpenMPI 2.1.2 + CUDA 9.2 + TPLs env on 'white' and 'ride' as part of #3549, @nmhamster determined that the SEACAS tests failing as described in this Issue were not due to a bad NetCDF configuration but were actually due to differences in roundoff on HDF5 when going from -O2 to -O3. That seems like a reasonable thing to do and code should not fail when you do such a thing.

@nmhamster and @gsjaardema, does this indicate a defect in HDF5, NetCDF, or SEACAS (or none of the above)?

@bartlettroscoe added the ATDM Env Issue label on Nov 13, 2018
@bartlettroscoe added the PA: Data Services label on Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
issue: trilinos#3288

switch the netcdf module that is loaded for the builds on ride/white
to address some failing tests
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
trilinos#3290)

This new env also has the correct netcdf build for SEACAS (see trilinos#3288).