KokkosCore_UnitTest_[OpenMP|Serial]_MPI_1 tests fail in most of the GNU and Intel builds on hansen/shiller #2320

Closed
bartlettroscoe opened this issue Mar 2, 2018 · 9 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Kokkos

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 2, 2018

CC: @trilinos/kokkos, @fryeguy52

Next Action Status

Changed KOKKOS_ARCH from BDW to HSW, which fixed all of the tests on all ATDM builds of Trilinos on shiller.

Description

The tests KokkosCore_UnitTest_OpenMP_MPI_1 and KokkosCore_UnitTest_Serial_MPI_1 failed in most of the GNU and Intel ATDM Trilinos builds on hansen today as shown at:

which shows:

| Site | Build Name | Test Name | Status | Time | Details | Build Time |
|---|---|---|---|---|---|---|
| hansen/shiller | Trilinos-atdm-hansen-shiller-intel-debug-openmp | KokkosCore_UnitTest_OpenMP_MPI_1 | Failed | 3.72 | Completed (Failed) | 2018-03-01T10:29:09 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-intel-opt-openmp | KokkosCore_UnitTest_OpenMP_MPI_1 | Failed | 4.24 | Completed (Failed) | 2018-03-01T10:41:40 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-gnu-debug-openmp | KokkosCore_UnitTest_OpenMP_MPI_1 | Failed | 10.42 | Completed (Failed) | 2018-03-01T08:49:48 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-gnu-opt-openmp | KokkosCore_UnitTest_OpenMP_MPI_1 | Failed | 9.96 | Completed (Failed) | 2018-03-01T11:49:31 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-intel-debug-openmp | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 3.72 | Completed (Failed) | 2018-03-01T10:29:09 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-intel-opt-openmp | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 3.74 | Completed (Failed) | 2018-03-01T10:41:40 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-gnu-debug-openmp | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 9.71 | Completed (Failed) | 2018-03-01T08:49:48 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-intel-debug-serial | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 3.82 | Completed (Failed) | 2018-03-01T07:36:18 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-gnu-opt-serial | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 9.77 | Completed (Failed) | 2018-03-01T07:15:18 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-intel-opt-serial | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 3.19 | Completed (Failed) | 2018-03-01T08:50:33 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-gnu-debug-serial | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 9.9 | Completed (Failed) | 2018-03-01T07:10:58 UTC |
| hansen/shiller | Trilinos-atdm-hansen-shiller-gnu-opt-openmp | KokkosCore_UnitTest_Serial_MPI_1 | Failed | 9.16 | Completed (Failed) | 2018-03-01T11:49:31 UTC |

The only builds that the

All of the test failures show:

Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
[==========] Running 84 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 84 tests from openmp
[ RUN      ] openmp.atomic_operations
[       OK ] openmp.atomic_operations (3 ms)
[ RUN      ] openmp.atomic_views_integral
[       OK ] openmp.atomic_views_integral (295 ms)
[ RUN      ] openmp.atomic_views_nonintegral
[       OK ] openmp.atomic_views_nonintegral (176 ms)
[ RUN      ] openmp.atomic_view_api
[       OK ] openmp.atomic_view_api (0 ms)
[ RUN      ] openmp.atomics
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 58263 on node hansen02 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------

This is for the new configuration of ATDM Trilinos that uses KOKKOS_ARCH=BDW. I am not sure why this is an issue, but I think that is all that really changed from the previous ATDM configuration, which set compiler options manually.

We are seeing a lot of other test failures with various types of segmentation faults or other non-clean crashes at:

But it seems prudent to address these failures first, since they could be related to many of the other tests that fail in bad ways like this.

Steps to Reproduce:

The instructions to reproduce these build failures can be found starting at:

and clicking "Reproducing ATDM builds locally" which takes you to:

Basically, on hansen or shiller, you just clone the Trilinos repo (with location depicted as $TRILINOS_DIR below) and check out the develop branch. Then create a build directory and do the configure and build as:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-openmp

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Kokkos=ON \
  $TRILINOS_DIR

$ make NP=16

$ ctest -j16
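
To iterate on just the two failing tests rather than the whole suite, something like the following may help. This is only a sketch: the variable `KOKKOS_FAILING_TESTS` and the function `rerun_failing_kokkos_tests` are hypothetical names introduced here, and setting `OMP_PROC_BIND=false` simply follows the hint printed in the test output.

```shell
# Regex matching only the two failing tests named in this issue.
KOKKOS_FAILING_TESTS='^KokkosCore_UnitTest_(OpenMP|Serial)_MPI_1$'

# Hypothetical helper: run only those tests from inside the build directory
# and print the full output of any test that fails.
rerun_failing_kokkos_tests() {
  # The Kokkos warning in the test output suggests OMP_PROC_BIND=false
  # for unit testing, so set it for this one ctest invocation.
  OMP_PROC_BIND=false ctest -R "$KOKKOS_FAILING_TESTS" --output-on-failure
}
```

Calling `rerun_failing_kokkos_tests` after the build finishes should then run at most those two tests, depending on which ones the chosen build configuration enables.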
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Kokkos client: ATDM Any issue primarily impacting the ATDM project labels Mar 2, 2018
@bartlettroscoe
Member Author

@nmhamster,

Have you seen Kokkos-related tests fail with messages like:

mpiexec noticed that process rank 0 with PID 58263 on node hansen02 exited on signal 4 (Illegal instruction).

shown in this issue?

@nmhamster
Contributor

@bartlettroscoe - can you run in gdb and send me the instruction it traps on at all? This usually means we have compiled some bad instruction into the code. It could possibly be use of transactional memory.
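
For reference, one way to capture the trapping instruction without an interactive session is gdb's batch mode. This is a rough sketch, not a prescribed procedure: `show_trap_instruction` is a hypothetical helper name, and it assumes a `gdb` compatible with the compiler is already on the PATH.

```shell
# Hypothetical helper: run an executable under gdb non-interactively.
# When the program stops on a signal such as SIGILL, gdb prints the
# instruction at the program counter and then a backtrace.
show_trap_instruction() {
  gdb -batch \
    -ex 'run' \
    -ex 'x/i $pc' \
    -ex 'bt' \
    --args "$@"
}
```

It would be invoked as, for example, `show_trap_instruction ./KokkosCore_UnitTest_Serial.exe` from the directory containing the test executable.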

@bartlettroscoe
Member Author

@nmhamster, what module do I need to load that will load a gdb executable that is compatible with the version of GCC loaded with the module devpack/openmpi/2.1.1/gcc/4.9.3/cuda/8.0.61?

@bartlettroscoe
Member Author

I loaded the module:

module load gdb/7.11.0

and ran GDB (from inside emacs) on the KokkosCore_UnitTest_Serial.exe executable and it gave the below output:

Starting program: /home/rabartl/Trilinos.base/BUILDS/GCC-4.9.3/ATDM_GNU_DEBUG_OPENMP/packages/kokkos/core/unit_test/KokkosCore_UnitTest_Serial.exe 
warning: File "/home/projects/x86-64/gcc/4.9.3/lib64/libstdc++.so.6.0.20-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "/usr/share/gdb/auto-load:/usr/lib/debug:/usr/bin/mono-gdb.py".
To enable execution of this file add
	add-auto-load-safe-path /home/projects/x86-64/gcc/4.9.3/lib64/libstdc++.so.6.0.20-gdb.py
line to your configuration file "/home/rabartl/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/rabartl/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
[New Thread 0x7ffff6baa700 (LWP 13274)]
[==========] Running 84 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 84 tests from serial
[ RUN      ] serial.atomic_operations
[       OK ] serial.atomic_operations (2 ms)
[ RUN      ] serial.atomic_views_integral
[       OK ] serial.atomic_views_integral (440 ms)
[ RUN      ] serial.atomic_views_nonintegral
[       OK ] serial.atomic_views_nonintegral (218 ms)
[ RUN      ] serial.atomic_view_api
[       OK ] serial.atomic_view_api (1 ms)
[ RUN      ] serial.atomics

Program received signal SIGILL, Illegal instruction.
0x000000000126e41e in Kokkos::Impl::lock_address_host_space(void*) () at /home/projects/x86-64/gcc/4.9.3/lib/gcc/x86_64-unknown-linux-gnu/4.9.3/include/rtmintrin.h:52
Missing separate debuginfos, use: debuginfo-install libudev-147-2.51.el6.x86_64

Note that the updated EMPIRE configuration of Trilinos is using KOKKOS_ARCH=BDW, and that is what I used to match it. Is that the correct value to use for hansen and shiller with these modules?
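
The trap location is inside GCC's `rtmintrin.h`, the header that provides the RTM (restricted transactional memory) intrinsics, so one quick sanity check is whether the host CPU advertises the `rtm` flag at all. A minimal sketch, assuming a Linux-style `/proc/cpuinfo`; the helper name `cpu_has_flag` is made up for illustration:

```shell
# Hypothetical helper: succeed if the given CPU flag appears on the first
# "flags" line of a /proc/cpuinfo-style file.
# Usage: cpu_has_flag <flag> [cpuinfo_file]
cpu_has_flag() {
  local flag="$1"
  local cpuinfo="${2:-/proc/cpuinfo}"
  # Take the first "flags" line and match the flag as a whole word.
  grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}
```

For example, `cpu_has_flag rtm && echo "rtm advertised" || echo "rtm not advertised"` on a hansen or shiller compute node would show whether code built with RTM instructions can be expected to run there.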

@nmhamster
Contributor

@bartlettroscoe - KOKKOS_ARCH=HSW should be used. The system is not Broadwell.
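
A rough way to double-check which of the two values applies to a given node is to read the CPU model number from `/proc/cpuinfo`. The sketch below is an illustration only: the function names are hypothetical, it covers just Haswell and Broadwell, and the family-6 model numbers are quoted from memory of the Linux kernel's Intel model list, so they should be verified before anyone relies on them.

```shell
# Hypothetical helper: map an Intel family-6 model number (the "model"
# field in /proc/cpuinfo) to a KOKKOS_ARCH value.
# Assumption: the model numbers below should be checked against Intel or
# Linux kernel documentation before use.
kokkos_arch_for_model() {
  case "$1" in
    60|63|69|70) echo HSW ;;      # Haswell client and server models
    61|71|79|86) echo BDW ;;      # Broadwell client and server models
    *)           echo UNKNOWN ;;  # anything else is out of scope here
  esac
}

# Hypothetical helper: print the model number of the first CPU listed in a
# cpuinfo-style file (the "model" line, not the "model name" line).
cpu_model() {
  awk -F': *' '/^model[ \t]*:/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}
```

On a compute node, `kokkos_arch_for_model "$(cpu_model)"` would then print `HSW`, `BDW`, or `UNKNOWN`.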

@bartlettroscoe
Member Author

KOKKOS_ARCH=HSW should be used. The system is not Broadwell.

Yup, that fixed the KokkosCore tests at least. I will run the full set of tests and see how that goes.

If all goes well, a bunch of these failures will go away tomorrow!

bartlettroscoe added a commit that referenced this issue Mar 2, 2018
After talking with Si H., he said that hansen and shiller are Haswell not
Broadwell.
@bartlettroscoe
Member Author

I just pushed the commit fd96bc6 to change to KOKKOS_ARCH=HSW. On shiller I locally built and ran all of the tests for the intel-debug-serial configuration and the final results were:

99% tests passed, 3 tests failed out of 1693

Label Time Summary:
Amesos2          =  11.53 sec (8 tests)
Anasazi          = 114.93 sec (71 tests)
Belos            = 116.27 sec (72 tests)
Ifpack2          =  61.92 sec (35 tests)
Intrepid2        = 295.25 sec (144 tests)
Kokkos           = 168.15 sec (23 tests)
KokkosKernels    = 100.18 sec (4 tests)
MueLu            = 310.84 sec (88 tests)
NOX              = 203.86 sec (105 tests)
Panzer           = 485.08 sec (156 tests)
Phalanx          =  77.91 sec (27 tests)
Piro             =  24.86 sec (12 tests)
Rythmos          = 205.43 sec (83 tests)
SEACAS           =  21.00 sec (12 tests)
Sacado           = 632.70 sec (292 tests)
Stratimikos      =  57.17 sec (40 tests)
Teko             =  61.70 sec (19 tests)
Tempus           = 288.14 sec (35 tests)
Teuchos          = 340.25 sec (131 tests)
Thyra            = 135.38 sec (81 tests)
Tpetra           = 220.84 sec (144 tests)
Xpetra           =  29.07 sec (18 tests)
Zoltan2          = 145.86 sec (96 tests)

Total Test time (real) = 318.41 sec

The following tests FAILED:
        798 - Belos_resolve_gmres_hb_1_MPI_4 (Failed)
        1272 - NOX_Thyra_Heq_MPI_1 (Failed)
        1528 - Piro_MatrixFreeDecorator_UnitTests_MPI_4 (Failed)

This cleaned up three of the failing tests shown at:

I am going to now put this Issue "In review" and we will watch what happens tomorrow just to make sure this test is fixed. Then I will close this.

Thanks @nmhamster!

@bartlettroscoe bartlettroscoe added stage: in review Primary work is completed and now is just waiting for human review and/or test feedback and removed type: bug The primary issue is a bug in Trilinos code or tests labels Mar 2, 2018
@bartlettroscoe
Member Author

Closing as complete for real this time (this is not JIRA so you have to add a comment to close).

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 3, 2018
@bartlettroscoe bartlettroscoe added the PA: Data Services Issues that fall under the Trilinos Data Services Product Area label Nov 30, 2018