Tpetra: Multiple test failures with intel/2021.4, intel/2023.2 (icpc) #11968

Open
ndellingwood opened this issue Jun 12, 2023 · 23 comments
Labels
MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@ndellingwood
Contributor

Bug Report

Test builds on Blake (SKX arch) with intel/2021.4 (icpc, the Intel classic compiler) report multiple test failures.
@trilinos/tpetra

  189 - TpetraCore_MatrixMarket_Tpetra_Map_InOutTest_MPI_4 (Failed)
  190 - TpetraCore_Bug5800_MPI_1 (Failed)
  191 - TpetraCore_Bug6288_MPI_4 (Failed)
  218 - TpetraCore_MatrixMatrix_UnitTests_MPI_4 (Failed)

Steps to Reproduce

  1. SHA1: 7e0c759
  2. Reproducer (Blake testbed):
module load intel/oneAPI/hpc-toolkit/2021.4.0 intel/oneAPI/base-toolkit/2021.4.0 openmpi/4.0.5/intel-oneapi/2021.4.0 cmake/3.25.2 git
module swap gcc/7.2.0 gcc/10.2.0
module load openblas/0.3.21/gcc/10.2.0
module load boost/1.75.0/intel-oneapi/2021.2.0
module load hdf5/1.10.7/openmpi/4.0.5/intel-oneapi/2021.2.0 netcdf-c/4.7.4/openmpi/4.0.5/intel-oneapi/2021.2.0 zlib/1.2.11
export OMPI_CXX="icpc"
export OMPI_CC="icc"
export OMPI_FC="ifort"
export OMPI_F77="ifort"
export OMPI_F90="ifort"

cmake \
 -D CMAKE_CXX_COMPILER="`which mpicxx`" \
 -D CMAKE_C_COMPILER="`which mpicc`" \
 -D CMAKE_CXX_STANDARD="17" \
 -D CMAKE_CXX_FLAGS="-g -no-ip" \
 -D CMAKE_Fortran_COMPILER="mpif77" \
 -D CMAKE_INSTALL_PREFIX="${TRILINOS_INSTALL_DIR}" \
 -D CMAKE_BUILD_TYPE=RELEASE \
\
 -D TPL_ENABLE_MPI=ON \
  -D MPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-bind-to;socket;-map-by;socket" \
\
 -D TPL_ENABLE_BLAS:STRING=ON \
  -D BLAS_LIBRARY_DIRS:FILEPATH=${BLAS_ROOT}/lib \
  -D BLAS_LIBRARY_NAMES:STRING="openblas" \
 -D TPL_ENABLE_LAPACK:STRING=ON \
  -D LAPACK_INCLUDE_DIRS:FILEPATH="${LAPACK_ROOT}/include" \
  -D LAPACK_LIBRARY_DIRS:FILEPATH=${LAPACK_ROOT}/lib \
  -D LAPACK_LIBRARY_NAMES:STRING="openblas" \
-D TPL_ENABLE_Boost=ON \
   -D Boost_INCLUDE_DIRS:PATH="${BOOST_ROOT}/include" \
   -D Boost_LIBRARY_DIRS:PATH="${BOOST_ROOT}/lib" \
-D TPL_ENABLE_BoostLib=ON \
   -D BoostLib_INCLUDE_DIRS:PATH="${BOOST_ROOT}/include" \
   -D BoostLib_LIBRARY_DIRS:PATH="${BOOST_ROOT}/lib" \
-D TPL_ENABLE_Netcdf=ON \
   -D Netcdf_INCLUDE_DIRS:PATH="${NETCDF_ROOT}/include" \
   -D Netcdf_LIBRARY_DIRS:PATH="${NETCDF_ROOT}/lib64" \
  -D TPL_Netcdf_LIBRARIES:PATH="${NETCDF_ROOT}/lib64/libnetcdf.a;${HDF5_ROOT}/lib/libhdf5_hl.a;${HDF5_ROOT}/lib/libhdf5.a;${ZLIB_ROOT}/lib/libz.a" \
  -D TPL_Netcdf_PARALLEL:BOOL=OFF \
-D TPL_ENABLE_HDF5=ON \
  -D HDF5_INCLUDE_DIRS:PATH="${HDF5_ROOT}/include" \
  -D TPL_HDF5_LIBRARIES:PATH="${HDF5_ROOT}/lib/libhdf5_hl.a;${HDF5_ROOT}/lib/libhdf5.a;${ZLIB_ROOT}/lib/libz.a" \
-D TPL_ENABLE_Zlib=ON \
  -D Zlib_INCLUDE_DIRS:PATH="${ZLIB_ROOT}/include" \
  -D TPL_Zlib_LIBRARIES:PATH="${ZLIB_ROOT}/lib/libz.a" \
-D TPL_ENABLE_DLlib=ON \
-D TPL_ENABLE_Matio=OFF \
-D TPL_ENABLE_X11=OFF \
\
 -D Trilinos_ENABLE_TESTS=OFF \
 -D Trilinos_ENABLE_EXAMPLES=OFF \
 -D Trilinos_ENABLE_COMPLEX=ON \
 -D Trilinos_ENABLE_OpenMP=ON \
\
  -D Trilinos_ENABLE_Amesos=ON \
   -D Amesos_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Kokkos=ON \
  -D Kokkos_ENABLE_SERIAL=ON \
  -D Kokkos_ENABLE_OPENMP=ON \
  -D Kokkos_ARCH_SKX=ON \
  -D Trilinos_ENABLE_Intrepid=ON \
   -D Intrepid_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_ROL=ON \
   -D ROL_ENABLE_TESTS=OFF \
 \
  -D Trilinos_ENABLE_Ifpack2=ON \
   -D Ifpack2_ENABLE_TESTS=ON \
  -D Trilinos_ENABLE_Amesos2=ON \
   -D Amesos2_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Kokkos=ON \
  -D Kokkos_ENABLE_SERIAL=ON \
  -D Kokkos_ARCH_SKX=ON \
   -D Kokkos_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_KokkosKernels=ON \
   -D KokkosKernels_ENABLE_TESTS=ON \
  -D Trilinos_ENABLE_Tpetra=ON \
   -D Tpetra_ENABLE_TESTS=ON \
  -D Trilinos_ENABLE_Sacado=ON \
   -D Sacado_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Stokhos=ON \
   -D Stokhos_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Zoltan2=ON \
   -D Zoltan2_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Intrepid2=OFF \
   -D Intrepid2_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Belos=ON \
   -D Belos_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Anasazi=ON \
   -D Anasazi_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_Teuchos=ON \
   -D Teuchos_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_MueLu=ON \
   -D MueLu_ENABLE_TESTS=ON \
  -D Trilinos_ENABLE_Panzer=ON \
   -D Panzer_ENABLE_TESTS=ON \
  -D Trilinos_ENABLE_Phalanx=ON \
   -D Phalanx_ENABLE_TESTS=OFF \
  -D Trilinos_ENABLE_STKMesh:BOOL=ON \
  -D Trilinos_ENABLE_STKSimd:BOOL=ON \
  -D Trilinos_ENABLE_STKTransfer:BOOL=ON \
  -D Trilinos_ENABLE_STKSearch:BOOL=ON \
  -D Trilinos_ENABLE_STKUtil:BOOL=ON \
  -D Trilinos_ENABLE_STKTopology:BOOL=ON \
  -D Trilinos_ENABLE_STKIO:BOOL=OFF \
\
  -D Trilinos_ENABLE_SEACAS=OFF \
$TRILINOS_DIR
@csiefer2
Member

There are a ton of tests that hang with 2021.3 w/ OpenMP (which is what I have easy access to).

@csiefer2
Member

@ndellingwood I was looking at Bug5800 and something is making Kokkos::parallel_scan() hang. Not sure how to proceed.

@ndellingwood
Contributor Author

@csiefer2 can you distill it down to a simple reproducer to submit as an issue to Kokkos?

@ndellingwood
Contributor Author

@csiefer2 here is the output I had for the Bug5800 test when I posted. It did not hang, but the failure output may be dated:

190/489 Test #190: TpetraCore_Bug5800_MPI_1 ....................................................................***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
]  0.60 sec
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              blake13
  Local adapter:           hfi1_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   blake13
  Local device: hfi1_0
--------------------------------------------------------------------------
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name blake13.sandia.gov and rank 0!

***
*** Unit test suite ...
***


Sorting tests by group name then by the order they were added ... (time = 2.31e-06)

Running unit tests ...

Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false

0. Tpetra_MatrixMarket_int_longlong_MultiVector_Output_Perm_UnitTest ...
 Test with 1 process

 p=0: *** Caught standard std::exception of type 'std::runtime_error' :

  /ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:4708:

  Throw number = 2

  Throw test that evaluated to true: globalReadDataSuccess == 0

  Failed to read the multivector's data: /ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:4526:

  Throw number = 1

  Throw test that evaluated to true: count >= X_view.size()

  The Matrix Market input stream has more data in it than its metadata reported.  Current line number is 4.
 [FAILED]  (0.012 sec) Tpetra_MatrixMarket_int_longlong_MultiVector_Output_Perm_UnitTest
 Location: /ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/test/inout/Bug5800.cpp:278


The following tests FAILED:
    0. Tpetra_MatrixMarket_int_longlong_MultiVector_Output_Perm_UnitTest ...

@ndellingwood
Contributor Author

@csiefer2 that was also on Blake before the DST and hardware + module overhaul

@csiefer2
Member

> @csiefer2 can you distill it down to a simple reproducer to submit as an issue to Kokkos?

I suspect that won't work. But I suppose I can try.

@ndellingwood
Contributor Author

@csiefer2 I tested a serial build on (new) Blake using icpc (Intel classic compiler) and intel/oneapi/2023.2.0, and reproduced the TpetraCore_Bug5800 and TpetraCore_MatrixMarket_Tpetra_Map_InOutTest failures. This was a build without MPI, all single process. Both tests failed with similar output (posted below). I'll add my configuration script; hopefully it is useful for a reproducer:

TpetraCore_MatrixMarket_Tpetra_Map_InOutTest

432/754 Test #432: TpetraCore_MatrixMarket_Tpetra_Map_InOutTest ....................................................***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
...
 p=0: *** Caught standard std::exception of type 'std::runtime_error' :

  /home/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:5568:

  Throw number = 3

  Throw test that evaluated to true: readSuccess != 1

  Tpetra::MatrixMarket::readMap: Reading the Map failed with the following exception message: /home/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:4708:

  Throw number = 2

  Throw test that evaluated to true: globalReadDataSuccess == 0

  Failed to read the multivector's data: /home/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:4526:

  Throw number = 1

  Throw test that evaluated to true: count >= X_view.size()

  The Matrix Market input stream has more data in it than its metadata reported.  Current line number is 11.

 [FAILED]  (0.00453 sec) MapOutputInput_int_longlong_ContigUniformIndexBase0_UnitTest
...

TpetraCore_Bug5800

433/754 Test #433: TpetraCore_Bug5800 ..............................................................................***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
...
0. Tpetra_MatrixMarket_int_longlong_MultiVector_Output_Perm_UnitTest ...
 Test with 1 process

 p=0: *** Caught standard std::exception of type 'std::runtime_error' :

  /home/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:4708:

  Throw number = 2

  Throw test that evaluated to true: globalReadDataSuccess == 0

  Failed to read the multivector's data: /home/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/inout/MatrixMarket_Tpetra.hpp:4526:

  Throw number = 1

  Throw test that evaluated to true: count >= X_view.size()

  The Matrix Market input stream has more data in it than its metadata reported.  Current line number is 4.
 [FAILED]  (0.0115 sec) Tpetra_MatrixMarket_int_longlong_MultiVector_Output_Perm_UnitTest
 Location: /home/ndellin/trilinos/Trilinos-pristine/packages/tpetra/core/test/inout/Bug5800.cpp:278


The following tests FAILED:
    0. Tpetra_MatrixMarket_int_longlong_MultiVector_Output_Perm_UnitTest ...

Trilinos SHA ff92fc9
Blake (all queue) reproducer, Serial backend:

module load cmake intel-oneapi-compilers/2023.2.0 intel-oneapi-mkl/2023.2.0

export BLAS_LIBRARIES="-mkl;${MKLROOT}/lib/intel64/libmkl_intel_lp64.a;${MKLROOT}/lib/intel64/libmkl_intel_thread.a;${MKLROOT}/lib/intel64/libmkl_core.a"
export LAPACK_LIBRARIES=${BLAS_LIBRARIES}
cmake \
  -D CMAKE_CXX_COMPILER="`which icpc`" \
  -D CMAKE_C_COMPILER="`which icc`" \
  -D CMAKE_Fortran_COMPILER="`which ifort`" \
  -D CMAKE_CXX_FLAGS="-g -no-ip" \
  -D CMAKE_C_FLAGS="-g -no-ip" \
  -DTPL_ENABLE_MPI=OFF \
  -DTPL_ENABLE_BLAS:BOOL=ON \
  -DTPL_BLAS_LIBRARIES:PATH="${BLAS_LIBRARIES}" \
  -DTPL_LAPACK_LIBRARIES:PATH="${LAPACK_LIBRARIES}" \
  -DTPL_ENABLE_LAPACK:BOOL=ON \
  -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
  -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_MUST_FIND_ALL_TPL_LIBS=TRUE \
  -DTrilinos_ENABLE_OpenMP=OFF \
  -DTrilinos_ENABLE_Kokkos=ON \
  -D Kokkos_ENABLE_SERIAL=ON \
   -D Kokkos_ENABLE_TESTS=ON \
  -D Kokkos_ARCH_SKX=ON \
  -DTrilinos_ENABLE_KokkosKernels=ON \
   -D KokkosKernels_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Tpetra=ON \
   -D Tpetra_ENABLE_TESTS=ON \
\
  -DTPL_ENABLE_Matio=OFF \
\
$TRILINOS_DIR

@csiefer2
Member

Well, there's always the hope that fixing the 2023.2 issue on Blake will fix the 2021.3 hang :)

I'll see what I can do

@csiefer2
Member

csiefer2 commented Sep 13, 2023

@ndellingwood This really looks like a compiler bug or a UMR (uninitialized memory read):

The code:

     dims[1] = theNumCols; // Save the number of columns
     printf("CMS: theNumCols = %d dims[1] = %d\n",theNumCols,dims[1]);

The output:

CMS: theNumCols = 6 dims[1] = 0

I might add that dims[1] being wrong is what breaks the matrix reader.
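
For reference, here is a minimal sketch of why a zero column count would produce the "more data in it than its metadata reported" error seen in the test output above. It assumes the reader sizes its buffer from the header dimensions and then applies a check like `count >= X_view.size()`; `readDense` and `readFails` are hypothetical names for illustration, not the Tpetra implementation.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified dense reader: NOT the Tpetra code.
// It sizes its buffer from the header dimensions and rejects extra data,
// mirroring the "count >= X_view.size()" check in the failure output.
std::vector<double> readDense(std::istream& in, std::size_t numRows,
                              std::size_t numCols) {
  std::vector<double> X_view(numRows * numCols);
  std::size_t count = 0;
  double value = 0.0;
  while (in >> value) {
    if (count >= X_view.size()) {
      throw std::runtime_error(
          "input stream has more data in it than its metadata reported");
    }
    X_view[count++] = value;
  }
  return X_view;
}

// Returns true if reading `text` with the given dimensions throws.
bool readFails(const std::string& text, std::size_t numRows,
               std::size_t numCols) {
  std::istringstream in(text);
  try {
    readDense(in, numRows, numCols);
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

Under these assumptions, `readFails("1 2 3 4 5 6", 1, 6)` is false, while `readFails("1 2 3 4 5 6", 1, 0)` is true: with a column count of 0 the buffer is empty, so the very first value trips the check.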

@csiefer2
Member

@ndellingwood Do you have a working valgrind on blake?

@ndellingwood
Contributor Author

> The output:
>
> CMS: theNumCols = 6 dims[1] = 0

Yuck, that's odd... I'm curious what is the type of dims?

I peeked on Blake; the current versions of valgrind available are only for gcc compilers (the intel-oneapi installs are pretty new, so I think adding additional compatible modules is still WIP)

@csiefer2
Member

csiefer2 commented Sep 13, 2023

std::vector<GO> or Tuple<GO>. I tried both.

I'm not going to be able to answer compiler bug vs. UMR without a memory debugger and I can't get valgrind to build correctly myself.
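
One side note on the debug print earlier in the thread: if GO is a 64-bit type (the failing tests are the int_longlong instantiations, which suggests `long long`, but that is an assumption here), then `%d` is formally the wrong printf conversion for `dims[1]`, and mismatched conversions are undefined behavior. That would not by itself explain the reader failure, but it makes the print less trustworthy as evidence. A minimal sketch of a type-safe variant follows; `formatDims` is a hypothetical helper, and the real type of `theNumCols` is not shown in the thread.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Assumption: GO is a 64-bit global ordinal such as long long.
using GO = long long;

// Format both values with conversions that match the argument types.
// Casting to long long keeps the call well-defined whatever integer
// type theNumCols actually has in the source.
template <class ColsType>
std::string formatDims(ColsType theNumCols, const std::vector<GO>& dims) {
  char buf[128];
  std::snprintf(buf, sizeof(buf), "CMS: theNumCols = %lld dims[1] = %lld",
                static_cast<long long>(theNumCols),
                static_cast<long long>(dims[1]));
  return std::string(buf);
}
```

Calling `formatDims(theNumCols, dims)` right after the assignment and printing the result should then show the stored value without relying on how the platform happens to pass mismatched variadic arguments.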

@jhux2
Member

jhux2 commented Sep 13, 2023

Does Intel have address sanitizer support?

@csiefer2
Member

@jhux2 I don't think so.

@jhux2
Member

jhux2 commented Sep 13, 2023

It looks like OneAPI's icx is based on LLVM and might support asan.

[edit]

https://www.intel.com/content/www/us/en/developer/articles/technical/getting-to-know-llvm-based-oneapi-compilers.html#gs.548awq

@ndellingwood
Contributor Author

@jhux2 I hit these failures with the intel classic compiler (icpc) but have not tested with icpx. Let me try out a build, if they reproduce with icpx then the asan utils could be a good tool to explore

@jhux2
Member

jhux2 commented Sep 13, 2023

> @jhux2 I hit these failures with the intel classic compiler (icpc) but have not tested with icpx. Let me try out a build, if they reproduce with icpx then the asan utils could be a good tool to explore

@ndellingwood I should say that I'm not 100% sure whether icpx supports asan. Googling provided hints, but I didn't find any definitive documentation.

@csiefer2
Member

The test valgrinds clean w/ gcc 10 on my desktop.

@ndellingwood
Contributor Author

@csiefer2 that's good to know, I've only seen these test failures occur with intel icpc compilers so that might add a nudge toward some compiler wonkiness at play?

@csiefer2
Member

Maybe? I'll try with the SEMS 2021.3 and see if that fails in the same way and if I can valgrind that.

@csiefer2
Member

csiefer2 commented Sep 18, 2023

Those tests pass with 2021.3 on my desktop (Serial backend). So I'm more seriously thinking compiler bug. @ndellingwood

@ndellingwood
Contributor Author

Just updating the issue: I'm seeing similar failures for Serial and OpenMP builds with the intel-oneapi-compilers/2023.2.0 and mkl modules on Blake when using icpc (Intel classic compiler). I'm not sure why the file read/write is causing issues with this compiler, but I'm noting it here for reference:

# Blake all queue - non-mpi build

# Environment
module load cmake intel-oneapi-compilers/2023.2.0 intel-oneapi-mkl/2023.2.0
module list

export TRILINOS_DIR=<path-to-source>

export BLAS_LIBRARIES="-mkl;${MKLROOT}/lib/intel64/libmkl_intel_lp64.a;${MKLROOT}/lib/intel64/libmkl_intel_thread.a;${MKLROOT}/lib/intel64/libmkl_core.a"
export LAPACK_LIBRARIES=${BLAS_LIBRARIES}

# Configure Trilinos
cmake \
  -D CMAKE_INSTALL_PREFIX="${PWD}/install" \
  -D CMAKE_CXX_COMPILER="`which icpc`" \
  -D CMAKE_C_COMPILER="`which icc`" \
  -D CMAKE_Fortran_COMPILER="`which ifort`" \
  -D CMAKE_CXX_FLAGS="-g -no-ip" \
  -D CMAKE_C_FLAGS="-g -no-ip" \
  -DTPL_ENABLE_MPI=OFF \
  -DTPL_ENABLE_BLAS:BOOL=ON \
  -DTPL_BLAS_LIBRARIES:PATH="${BLAS_LIBRARIES}" \
  -DTPL_LAPACK_LIBRARIES:PATH="${LAPACK_LIBRARIES}" \
  -DTPL_ENABLE_LAPACK:BOOL=ON \
  -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
  -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
  -DTrilinos_ENABLE_TESTS=OFF \
  -DTrilinos_MUST_FIND_ALL_TPL_LIBS=TRUE \
  -DTrilinos_ENABLE_COMPLEX=ON \
  -DTrilinos_ENABLE_OpenMP=OFF \
  -DTrilinos_ENABLE_Kokkos=ON \
   -D Kokkos_ENABLE_SERIAL=ON \
   -D Kokkos_ARCH_SKX=ON \
  -DTrilinos_ENABLE_KokkosKernels=ON \
  -DTrilinos_ENABLE_Tpetra=ON \
   -D Tpetra_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Ifpack2=ON \
   -D Ifpack2_ENABLE_TESTS=ON \
\
  -DTPL_ENABLE_Matio=OFF \
\
  -DTrilinos_ENABLE_INSTALLATION_TESTING=OFF \
$TRILINOS_DIR

@ndellingwood ndellingwood changed the title Tpetra: Multiple test failures with intel/2021.4 (icpc) and OpenMP backend Tpetra: Multiple test failures with intel/2021.4, intel/2023.2 (icpc) Nov 17, 2023

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Feb 16, 2025
Projects
Status: Backlog