Three ShyLU_DDFROSch_test_frosch_XXX tests failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2691

Closed
bartlettroscoe opened this issue May 8, 2018 · 48 comments
Labels
  • ATDM Sev: Nonblocker (Problems with Trilinos that should not block ATDM APPs from getting updates)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • CLOSED_DUE_TO_INACTIVITY (Issue or PR has been closed by the GitHub Actions bot due to inactivity.)
  • Disabled Tests (Issue has been partially addressed by disabling *all* of the failing tests related to the issue)
  • MARKED_FOR_CLOSURE (Issue or PR is marked for auto-closure by the GitHub Actions bot.)
  • PA: Linear Solvers (Issues that fall under the Trilinos Linear Solvers Product Area)
  • pkg: ShyLU
  • pkg: Xpetra
  • Stalled (Issue may have been worked some but is not completed and/or is otherwise stalled for some reason)
  • type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

bartlettroscoe commented May 8, 2018

CC: @trilinos/shylu, @trilinos/framework , @srajama1

Description

As shown at:

the tests:

  • ShyLU_DDFROSch_test_frosch_interfacesets_2D_MPI_4
  • ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_gdsw_MPI_4
  • ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_rgdsw_MPI_4

are failing in the new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (run on the SNL COE RHEL6 machine crf450, with results submitted to CDash).

This build is getting cleaned up to provide the GCC 4.8.4 auto PR build described in #2317 and #2462.

These tests all fail by throwing the exception shown below:

terminate called after throwing an instance of 'Xpetra::Exceptions::RuntimeError'
Xpetra::Exceptions::RuntimeError'
  what():  /ascldap/users/rabartl/Trilinos.base/NightlyBuilds/SRC_AND_BUILD/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_EpetraCrsMatrix.hpp:222:

Throw number = 1

Throw test that evaluated to true: true

Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)

This then terminates the test program.

Steps to reproduce

One should be able to reproduce these failing tests on any SNL COE RHEL6 machine that has the SEMS env. For example, on the CEE machine 'ceerws1113', I reproduced this by updating Trilinos and then doing:

$ cd <some-build-dir>/

$ source <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP_env.sh

$ module list
Currently Loaded Modulefiles:
  1) sems-env
  2) atdm-env
  3) sems-python/2.7.9
  4) atdm-cmake/3.11.1
  5) sems-git/2.10.1
  6) atdm-ninja_fortran/1.7.2
  7) sems-gcc/4.8.4
  8) sems-openmpi/1.10.1
  9) sems-boost/1.63.0/base
 10) sems-zlib/1.2.8/base
 11) sems-hdf5/1.8.12/parallel
 12) sems-netcdf/4.4.1/exo_parallel
 13) sems-parmetis/4.0.3/parallel
 14) sems-scotch/6.0.3/nopthread_64bit_parallel
 15) sems-superlu/4.3/base

$ which cmake
/projects/sems/install/rhel6-x86_64/atdm/binary-install/cmake-3.11.1-Linux-x86_64/bin/cmake

$ rm -r CMake*

$ time cmake \
  -C <trilinos-dir>/cmake/std/GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ShyLU_DD=ON \
  <trilinos-dir> \
  &> configure.out

real    0m22.379s
user    0m13.932s
sys     0m5.872s

$ time make -j16 &> make.out

real    34m48.506s
user    310m18.610s
sys     19m41.674s

$ time ctest -j16 &> ctest.out

real    0m4.584s
user    0m17.113s
sys     0m4.140s

This produced the test results:

$ grep -A 100 "tests failed out of" ctest.out 
40% tests passed, 3 tests failed out of 5

Label Time Summary:
ShyLU_DD    =  14.19 sec (5 tests)

Total Test time (real) =   4.56 sec

The following tests FAILED:
          1 - ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_gdsw_MPI_4 (Failed)
          2 - ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_rgdsw_MPI_4 (Failed)
          5 - ShyLU_DDFROSch_test_frosch_interfacesets_2D_MPI_4 (Failed)
Errors while running CTest

The output from these failing tests seems to show the same throw and termination:

terminate called after throwing an instance of 'Xpetra::Exceptions::RuntimeError'
  what():  /scratch/rabartl/Trilinos.base/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_EpetraCrsMatrix.hpp:222:

Throw number = 1

Throw test that evaluated to true: true

Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)

Related Issues

@bartlettroscoe bartlettroscoe added pkg: ShyLU client: ATDM Any issue primarily impacting the ATDM project labels May 8, 2018
@bartlettroscoe
Member Author

@trilinos/shylu or @trilinos/xpetra developers,

Any idea what is causing the exception:

terminate called after throwing an instance of 'Xpetra::Exceptions::RuntimeError'
  what():  /scratch/rabartl/Trilinos.base/Trilinos/packages/xpetra/src/CrsMatrix/Xpetra_EpetraCrsMatrix.hpp:222:

Throw number = 1

Throw test that evaluated to true: true

Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)

which is causing these tests to fail?

@srajama1
Contributor

srajama1 commented May 8, 2018

This is strange as the ordinals configured are int and long long.

@trilinos/xpetra
@searhein

@srajama1
Contributor

srajama1 commented May 8, 2018

The test uses int, int
See https://github.com/trilinos/Trilinos/blob/master/packages/shylu/shylu_dd/frosch/test/TestInterfaceSets/main.cpp

The error seems to be a catch all. This seems like an Xpetra configuration issue to me.

@bartlettroscoe
Member Author

@srajama1 said:

The test uses int, int
See https://github.com/trilinos/Trilinos/blob/master/packages/shylu/shylu_dd/frosch/test/TestInterfaceSets/main.cpp

The error seems to be a catch all. This seems like an Xpetra configuration issue to me.

@trilinos/xpetra developers,

What does the error:

Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)

mean? Does it say that a Kokkos Serial build requires GlobalOrdinal to be int, and that a Kokkos OpenMP build requires GlobalOrdinal to be long long int? I don't really understand what is going on at:

Do we need to turn on the "Refactor" option (see #2674) to get this to work?

@mhoemmen
Contributor

mhoemmen commented May 8, 2018

@trilinos/muelu

@tawiesn
Contributor

tawiesn commented May 8, 2018

The error

Xpetra::EpetraCrsMatrix only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)

means that you can build Epetra with only one of the following configuration combinations at the same time:

  • GO=int with Serial node (default)
  • GO=long long with Serial node (this would use Epetra64)
  • GO=int with OpenMP node (standard Epetra with the internal OpenMP extensions enabled, i.e. EPETRA_HAVE_OPENMP = true)
  • GO=long long with OpenMP node (Epetra64 with EPETRA_HAVE_OPENMP set to true)

If you enable more than one valid configuration combination, you should get a default one (e.g. if you enable both the OpenMP and Serial nodes, I think you get an OpenMP-enabled Epetra).

This also means that you cannot enable Epetra unless at least one of the above configuration combinations is enabled. For example, you cannot build Epetra with GO=int when only the Cuda node is enabled (since Epetra does not work on Cuda). You would need to enable the OpenMP or Serial node in addition to Cuda.

Does that make sense? How could we improve the above error message?
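
The rule in the list above can be written down as a compile-time check. This is only a sketch under stated assumptions: `SerialNode`, `OpenMPNode`, `CudaNode`, and `IsEpetraCombo` are hypothetical stand-ins, not Xpetra or Kokkos types, and the sketch does not model which single combination a given build actually enabled.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the Kokkos node types (not the real Trilinos classes).
struct SerialNode {};
struct OpenMPNode {};
struct CudaNode {};

// True only when (GO, Node) is one of the four listed combinations:
// GO is int or long long, and Node is Serial or OpenMP.
template <class GO, class Node>
struct IsEpetraCombo
    : std::integral_constant<
          bool,
          (std::is_same<GO, int>::value || std::is_same<GO, long long>::value) &&
              (std::is_same<Node, SerialNode>::value ||
               std::is_same<Node, OpenMPNode>::value)> {};

// A listed combination is accepted; a Cuda-only combination is rejected.
static_assert(IsEpetraCombo<int, OpenMPNode>::value,
              "GO=int with OpenMP node is a listed combination");
static_assert(!IsEpetraCombo<int, CudaNode>::value,
              "GO=int with Cuda node only is not a listed combination");
```

A check like this fails at compile time instead of throwing at run time, which is the main reason to prefer it where the types are fixed in the source.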

@tawiesn
Contributor

tawiesn commented May 8, 2018

@trilinos/xpetra

@bartlettroscoe
Member Author

This also means that you cannot enable Epetra unless at least one of the above configuration combinations is enabled. For example, you cannot build Epetra with GO=int when only the Cuda node is enabled (since Epetra does not work on Cuda). You would need to enable the OpenMP or Serial node in addition to Cuda.

@tawiesn, thanks for the explanation. Does that mean that these ShyLU_DD tests should be disabled for OpenMP builds? Or how else would we fix these ShyLU_DD tests so that they run?

How could we improve the above error message?

It would be good to report the actual types being used. You should know that at compile-time.

@srajama1
Contributor

srajama1 commented May 8, 2018

@bartlettroscoe : I would rather not disable the test, but find the root cause and solve it.

@bartlettroscoe
Member Author

@bartlettroscoe : I would rather not disable the test, but find the root cause and solve it.

@srajama1, I agree. But who has the time to fix this? These three tests (and one Teko test that I will post an issue for next) are the only failing tests blocking us from getting an automated PR test running that actually enables and runs with OpenMP so we can't let this sit too long. See #2317 and #2462.

@srajama1
Contributor

srajama1 commented May 8, 2018

If it is automated PR testing, it is even more important that we don't disable these tests. We have gone without PR testing for so long that waiting a few days for this shouldn't be a problem. Let me know if that is the case.

Why do we enable Epetra OpenMP in the first place? This is really an experimental feature that we tried at some point for different use cases. Can we disable OpenMP for Epetra?

@tawiesn : Is the fix that, if Trilinos OpenMP is enabled, we should use Xpetra with Epetra matrices and OpenMP from ShyLU? Do I understand that right?

@bartlettroscoe
Member Author

If it is automated PR testing, it is even more important that we don't disable these tests. We have gone without PR testing for so long that waiting a few days for this shouldn't be a problem. Let me know if that is the case.

@srajama1, these tests would get run in the Intel 17.0.1 build that uses a Serial Kokkos node. Right now we have zero threaded testing in PR testing.

@tawiesn
Contributor

tawiesn commented May 8, 2018

@srajama1 I just looked at the source code of one of the ShyLU Frosch tests and found things like

typedef unsigned UN;
typedef double SC;
typedef int LO;
typedef int GO;
typedef Kokkos::Compat::KokkosSerialWrapperNode EpetraNode; // Use the default here???
typedef EpetraNode NO;

That is, the test only works/compiles if GO=int and NO=SerialNode are enabled in the Trilinos configuration. I guess the easiest fix would be to add appropriate guards in the CMakeLists.txt file for the test which enables the test only if both GO=int and NO=Serial are enabled in the Trilinos configuration. We did something similar for some very old MueLu tests. The goal should always be to write code with Xpetra that is independent of the concrete underlying linear algebra package. However, that is not possible if you start using Epetra_Comm etc. in your tests...

Anyway: without the guards, the test will not compile for any Trilinos test configuration where either GO=int or NO=Serial is missing...
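
Besides guarding the test in CMakeLists.txt, the node type in the typedefs above could follow the configuration rather than being fixed to the Serial node. A minimal sketch, assuming a hypothetical macro `EXAMPLE_EPETRA_USES_OPENMP` defined by the build for OpenMP configurations and hypothetical stand-in node types; the real macro and type names in Trilinos may differ:

```cpp
#include <type_traits>

// Hypothetical stand-ins for the Kokkos node types (not the real Trilinos classes).
struct SerialNode {};
struct OpenMPNode {};

// EXAMPLE_EPETRA_USES_OPENMP is a hypothetical macro that a configure step
// would define for OpenMP builds; it is not a real Trilinos macro.
#if defined(EXAMPLE_EPETRA_USES_OPENMP)
typedef OpenMPNode EpetraNode;
#else
typedef SerialNode EpetraNode;
#endif

typedef unsigned UN;
typedef double SC;
typedef int LO;
typedef int GO;
typedef EpetraNode NO;  // follows the configuration instead of assuming Serial

// Sanity check: one of the two nodes that Epetra supports was selected.
static_assert(std::is_same<NO, SerialNode>::value ||
                  std::is_same<NO, OpenMPNode>::value,
              "NO must be the Serial or OpenMP node");
```

The rest of the test would keep using `NO` unchanged, so only the selection at the top depends on the build configuration.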

@srajama1
Contributor

srajama1 commented May 8, 2018

@tawiesn : Yes, this was my understanding as well. However, I was thinking of adding guards not in CMakeLists.txt but in the test itself, to allow NO to be the OpenMP node. I think this should work, but let us wait for @searhein to comment on which path he wants to take.

@srajama1
Contributor

srajama1 commented May 8, 2018

@bartlettroscoe : Understood. Having these tests run in threaded mode as well will be useful.

@bartlettroscoe
Member Author

@tawiesn said:

I guess the easiest fix would be to add appropriate guards in the CMakeLists.txt file for the test which enables the test only if both GO=int and NO=Serial are enabled in the Trilinos configuration.

That is basically disabling these three tests at configure time when OpenMP is used.

@srajama1 said:

@bartlettroscoe : Understood. Having these tests run in threaded mode as well will be useful.

I agree. Does a ShyLU developer have time to get these tests running for OpenMP as well? Just go with the default Kokkos node which should be known at configure time and compile time.

@srajama1
Contributor

srajama1 commented May 8, 2018

@bartlettroscoe : We will take care of this.

@searhein
Contributor

searhein commented May 9, 2018

@bartlettroscoe @srajama1 I will start working on this issue on Friday and hopefully fix it by the beginning of next week. Tomorrow is a holiday in Germany, so I will not be in the office.
Is this OK?

@srajama1
Contributor

srajama1 commented May 9, 2018

@searhein Next week is totally fine. I was asking for your input mainly to see how much work it is to add the OpenMP Epetra node. I can work on this as well, if you are busy.

@bartlettroscoe
Member Author

FYI: With the failing Teko test now resolved (see #2712), these failing ShyLU_DD tests are now the only failing tests for the new GCC 4.8.4 + OpenMP build as shown today at:

@srajama1
Contributor

@bartlettroscoe : Thanks! We will take care of this.

@searhein
Contributor

@bartlettroscoe @ndellingwood I also think that disabling the tests is Ok for now. Since I do not have access to the SNL COE RHEL6 SEMS machines, it is hard for me to reproduce and fix the errors. I hope that @srajama1 and I can figure this out soon.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 30, 2018
These disables will allows this build to be promoted to the CI build and an
auto PR build (see trilinos#2462).
@bartlettroscoe
Member Author

@trilinos/shylu,

I created PR #2841 that implements these targeted disables. Can someone on the ShyLU team please approve that PR?

Also note that five ShyLU_DD tests are not getting enabled because NUM_MPI_PROCS='8' > MPI_EXEC_MAX_NUMPROCS='4'. If you want these tests to run as part of auto PR testing, the value of MPI_EXEC_MAX_NUMPROCS will need to be raised to at least 8. This is something you would have to ask the @trilinos/framework team about. And we would need to test all of Trilinos with this raised value of MPI_EXEC_MAX_NUMPROCS=8. Personally, given modern machines, I think that is quite reasonable.

@mhoemmen
Contributor

@bartlettroscoe I just approved the PR but @ndellingwood is on it too :D

@mhoemmen
Contributor

@bartlettroscoe This issue of tests that require > 4 MPI processes came up recently in Tpetra; see #2564. My approach was to build the executable once, but set up two separate tests, one that requires at most 4 MPI processes and another that requires more. I can see reasons to have tests that need more than 4 MPI processes, but it's important to have some tests that will always run by default, even on laptops etc.

mhoemmen pushed a commit that referenced this issue May 30, 2018
These disables will allows this build to be promoted to the CI build and an
auto PR build (see #2462).
@mhoemmen
Contributor

@bartlettroscoe I just merged your PR :)

@bartlettroscoe bartlettroscoe added Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue Stalled Issue may have been worked some but is not completed and/or is otherwise stalled for some reason labels May 30, 2018
@bartlettroscoe
Member Author

bartlettroscoe commented May 30, 2018

The build GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP is now 100% clean after the disable of these three tests as shown at:

and in more detail at:

See the little -3 subscript under the 0 in the "Fail" test column in the "ShyLU_DD" row.

And these three tests are shown with Status "Missing" at:

I will now mark this issue with the labels "Disabled Tests" and "Stalled". This can now be fixed offline.

@bartlettroscoe
Member Author

bartlettroscoe commented Jun 1, 2018

@trilinos/shylu

To re-enable these failing tests (and then fix them), create a local branch called something like 2691-shylu-dd-fix and then revert the commit 06bbebd as:

$ git revert 06bbebdd9d87308712a8e6d881ea1d7b800037c0

Then follow the "Steps to reproduce" instructions above to verify whether the tests are fixed. If anyone has a question about this, please let me know.

prwolfe added a commit to prwolfe/Trilinos that referenced this issue Jun 21, 2018
Note that this commit include disabling tests documented
in issue trilinos#2712 and trilinos#2691 and that those should be re-enabled
when those issues are resolved.
prwolfe added a commit to prwolfe/Trilinos that referenced this issue Jun 29, 2018
Note that this commit include disabling tests documented
in issue trilinos#2712 and trilinos#2691 and that those should be re-enabled
when those issues are resolved.
@bartlettroscoe
Member Author

Now that PR #2761 is merged, do we need to re-enable the tests disabled in 06bbebd? For instructions on how to do that and test on the test bed machine, see above.

@searhein
Contributor

@bartlettroscoe It would be great if we could re-enable the tests. Thanks for reminding me. Unfortunately, I currently don't have access to the test bed machine. I will discuss with @srajama1 about this.

@bartlettroscoe
Member Author

@bartlettroscoe It would be great if we could re-enable the tests. Thanks for reminding me. Unfortunately, I currently don't have access to the test bed machine. I will discuss with @srajama1 about this.

@searhein, I was mistaken. You don't need access to any test bed machine. These errors occurred in the basic SEMS RHEL6 GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build. Reproducibility instructions are given above. You should be able to reproduce this on any Sandia COE RHEL6 machine that has the SEMS env, on either the SRN or the SON.

@bartlettroscoe bartlettroscoe added the type: bug The primary issue is a bug in Trilinos code or tests label Nov 13, 2018
@jhux2 jhux2 removed the pkg: MueLu label Nov 14, 2018
@bartlettroscoe bartlettroscoe added PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates and removed ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates labels Nov 30, 2018
@github-actions

github-actions bot commented Jun 6, 2021

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 6, 2021
@github-actions

github-actions bot commented Jul 7, 2021

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 7, 2021
@github-actions github-actions bot closed this as completed Jul 7, 2021