Test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 randomly failing in Trilinos-atdm-white-ride-gnu-debug-openmp build #2633

Closed
bartlettroscoe opened this issue Apr 24, 2018 · 5 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: Anasazi type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Apr 24, 2018

Test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 randomly failing in some ATDM builds

CC: @trilinos/anasazi, @fryeguy52

Next Action Status:

No errors observed in any promoted ATDM Trilinos builds since 4/26/2018.

Description

As shown in the query:

the test Anasazi_Epetra_LOBPCG_solvertest_MPI_4 looks to be randomly failing in the following builds:

  • Trilinos-atdm-hansen-shiller-cuda-opt
  • Trilinos-atdm-hansen-shiller-gnu-debug-serial
  • Trilinos-atdm-hansen-shiller-gnu-opt-serial
  • Trilinos-atdm-white-ride-gnu-debug-openmp

The most recent failure was on 2018-04-23. The only failures since 2018-03-17 were with the build Trilinos-atdm-white-ride-gnu-debug-openmp, run on 'white' and 'ride'. The failures on 2018-03-17 and earlier all look like:

Anasazi in Trilinos 12.13 (Dev)

Testing solver(default,default) with standard eigenproblem...

libgomp: Thread creation failed: Resource temporarily unavailable

libgomp: Thread creation failed: Resource temporarily unavailable

libgomp: Thread creation failed: Resource temporarily unavailable

libgomp: Thread creation failed: Resource temporarily unavailable
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40799,1],2]
  Exit code:    1
--------------------------------------------------------------------------

Since we have not seen any failures like that since 2018-03-17, I think those issues got solved by adjusting the way tests are run on that system (likely the commit 114ca53 and/or the commit d852fa3).
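
For context on the libgomp message above: "Resource temporarily unavailable" is the EAGAIN error from thread creation, which typically appears when the number of threads across all concurrently running processes exceeds a system or per-user limit. The following is only a minimal sketch of that arithmetic, not a description of what the commits above changed; the helper name `fits_thread_limit` and its arguments are hypothetical.

```shell
# Hypothetical helper: check whether ranks * threads-per-rank fits under a limit.
# All three arguments must be plain integers.
fits_thread_limit() {
  ranks=$1
  threads_per_rank=$2
  limit=$3
  total=$((ranks * threads_per_rank))
  if [ "$total" -le "$limit" ]; then
    echo "ok: $total threads <= limit $limit"
    return 0
  fi
  echo "over: $total threads > limit $limit"
  return 1
}

# Example (bash): 4 MPI ranks with 8 OpenMP threads each, checked against the
# per-user process limit. Note that `ulimit -u` can print "unlimited", which
# this helper does not handle, and that the limit is shared with every other
# test running at the same time under the same user.
# fits_thread_limit 4 8 "$(ulimit -u)"
```

Passing the thread count explicitly matters because an unset OMP_NUM_THREADS usually means one thread per available core rather than a single thread.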

The three more recent failures for the build Trilinos-atdm-white-ride-gnu-debug-openmp, which occurred on 2018-04-03, 2018-04-17, and 2018-04-23, all show the same output:

Anasazi in Trilinos 12.13 (Dev)

Testing solver(default,default) with standard eigenproblem...
Testing solver(default,default) with generalized eigenproblem...
Testing solver(nev,false) with standard eigenproblem...
Testing solver(nev,true) with standard eigenproblem...
Testing solver(nev,false) with generalized eigenproblem...
Testing solver(nev,true) with generalized eigenproblem...
Testing solver(2*nev,false) with standard eigenproblem...
Testing solver(2*nev,true) with standard eigenproblem...
[ride13:114533] *** Process received signal ***
[ride13:114533] Signal: Segmentation fault (11)
[ride13:114533] Signal code: Address not mapped (1)
[ride13:114533] Failing at address: 0x10036020010
[ride13:114533] [ 0] [0x100000050478]
[ride13:114533] [ 1] [0x3ff0000000000000]
[ride13:114533] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 114533 on node ride13 exited on signal 11 (Segmentation fault).

Therefore, I think the remaining problem with this test is the random segfault in this build.

Steps to reproduce

One may be able to reproduce this failure on 'white' (SON) or 'ride' (SRN) as described in:

The exact commands should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-debug-openmp

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Anasazi=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16  ctest -R Anasazi_Epetra_LOBPCG_solvertest_MPI_4

NOTE: Since this is not a CUDA build, you should not need to run on a compute node, but you may need to in order to reproduce the failure. Also, since this test seems to fail randomly, one may not be able to reproduce the failure at all.
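
Because the failure is intermittent (three times in roughly a month of nightly runs), a single run is unlikely to hit it. One option is to rerun the test many times and stop at the first failure. The following is a minimal sketch; the function name `run_until_fail` and the attempt count are assumptions for illustration, not part of the ATDM scripts.

```shell
# Hypothetical helper: rerun a command until it fails or the attempt limit is reached.
# Usage: run_until_fail <max_attempts> <command> [args...]
run_until_fail() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Stop at the first non-zero exit code so the failing output stays on screen.
    if ! "$@"; then
      echo "failed on attempt $attempt of $max_attempts"
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "passed all $max_attempts attempts"
  return 0
}

# Example (run inside the build directory, on the node where the test should execute):
# run_until_fail 50 ctest -R Anasazi_Epetra_LOBPCG_solvertest_MPI_4 --output-on-failure
```

Recent versions of CTest also provide a built-in `--repeat-until-fail <n>` option that serves the same purpose, if the installed CMake supports it.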

@bartlettroscoe bartlettroscoe added pkg: Anasazi client: ATDM Any issue primarily impacting the ATDM project labels Apr 24, 2018
@bartlettroscoe
Member Author

@hkthorn, any idea what might be causing this one test to segfault randomly for this build on white/ride?

We can watch this and wait to see if any more failures occur. If they do, we can disable this test for this build and let it be fixed offline.

@hkthorn
Contributor

hkthorn commented Apr 25, 2018

@bartlettroscoe I agree that the issue that was causing the test failures for the ModalSolvers and OrthoManager is not affecting this test. There is something else going on. I need to find a second to check this out on white/ride. If you feel the need to disable it temporarily to make sure your testing for ATDM is clean, I understand. I will let you know when I have figured out the issue.

@bartlettroscoe
Member Author

If you feel the need to disable it temporarily to make sure your testing for ATDM is clean, I understand. I will let you know when I have figured out the issue.

@hkthorn, this test has only failed three times in the last month, and we are not yet at the stage where we are running automated processes requiring 100% passing tests. Therefore, I think it is okay to leave this test enabled for this build and wait to see whether it can be fixed and whether it fails again.

Let us know when you figure this out.

@bartlettroscoe
Member Author

As shown in this query, this test has not failed a single time in any of the promoted ATDM builds since 4/26/2018.

I don't know why this test failed, but it has not failed in 1.5 months, so I think it is okay to close this issue.

@bartlettroscoe
Member Author

Closing the issue for real this time.

@bartlettroscoe bartlettroscoe added the PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area label Nov 30, 2018
@bartlettroscoe bartlettroscoe added the type: bug The primary issue is a bug in Trilinos code or tests label Dec 7, 2018