
Test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 randomly failing in CI and PR GCC 4.8.4 + OpenMP builds #3263

Closed
bartlettroscoe opened this issue Aug 9, 2018 · 13 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. Framework tasks Framework tasks (used internally by Framework team) MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: ShyLU type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

@trilinos/shylu, @trilinos/framework, @srajama1 (Trilinos Linear Solvers Product Lead)

Expectations

A test should not fail unless a change is made that breaks it. A test should not randomly fail.

Current Behavior

Looking at the four most recent failures of the test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 in this query, the test appears to be randomly failing in the GCC 4.8.4 + OpenMP builds. In the most recent case, this test broke the auto PR GCC 4.8.4 + OpenMP build in PR #3260. Each of the last four failures of this test, dating back to 6/28/2018, shows:

....
[ RUN      ] CrsMatrixBase.matrixmarket
unknown file: Failure
C++ exception with description "View bounds error of view ap ( 13 < 13 )
Traceback functionality not available
" thrown in the test body.
[  FAILED  ] CrsMatrixBase.matrixmarket (23 ms)
...
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CrsMatrixBase.matrixmarket

 1 FAILED TEST

Motivation and Context

Definition of Done

The test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 is fixed so that it does not randomly fail, or it is removed from CI and auto PR testing.

Possible Solution

Fix it so that it does not randomly fail or remove it from CI and auto PR testing.
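
For reference, below is a sketch of how the test could be skipped locally while a fix is pending. The ctest -E regex exclusion is standard CTest; the per-test _DISABLE cache variable is assumed to be provided by TriBITS and would need to be confirmed for this build:

# Exclude the test from a local ctest run (regex exclusion).
ctest -E 'ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1'

# Assuming the TriBITS per-test disable option applies here, the test can
# also be turned off at configure time:
cmake ... -DShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1_DISABLE=ON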

Steps to Reproduce

See https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors.

Your Environment

Standard SEMS GCC 4.8.4 auto PR build env (see above).

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests Framework tasks Framework tasks (used internally by Framework team) pkg: ShyLU labels Aug 9, 2018
@bartlettroscoe
Member Author

NOTE: Getting #3133 completed (which includes merging PR #3258) would make fewer PR builds vulnerable to this randomly failing test. If #3133 had been completed before I posted my PR #3260, that PR would not have been hit by this failure.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Aug 9, 2018

@bartlettroscoe Oh, my mistake, I was checking the wrong branch. Checking the develop branch again, the last commits were made in May. Still, I am wondering what triggered these failures, since they have only been seen since June. Any clue?

commit 162645a2aebd7c1b851386e64adf23cd14f17a5e
Author: Mauro Perego <[email protected]>
Date:   Thu May 17 07:25:50 2018 -0600

    ShyLU: fix issued with Kokkos finalization in test ShyLU_NodeTacho_Tacho_TestUtil_MPI_1

@bartlettroscoe
Member Author

@kyungjoo-kim, I am not sure of the trigger. But since this looks like a randomly failing test that does not fail all that frequently, the change that caused this behavior may have been pushed weeks or more before the first failure.

I am just very unlucky when it comes to random Trilinos PR failures :-(

@kyungjoo-kim
Contributor

@bartlettroscoe I followed the reproducing procedure on my workstation. First, somehow the SEMS environment does not work on my workstation, and mpiexec does not accept the --bind-to argument. Am I doing something wrong? Judging from the mpiexec location, it seems that I set things up correctly.

test 1
    Start 1: ShyLU_NodeTacho_Tacho_TestUtil_MPI_1

1: Test command: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.6.5/bin/mpiexec "--bind-to" "none" "-np" "1" "/ascldap/users/kyukim/Work/lib/trilinos/build/tacho/test/packages/shylu/shylu_node/tacho/unit-test/Tacho_TestUtil.exe" "PrintItAll"
1: Test timeout computed to be: 9.99988e+06
1: --------------------------------------------------------------------------
1: mpiexec was unable to launch the specified application as it could not find an executable:
1: 
1: Executable: --bind-to
1: Node: bread.sandia.gov
1: 
1: while attempting to start process rank 0.
1: --------------------------------------------------------------------------
1/5 Test #1: ShyLU_NodeTacho_Tacho_TestUtil_MPI_1 ..............***Failed    0.11 sec

So, I just removed the --bind-to argument and ran the code again.

[kyukim @bread] unit-test > /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.6.5/bin/mpiexec -np 1 ./Tacho_TestSerial_double.exe 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
     DeviceSpace::         HostSpace::  Kokkos::OpenMP thread_pool_topology[ 1 x 2 x 1 ]
[==========] Running 58 tests from 7 test cases.
[----------] Global test environment set-up.
[----------] 2 tests from CrsMatrixBase
[ RUN      ] CrsMatrixBase.constructor
[       OK ] CrsMatrixBase.constructor (0 ms)
[ RUN      ] CrsMatrixBase.matrixmarket
[       OK ] CrsMatrixBase.matrixmarket (1 ms)
[----------] 2 tests from CrsMatrixBase (1 ms total)

Luckily (or unluckily) the test passed for me. Hmm, this is very weird. How can a test fail randomly? Should we just disable the problematic test so that it does not bother others in PR testing? If I cannot reproduce the problem, it is not possible for me to fix it.

@bartlettroscoe
Member Author

First, somehow the SEMS environment does not work on my workstation, and mpiexec does not accept the --bind-to argument.

@kyungjoo-kim, you will need to ask @trilinos/framework about that. They control those scripts and that process.

How can a test fail randomly?

Likely there is a race condition or something. Any idea about the error:

C++ exception with description "View bounds error of view ap ( 13 < 13 )

?

Since when is 13 < 13?
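
For context, that message is Kokkos' debug bounds check reporting the failed assertion index < extent: the view ap has extent 13 and the access uses index 13. A minimal sketch (not the Tacho code; assumes a Kokkos build with debug bounds checking enabled) that produces the same kind of error:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Row-pointer-style array of extent 13 (valid indices 0..12),
    // e.g. the ap array of a CRS matrix with 12 rows.
    Kokkos::View<int*> ap("ap", 13);
    // Off-by-one access: with bounds checking enabled Kokkos reports
    // "View bounds error of view ap ( 13 < 13 )" and aborts or throws,
    // depending on how the error handler is configured.
    ap(13) = 0;
  }
  Kokkos::finalize();
  return 0;
}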

Should we just disable the problematic test so that it does not bother others in PR testing? If I cannot reproduce the problem, it is not possible for me to fix it.

I guess that is up to @srajama1 as the Linear Solver Product Lead. This test fails infrequently enough that it might not justify disabling it just yet. I think it is worth looking into what could cause that strange error message.

@kyungjoo-kim
Contributor

@bartlettroscoe This is not a race condition. It happens in the Serial execution space, where everything should execute deterministically.

@bartlettroscoe
Member Author

This is not a race condition. It happens in the Serial execution space, where everything should execute deterministically.

That is interesting because we are only seeing this test randomly failing in an OpenMP build.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Aug 9, 2018

Ah, this might be more of a Kokkos-related issue then. According to your observations:

  • this random failure is not detected in a Serial-only build
  • when OpenMP is also enabled in the build, the failure is detected even in the Serial execution space.

Is that right?

Do we have a Kokkos snapshot between May (the last commit in Tacho; those commits were probably also made for the Kokkos integration test) and late June (when the error was first detected)? If so, the culprit might be in Kokkos.

Anyway, this is a random test failure, and I still don't understand why it happens.

@srajama1
Contributor

srajama1 commented Aug 9, 2018

@kyungjoo-kim: You can find the Kokkos snapshots in the git log by looking up version number 2.x ... They all use the same template for the snapshot commit message.

@bartlettroscoe
Member Author

$ git log --grep=Snapshot -- packages/kokkos

@kyungjoo-kim
Contributor

We have a new Kokkos snapshot from 24 May. It looks like we might have a very tricky bug in Kokkos.

commit 626e32a79fd143d32afba9f5b50f02643fd82d3a
Author: Nathan Ellingwood <[email protected]>
Date:   Thu May 24 23:30:13 2018 -0600

    Snapshot of kokkos.git from commit d3a941925cbfb71785d8ea68259123ed52d3f9da

I still cannot say whether this is a Kokkos problem or a Tacho problem. Since I cannot reproduce the error and cannot see the entire call stack, I cannot debug this. Would anyone who can reproduce (or encounters) this error report the call stack from GDB?
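
For anyone who can reproduce it, one way to capture that call stack is sketched below (the directory and binary name follow the runs above; the GTest filter is just the failing test's name):

cd <build-dir>/packages/shylu/shylu_node/tacho/unit-test
gdb --args ./Tacho_TestSerial_double.exe --gtest_filter=CrsMatrixBase.matrixmarket
(gdb) catch throw       # break where the C++ exception is thrown
(gdb) run
(gdb) backtrace         # this is the call stack to report here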

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Aug 14, 2021
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Sep 15, 2021