
Test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 randomly failing in CI and PR GCC 4.8.4 + OpenMP builds #3263

Closed
bartlettroscoe opened this issue Aug 9, 2018 · 13 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. Framework tasks Framework tasks (used internally by Framework team) MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: ShyLU type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

@trilinos/shylu, @trilinos/framework, @srajama1 (Trilinos Linear Solvers Product Lead)

Expectations

A test should not fail unless a change is made that breaks it. A test should not randomly fail.

Current Behavior

Looking at the four most recent failures of the test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 in this query, the test appears to be randomly failing in the GCC 4.8.4 + OpenMP builds. In the most recent case, this test broke the auto PR GCC 4.8.4 + OpenMP build in PR #3260. Each of the last four failures of this test, dating back to 6/28/2018, shows:

....
[ RUN      ] CrsMatrixBase.matrixmarket
unknown file: Failure
C++ exception with description "View bounds error of view ap ( 13 < 13 )
Traceback functionality not available
" thrown in the test body.
[  FAILED  ] CrsMatrixBase.matrixmarket (23 ms)
...
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CrsMatrixBase.matrixmarket

 1 FAILED TEST

Motivation and Context

Definition of Done

The test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 is fixed so that it does not randomly fail, or it is removed from CI and auto PR testing.

Possible Solution

Fix it so that it does not randomly fail or remove it from CI and auto PR testing.
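
For reference, below is a sketch of how the test could be skipped locally while a fix is pending. The ctest -E regex exclusion is standard CTest; the per-test _DISABLE cache variable is assumed to be provided by TriBITS and would need to be confirmed for this build:

# Exclude the test from a local ctest run (regex exclusion).
ctest -E 'ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1'

# Assuming the TriBITS per-test disable option applies here, the test can
# also be turned off at configure time:
cmake ... -DShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1_DISABLE=ON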

Steps to Reproduce

See https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors.

Your Environment

Standard SEMS GCC 4.8.4 auto PR build env (see above).

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests Framework tasks Framework tasks (used internally by Framework team) pkg: ShyLU labels Aug 9, 2018
@bartlettroscoe
Member Author

NOTE: Getting #3133 completed (which includes merging PR #3258) would make fewer PR builds vulnerable to this randomly failing test. If #3133 had been completed before I posted my PR #3260, that PR would not have been hit by this failure.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Aug 9, 2018

@bartlettroscoe Oh, my mistake, I was checking the wrong branch. Checking the develop branch again, the last commits were made in May. Still, I am wondering what triggered these failures, since they have only been seen since June. Any clue?

commit 162645a2aebd7c1b851386e64adf23cd14f17a5e
Author: Mauro Perego <[email protected]>
Date:   Thu May 17 07:25:50 2018 -0600

    ShyLU: fix issued with Kokkos finalization in test ShyLU_NodeTacho_Tacho_TestUtil_MPI_1

@bartlettroscoe
Member Author

@kyungjoo-kim, I am not sure of the trigger. But since this looks like a randomly failing test that does not fail all that frequently, the change that caused this behavior may have been pushed weeks or more before the first failure.

I am just very unlucky when it comes to random Trilinos PR failures :-(

@kyungjoo-kim
Contributor

@bartlettroscoe I followed the reproducing procedure on my workstation. First, somehow the SEMS environment does not work on my workstation, and mpiexec does not accept the --bind-to argument. Am I doing something wrong? Judging from the mpiexec location, it seems that I set things up correctly.

test 1
    Start 1: ShyLU_NodeTacho_Tacho_TestUtil_MPI_1

1: Test command: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.6.5/bin/mpiexec "--bind-to" "none" "-np" "1" "/ascldap/users/kyukim/Work/lib/trilinos/build/tacho/test/packages/shylu/shylu_node/tacho/unit-test/Tacho_TestUtil.exe" "PrintItAll"
1: Test timeout computed to be: 9.99988e+06
1: --------------------------------------------------------------------------
1: mpiexec was unable to launch the specified application as it could not find an executable:
1: 
1: Executable: --bind-to
1: Node: bread.sandia.gov
1: 
1: while attempting to start process rank 0.
1: --------------------------------------------------------------------------
1/5 Test #1: ShyLU_NodeTacho_Tacho_TestUtil_MPI_1 ..............***Failed    0.11 sec

So, I just removed the --bind-to argument and ran the code again.

[kyukim @bread] unit-test > /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.6.5/bin/mpiexec -np 1 ./Tacho_TestSerial_double.exe 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
     DeviceSpace::         HostSpace::  Kokkos::OpenMP thread_pool_topology[ 1 x 2 x 1 ]
[==========] Running 58 tests from 7 test cases.
[----------] Global test environment set-up.
[----------] 2 tests from CrsMatrixBase
[ RUN      ] CrsMatrixBase.constructor
[       OK ] CrsMatrixBase.constructor (0 ms)
[ RUN      ] CrsMatrixBase.matrixmarket
[       OK ] CrsMatrixBase.matrixmarket (1 ms)
[----------] 2 tests from CrsMatrixBase (1 ms total)

Luckily (or unluckily) the test passed for me. Hmm, this is very weird. How can a test fail randomly? Should we just disable the problematic test so that it does not bother others in PR testing? If I cannot reproduce the problem, it is not possible for me to fix it.

@bartlettroscoe
Member Author

First, somehow the SEMS environment does not work on my workstation, and mpiexec does not accept the --bind-to argument.

@kyungjoo-kim, you will need to ask @trilinos/framework about that. They control those scripts and that process.

How can a test fail randomly?

Likely there is a race condition or something. Any idea about the error:

C++ exception with description "View bounds error of view ap ( 13 < 13 )

?

Since when is 13 < 13?
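
For context, that message is Kokkos' debug bounds check reporting the failed assertion index < extent: the view ap has extent 13 and the access uses index 13. A minimal sketch (not the Tacho code; assumes a Kokkos build with debug bounds checking enabled) that produces the same kind of error:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Row-pointer-style array of extent 13 (valid indices 0..12),
    // e.g. the ap array of a CRS matrix with 12 rows.
    Kokkos::View<int*> ap("ap", 13);
    // Off-by-one access: with bounds checking enabled Kokkos reports
    // "View bounds error of view ap ( 13 < 13 )" and aborts or throws,
    // depending on how the error handler is configured.
    ap(13) = 0;
  }
  Kokkos::finalize();
  return 0;
}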

Should we just disable the problematic test so that it does not bother others in PR testing? If I cannot reproduce the problem, it is not possible for me to fix it.

I guess that is up to @srajama1 as the Linear Solver Product Lead. This test fails infrequently enough that it might not justify disabling it just yet. I think it is worth looking into what could cause that strange error message.

@kyungjoo-kim
Contributor

@bartlettroscoe This is not a race condition. It happens in the Serial execution space, where everything should execute deterministically.

@bartlettroscoe
Member Author

This is not a race condition. It happens in the Serial execution space, where everything should execute deterministically.

That is interesting because we are only seeing this test randomly failing in an OpenMP build.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Aug 9, 2018

Ah, this might be more of a Kokkos-related issue then. According to your observations:

  • this random failure is not detected in a Serial-only build
  • when OpenMP is also enabled in the build, the failure is detected even in the Serial execution space.

Is that right?

Do we have a Kokkos snapshot between May (the last commit in Tacho; those commits were probably also made for the Kokkos integration test) and late June (when the error was first detected)? If so, the culprit might be in Kokkos.

Anyway, this is a random test failure, and I still don't understand why it happens.

@srajama1
Contributor

srajama1 commented Aug 9, 2018

@kyungjoo-kim: You can find the Kokkos snapshots in the git log by looking up version number 2.x ... They all use the same template for the snapshot commit message.

@bartlettroscoe
Member Author

$ git log --grep=Snapshot -- packages/kokkos

@kyungjoo-kim
Contributor

We have a new Kokkos snapshot from 24 May. It looks like we might have a very tricky bug in Kokkos.

commit 626e32a79fd143d32afba9f5b50f02643fd82d3a
Author: Nathan Ellingwood <[email protected]>
Date:   Thu May 24 23:30:13 2018 -0600

    Snapshot of kokkos.git from commit d3a941925cbfb71785d8ea68259123ed52d3f9da

I still cannot say whether this is a Kokkos problem or a Tacho problem. Since I cannot reproduce the error and cannot see the entire call stack, I cannot debug this. Would anyone who can reproduce (or encounters) this error report the call stack from GDB?
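
For anyone who can reproduce it, one way to capture that call stack is sketched below (the directory and binary name follow the runs above; the GTest filter is just the failing test's name):

cd <build-dir>/packages/shylu/shylu_node/tacho/unit-test
gdb --args ./Tacho_TestSerial_double.exe --gtest_filter=CrsMatrixBase.matrixmarket
(gdb) catch throw       # break where the C++ exception is thrown
(gdb) run
(gdb) backtrace         # this is the call stack to report here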

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Aug 14, 2021
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Sep 15, 2021