Test ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1 randomly failing in CI and PR GCC 4.8.4 + OpenMP builds #3263
Comments
@bartlettroscoe Oh, my mistake: I was checking the wrong branch. Checking the develop branch again, the last commits were made in May. Still, I am wondering what triggers these failures, given that they have only been seen since June. Any clue? |
@kyungjoo-kim, I am not sure of the trigger. But since this looks like a randomly failing test that does not fail all that frequently, the change that caused this behavior may have been pushed weeks or more before the first failure. I am just very unlucky when it comes to random Trilinos PR failures :-( |
@bartlettroscoe I followed the reproduction procedure on my workstation. First, the SEMS environment somehow does not work on my workstation, and it does not accept the --bind-to argument. Am I doing something wrong? Judging by the mpiexec location, it seems that I set it up correctly.
So I just removed the --bind-to argument and ran the test again.
Luckily (or unluckily), the test passed. Hmm, this is very weird. How come the test fails randomly? Should we just disable the problematic test so that it does not bother others in PR testing? If I cannot reproduce the problem, I cannot fix it. |
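Since a single passing run says little about a test that fails only occasionally, one way to try to reproduce it is to run the same command many times and keep the output of any failing run. The following is a minimal sketch, not part of the Trilinos tooling: `run_repeatedly` is a hypothetical helper, and the demo uses the Python interpreter itself as a stand-in for the real test command.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=100, timeout=600):
    """Run `cmd` `runs` times and collect every failing run.

    Returns a list of (run_index, return_code, combined_output) tuples,
    one per run that exited with a nonzero status.
    """
    failures = []
    for i in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stderr interleaved with stdout
            universal_newlines=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            failures.append((i, proc.returncode, proc.stdout))
    return failures


# Stand-in command (assumption): replace with the actual mpiexec/test
# invocation taken from the reproduction instructions.
demo_cmd = [sys.executable, "-c", "pass"]
failed = run_repeatedly(demo_cmd, runs=5)
print("{} of 5 runs failed".format(len(failed)))
for index, code, output in failed:
    print("--- run {} exited with {} ---".format(index, code))
    print(output)
```

If the CMake version in use supports it, `ctest -R <test-name-regex> --repeat-until-fail <n>` does a similar job without a wrapper script.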
@kyungjoo-kim, you will need to ask @trilinos/framework about that. They control those scripts and that process.
Likely there is a race condition or something. Any idea about the error:
? Since when is
I guess that is up to @srajama1 as the Linear Solver Product Lead. This test fails infrequently enough that it might not justify disabling it just yet. I think it is worth looking into what could cause that strange error message. |
@bartlettroscoe This is not a race condition. This happens in the Serial execution space, where everything should execute deterministically. |
That is interesting because we are only seeing this test randomly failing in an OpenMP build. |
Ah, then this might be more of a Kokkos-related issue. According to your observation,
Do we have a Kokkos snapshot between May (the last commits in Tacho, which were probably also for the Kokkos integration test) and late June (when the error was first detected)? If so, our culprit might be in Kokkos. Anyway, this is a random test failure, and I still don't understand why it happens. |
@kyungjoo-kim : You can find the Kokkos snapshots in the git log by looking up the version number 2.x ... They all use the same template for the snapshot message. |
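A search along these lines can be scripted. The sketch below only assembles a `git log` invocation restricted to a date range and a path; the `packages/kokkos` path, the case-insensitive `snapshot` pattern, and the example dates are assumptions made for illustration, and the exact snapshot message template should be checked against the repository history.

```python
import subprocess


def snapshot_log_command(since, until, path="packages/kokkos"):
    """Build a `git log` command listing snapshot-like commits under `path`.

    `path` and the grep pattern are assumptions, not a confirmed template.
    """
    return [
        "git", "log",
        "--date=short",
        "--format=%h %ad %s",      # short hash, commit date, subject line
        "--since=" + since,
        "--until=" + until,
        "-i", "--grep=snapshot",   # case-insensitive match on the message
        "--", path,
    ]


def list_snapshots(repo_dir, since, until):
    """Run the command inside `repo_dir` and return the matching log lines."""
    out = subprocess.check_output(
        snapshot_log_command(since, until),
        cwd=repo_dir,
        universal_newlines=True,
    )
    return out.splitlines()


# Example window (assumption): from the last Tacho commits in May 2018
# to the first reported failure at the end of June 2018.
print(" ".join(snapshot_log_command("2018-05-01", "2018-06-30")))
```

Calling `list_snapshots("/path/to/Trilinos", "2018-05-01", "2018-06-30")` in a local clone would return one line per matching commit.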
We got a new Kokkos on 24 May. It looks like we might have a very tricky bug in Kokkos.
I still cannot say whether this is a Kokkos problem or a Tacho problem. Since I cannot reproduce the error and cannot see the entire call stack, I cannot debug this. Could anyone who can reproduce (or who encounters) this error report the call stack from GDB? |
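For anyone who does hit the failure, a non-interactive GDB run can capture the stack without a manual session. This is a minimal sketch under assumptions: `gdb_backtrace_command` is a hypothetical helper, the executable path is a placeholder, and the approach only yields a useful stack if the test stops on a signal (such as a segfault or abort). If the failure surfaces as a C++ exception, adding `-ex "catch throw"` before `-ex run` may be needed.

```python
import subprocess


def gdb_backtrace_command(test_cmd):
    """Wrap `test_cmd` so GDB runs it in batch mode and dumps all stacks."""
    return [
        "gdb", "-batch",
        "-ex", "run",                  # start the program under GDB
        "-ex", "thread apply all bt",  # once it stops, print every thread's stack
        "--args",
    ] + list(test_cmd)


def capture_backtrace(test_cmd, timeout=600):
    """Run the wrapped command and return GDB's combined output as text."""
    proc = subprocess.run(
        gdb_backtrace_command(test_cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.stdout


# Placeholder path (assumption): substitute the real test executable.
print(" ".join(gdb_backtrace_command(["./path/to/test_executable"])))
```

The output of `capture_backtrace` can be pasted into the issue as-is. Combining it with a repeat loop keeps only the runs where GDB actually reported a stop.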
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
@trilinos/shylu, @trilinos/framework, @srajama1 (Trilinos Linear Solvers Product Lead)
Expectations
A test should not fail unless a change is made that breaks it. A test should not randomly fail.
Current Behavior
Looking at the four most recent failures of the test
ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1
in this query, the test appears to be randomly failing in the GCC 4.8.4 OpenMPI builds. In the most recent case, this test broke the auto PR GCC 4.8.4 + OpenMP build in PR #3260. Each of the last four failures of this test, dating back to 6/28/2018, shows:
Motivation and Context
Definition of Done
The test
ShyLU_NodeTacho_Tacho_TestSerial_double_MPI_1
is fixed so that it does not randomly fail, or is removed from CI and auto PR testing.
Possible Solution
Fix it so that it does not randomly fail or remove it from CI and auto PR testing.
Steps to Reproduce
See https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors.
Your Environment
Standard SEMS GCC 4.8.4 auto PR build env (see above).