Three ShyLU_DDFROSch_test_frosch_XXX tests failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2691
Comments
@trilinos/shylu or @trilinos/xpetra developers, any idea what is causing the exception:
which is causing these tests to fail? |
This is strange as the ordinals configured are int and long long. @trilinos/xpetra |
The test uses int, int. The error seems to be a catch-all. This looks like an Xpetra configuration issue to me. |
@srajama1 said:
@trilinos/xpetra developers, What does the error:
mean? Is that saying that if doing a Kokkos Serial build, then
Do we need to turn on the "Refactor" option (see #2674) to get this to work? |
@trilinos/muelu |
The error
means that you can build Epetra with only one of the following configuration combinations at a time: GO=int with SerialNode (default). If you enable more than one valid combination, you should get a default one (e.g. if you enable both the OpenMP and Serial nodes, I think you get an OpenMP-enabled Epetra). This also means that you cannot enable Epetra unless at least one of the above combinations is enabled. For example, you cannot build Epetra with GO=int and only the Cuda node enabled (since Epetra does not work on Cuda); you would need to enable the OpenMP or Serial node in addition to Cuda. Does that make sense? How could we improve the above error message? |
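The rule described in the comment above can be sketched as a small check. This is only an illustrative model in Python, not Trilinos or Xpetra code: the function name and the `ASSUMED_EPETRA_COMBOS` set are hypothetical, and the set contains only the two combinations the comment itself implies (GO=int with the Serial or OpenMP node). Any other combinations from the original list are not shown in the thread and are left out.

```python
from itertools import product

# Assumption: only the combinations implied by the comment above.
# The original list of valid combinations is not fully shown in the thread.
ASSUMED_EPETRA_COMBOS = {("int", "Serial"), ("int", "OpenMP")}


def epetra_can_be_enabled(global_ordinals, nodes):
    """Return True if at least one enabled (GO, node) pair is Epetra-capable."""
    enabled_pairs = set(product(global_ordinals, nodes))
    return bool(enabled_pairs & ASSUMED_EPETRA_COMBOS)
```

Under these assumptions, `epetra_can_be_enabled({"int"}, {"Cuda"})` is false, while adding the OpenMP or Serial node next to Cuda makes it true, which matches the Cuda example in the comment.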
@trilinos/xpetra |
@tawiesn, thanks for the explanation. Does that mean that these ShyLU_DD tests should be disabled for OpenMP builds, or how else would we fix these ShyLU_DD tests so they would run?
It would be good to report the actual types being used. You should know that at compile time. |
@bartlettroscoe : I would rather not disable the test, but find the root cause and solve it. |
@srajama1, I agree. But who has the time to fix this? These three tests (and one Teko test that I will post an issue for next) are the only failing tests blocking us from getting an automated PR test running that actually enables and runs with OpenMP so we can't let this sit too long. See #2317 and #2462. |
If it is automated PR testing, it is even more important that we don't disable these tests. We have gone without PR testing for so long that waiting a few days for this shouldn't be a problem. Let me know if that is the case. Why do we enable Epetra OpenMP in the first place? This is really an experimental feature that we tried at some point for different use cases. Can we disable OpenMP for Epetra? @tawiesn: Is the fix that, if Trilinos OpenMP is enabled, we should use Xpetra with Epetra matrices and OpenMP from ShyLU? Do I understand that right? |
@srajama1, these tests would get run in the Intel 17.0.1 build that uses a Serial Kokkos node. Right now we have zero threaded testing in PR testing. |
@srajama1 I just looked at the source code of one of the ShyLU Frosch tests and found things like
That is, the test only works/compiles if both GO=int and NO=SerialNode are enabled in the Trilinos configuration. I guess the easiest fix would be to add appropriate guards in the CMakeLists.txt file for the test, so that the test is enabled only if both GO=int and NO=Serial are enabled in the Trilinos configuration. We did something similar for some very old MueLu tests. The goal should always be to write code with Xpetra that is independent of the concrete underlying linear algebra package. However, that is not possible if you start using Epetra_Comm etc. in your tests... Anyway: without the guards, the test will not compile in any Trilinos test configuration where either GO=int or NO=Serial is missing... |
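The guard proposed in the comment above amounts to a simple conjunction. The sketch below models that decision in Python rather than in CMake; the function name and the string labels for ordinals and nodes are hypothetical and do not correspond to actual Trilinos CMake variables.

```python
def frosch_epetra_test_enabled(global_ordinals, nodes):
    """Enable the test only when both GO=int and the Serial node are configured.

    Both arguments are collections of illustrative string labels,
    not real Trilinos option names.
    """
    return "int" in global_ordinals and "Serial" in nodes
```

With this rule, a configuration that has GO=int and long long but only the OpenMP node would leave the test disabled, so the guard avoids the compile failure but does not make the test run in such a build.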
@bartlettroscoe : Understood. Having these tests run even on threaded mode will be useful. |
@tawiesn said:
That is basically disabling these three tests at configure time when OpenMP is used. @srajama1 said:
I agree. Does a ShyLU developer have time to get these tests running for OpenMP as well? Just go with the default Kokkos node which should be known at configure time and compile time. |
@bartlettroscoe : We will take care of this. |
@bartlettroscoe @srajama1 I will start working on this issue on Friday and hopefully fix it by the beginning of next week. Tomorrow is a holiday in Germany, so I will not be in the office. |
@searhein Next week is totally fine. I was asking for your input mainly to see how much work it is to add the OpenMP Epetra Node. I can work on this as well, if you are busy. |
FYI: With the failing Teko test now resolved (see #2712), these failing ShyLU_DD tests are now the only failing tests for the new GCC 4.8.4 + OpenMP build as shown today at: |
@bartlettroscoe : thanks ! We will take care of this. |
@bartlettroscoe @ndellingwood I also think that disabling the tests is Ok for now. Since I do not have access to the SNL COE RHEL6 SEMS machines, it is hard for me to reproduce and fix the errors. I hope that @srajama1 and I can figure this out soon. |
These disables will allow this build to be promoted to the CI build and an auto PR build (see trilinos#2462).
@trilinos/shylu, I created PR #2841 that implements these targeted disables. Can someone on the ShyLU team please approve that PR? Also note that five ShyLU_DD tests are not getting enabled because |
@bartlettroscoe I just approved the PR but @ndellingwood is on it too :D |
@bartlettroscoe This issue of tests that require > 4 MPI processes came up recently in Tpetra; see #2564. My approach was to build the executable once, but set up two separate tests, one that requires at most 4 MPI processes and another that requires more. I can see reasons to have tests that need more than 4 MPI processes, but it's important to have some tests that will always run by default, even on laptops etc. |
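The approach described in the comment above (one executable, two registered tests with different process counts) can be illustrated with a small model. This is a hedged sketch in Python, not the Tpetra or TriBITS setup from #2564; the `MpiTest` type, the test names, and the `max_procs` limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpiTest:
    name: str        # hypothetical test name
    executable: str  # the single executable shared by both tests
    num_procs: int   # MPI processes this test requires


def runnable_tests(tests, max_procs):
    """Return the tests whose process count fits within max_procs."""
    return [t for t in tests if t.num_procs <= max_procs]


# Two tests registered against the same (hypothetical) executable.
TESTS = [
    MpiTest("example_small_MPI_4", "example.exe", 4),
    MpiTest("example_large_MPI_8", "example.exe", 8),
]
```

With a limit of 4 processes only the small test is selected, so something always runs by default, while a machine that allows 8 or more processes picks up both.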
@bartlettroscoe I just merged your PR :) |
The build and in more detail at: See the little And these three tests are shown with Status "Missing" at: I will now mark this issue with the labels "Disabled Tests" and "Stalled". This can now be fixed offline. |
@trilinos/shylu To re-enable these failing tests (and then fix them), create a local branch called something like
Then follow the "Steps to Reproduce" instructions above and then one can verify if the tests are fixed. If anyone has a question about this, please let me know. |
Note that this commit includes disabling the tests documented in issues trilinos#2712 and trilinos#2691, and that those tests should be re-enabled when those issues are resolved.
@bartlettroscoe It would be great if we could re-enable the tests. Thanks for reminding me. Unfortunately, I currently don't have access to the test bed machine. I will discuss with @srajama1 about this. |
@searhein, I was mistaken. You don't need access to any test bed machine. These errors occurred in the basic SEMS RHEL6 GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build. Reproducibility instructions are given above. You should be able to reproduce this on any Sandia COE RHEL6 machine that has the SEMS env, either on the SRN or the SON. |
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
CC: @trilinos/shylu, @trilinos/framework , @srajama1
Description
As shown at:
the tests:
ShyLU_DDFROSch_test_frosch_interfacesets_2D_MPI_4
ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_gdsw_MPI_4
ShyLU_DDFROSch_test_frosch_laplacian_epetra_2d_rgdsw_MPI_4
are failing in the new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (as run on the SNL COE RHEL6 machine crf450 and submitted to CDash).
This build is getting cleaned up to provide the GCC 4.8.4 auto PR build described in #2317 and #2462.
These tests all fail by throwing the exception shown below:
This then terminates the test program.
Steps to reproduce
One should be able to reproduce these failing tests on any SNL COE RHEL6 machine that has the SEMS env. For example, on the CEE machine 'ceerws1113', I reproduced this by updating Trilinos and then doing:
This produced the test results:
The output from these failing tests seems to show the same throw and terminate:
Related Issues