Amesos2::SuperLU_DIST 4.3 segfault #124
Comments
Notes: This was seen only with superlu_dist debug > 1, which requires modifying the superlu_dist source. Changed #ifdef(>) to #ifdef debug |
@amklinv I wanted to make sure you see the latest comment on this. Please see the full issue body. |
To clarify Josh's comment, when you try to build superlu_dist with debug > 1, it won't even build. Josh had to modify the superlu_dist code to get it to build at all. I contacted Sherry on Friday to let her know about this issue, but I have not heard back from her yet. I compiled using a slightly different version of gcc 4.9 and openmpi 1.8 with debugging turned off, and the test passed with that configuration. |
@jwillenbring For the benefit of Jim, I will also post what I told @amklinv. There are numerous valgrind-reported errors for this build. I started trying to track them down to see whether they come from how we are freeing the SuperLU structure or from within superlu_dist. As a first step, I turned on debug for both Trilinos and superlu_dist to see where these warnings/errors were occurring. However, superlu_dist 4.3 will not compile in debug unless you change some of the ifdefs yourself (this is not true for 4.2). If you do change these and run it, you will get a random segfault (again, not true of 4.2). |
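For readers unfamiliar with the kind of ifdef change mentioned above, here is a minimal C sketch of how a value-based guard differs from a definedness-based guard. The macro name DEMO_DEBUG_LEVEL and both helper functions are hypothetical and chosen only for illustration; they are not copied from the superlu_dist source, and the actual edits made for this build are not shown in this thread.

```c
/* Hypothetical debug-level macro; a real build would usually pass it
 * on the compiler command line, e.g. -DDEMO_DEBUG_LEVEL=2. */
#ifndef DEMO_DEBUG_LEVEL
#define DEMO_DEBUG_LEVEL 0
#endif

/* Value-based guard: the debug branch is compiled only when the
 * level is greater than 1. */
int value_guard_active(void)
{
#if (DEMO_DEBUG_LEVEL > 1)
    return 1;
#else
    return 0;
#endif
}

/* Definedness-based guard: the debug branch is compiled whenever the
 * macro exists, even if its value is 0. */
int defined_guard_active(void)
{
#ifdef DEMO_DEBUG_LEVEL
    return 1;
#else
    return 0;
#endif
}
```

With the level left at 0, the first function reports the debug branch as inactive and the second reports it as active, so switching between the two guard styles can change which debug code ends up in the build.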
Jim, if it is important for xSDK, can you ask Sherry to check the usage in Amesos2 to confirm everything is OK? We will be happy to help. |
I got a response from Sherry. "A lot of warnings are related to printing format of (long long int). I thought I had those fixed, but apparently still a lot. It will take me some time to clean up the warnings. Meanwhile, you can ignore the warnings, and run the code, see what happens. [I told her when I compiled the code without debugging enabled, I got a whole bunch of warnings.] The complex version with high DEBUGlevel / PRNTlevel are not fully tested. I will fix those errors in next release. [This is referring to the build errors Josh was seeing.]" |
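Warnings of the kind Sherry describes typically appear when a long long int value is passed to a printf-style function with a narrower conversion such as %d. A minimal sketch of the usual remedy, assuming a hypothetical index type idx_t and format macro IDX_FMT (neither name is taken from superlu_dist), is to keep the type and its conversion specifier defined side by side:

```c
#include <stdio.h>

/* Hypothetical 64-bit index type and its matching printf conversion.
 * Defining them together keeps every print statement consistent. */
typedef long long int idx_t;
#define IDX_FMT "%lld"

/* Writes "row <value>" into buf and returns the snprintf result,
 * i.e. the number of characters the full string requires. */
int format_row(char *buf, size_t len, idx_t row)
{
    return snprintf(buf, len, "row " IDX_FMT, row);
}
```

Calling format_row with a value larger than a 32-bit int can hold, such as 5000000000, shows why the specifier matters: with %d the argument would not match the conversion, which is exactly what the compiler warns about.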
In response to Sherry's reply:
|
Is the correct course for this issue to wait for SuperLU_DIST 4.4? We can close the issue and document this error that occurs with debug turned on in SuperLU_DIST. Josh: Why do you say this memory error is in Amesos2? |
I'd like to clarify a point re: the valgrind errors. My valgrind traces show that essentially the same valgrind errors are reported for a build against 4.2 as against 4.3. This is a claim that might be wrong; it would be good if someone else would confirm or contradict that there are meaningful valgrind errors in builds against 4.2. Helpfully, I find that a valgrind trace of the unit test shows essentially the same valgrind messages as a trace of a practical problem, so I believe we can focus on an analysis of just the unit test. |
In regard to @srajama1: I marked it as Amesos2 because we are the ones supporting the integration of SuperLU_DIST with Trilinos. Yes, I planned to close this issue once I had time to investigate all of it and write up some notes to document it. In regard to @ambrad: the valgrind warnings are a different issue. We are currently working with SuperLU_DIST to try to get rid of them. |
@jdbooth: It is hard to separate memory errors from segfaults. It is easier to fix the memory errors first and then see whether we still get the same segfault. @ambrad: If it is reproducible with 4.2, that makes the job easier, since, as you can see from the above comments, the debug options are not working with 4.3. Either Josh or I will try this out. |
Not sure whether this belongs here or in a separate issue, but the SuperLU_Dist test times out for me with this configuration script. Sometimes, it times out during the first test; sometimes the first test completes quickly and it times out during the second. |
@amklinv, that is essentially the behavior I noticed that started this thread. A test in an application's test suite intermittently times out on some platforms (timing out at 1500s when a successful run of this test takes ~30s). Intermittent failures are often associated with uninitialized memory or worse, so I ran valgrind on that ctest. Then I wanted to see if essentially the same valgrind messages would show up in the Trilinos ctests; they do. Etc. |
Josh: Can I ask why this is closed if the same errors occur even with 4.2? |
@ambrad, were you seeing this timeout behavior with 4.2 as well, or is that new to 4.3? |
With these changes, MPI off, and hacking the Kokkos::Experiment::HIP::memory_space typedef to be Kokkos::Experimental::HIPHostPinnedSpace, we get these failures:
124/124 Test #124: TpetraCore_TsqrAdaptor .......................................... Passed 0.55 sec
95% tests passed, 5 tests failed out of 124
Label Time Summary:
Tpetra = 147.76 sec*proc (124 tests)
Total Test time (real) = 147.96 sec
The following tests FAILED:
19 - TpetraCore_idot (Subprocess aborted)
83 - TpetraCore_MatrixMatrix_UnitTests (NUMERICAL)
121 - TpetraCore_RowMatrixTransposer_test (Subprocess aborted)
122 - TpetraCore_RowMatrixTransposer_UnitTests (Failed)
123 - TpetraCore_CrsMatrix_transpose_sortedRows (Failed)
This IS running on the AMD GPU ...
Tested SuperLU_DIST 4.3 (Newest as of December 31 2015).
GCC 4.9 with both Trilinos DEBUG + -O0 and SuperLU_DIST debuglvl = 2
Segfaults when using more than 1 MPI rank.
Memory error most likely on Amesos2 side.
@trilinos/amesos2