Kokkos: segfault in Kokkos::Impl::CudaInternal::finalize() #9645
Labels: MARKED_FOR_CLOSURE, type: bug
Bug Report
@trilinos/kokkos
Description
This happens in adagio with the GDSW solver running on the device via Kokkos. The program appears to execute correctly, then segfaults while exiting (see the backtrace from the core file):
(gdb) bt
#0 0x0000000011b961e0 in Kokkos::Impl::CudaInternal::finalize() ()
#1 0x0000000011b963c8 in Kokkos::Impl::CudaInternal::finalize() [clone .part.52] ()
#2 0x0000000011b7f415 in Kokkos::Impl::ExecSpaceManager::finalize_spaces(bool) ()
#3 0x0000000011b8178e in Kokkos::Impl::(anonymous namespace)::finalize_internal(bool) ()
#4 0x00007feabf294cd9 in __run_exit_handlers () from /lib64/libc.so.6
#5 0x00007feabf294d27 in exit () from /lib64/libc.so.6
#6 0x00007feabf27d54c in __libc_start_main () from /lib64/libc.so.6
#7 0x0000000005182e1d in _start ()
Steps to Reproduce
This is quite difficult to reproduce. It fails only on one particular test, with one particular number of MPI ranks, and only with a release or release-symbols build, after a recent Sierra-Trilinos integration. We have tried various tools and still cannot get line numbers. We are adding print statements, but progress is extremely slow because of the way Trilinos is now built in Sierra via Spack. At this point, we are looking for ideas and general debugging help.

The only Trilinos commit that touches the files in the above trace and falls between the two versions integrated into Sierra is 8482d7. The test likely passed prior to that commit, but building and running it to confirm that hypothesis is quite difficult. It is also possible that we (in adagio, or more generally) are doing something wrong in our usage of Kokkos, but even if that is the case, it seems there is still something to be fixed.
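One thing that may be worth ruling out on the usage side: frames #4 and #5 show finalization being reached through exit() and the exit handlers, not through a direct call made before main returns. Kokkos provides Kokkos::finalize() and Kokkos::ScopeGuard for finalizing explicitly. The sketch below does not use Kokkos at all; FakeRuntime, RuntimeGuard, and run_application are hypothetical names that only model the ordering, assuming finalization is idempotent so that a later exit-time hook has nothing left to do. It is an illustration of the pattern, not a diagnosis of this crash.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a device runtime (for example Kokkos on CUDA).
// None of these names are Kokkos APIs.
struct FakeRuntime {
  bool initialized = false;
  int finalize_calls = 0;
  std::vector<std::string> log;

  void initialize() {
    initialized = true;
    log.push_back("initialize");
  }

  // Idempotent: a second call (for example from an exit handler) is a no-op.
  void finalize() {
    if (!initialized) return;
    initialized = false;
    ++finalize_calls;
    log.push_back("finalize");
  }
};

// RAII guard: initializes on construction and finalizes when the enclosing
// scope ends, which is before main returns and before exit() runs handlers.
class RuntimeGuard {
 public:
  explicit RuntimeGuard(FakeRuntime& rt) : rt_(rt) { rt_.initialize(); }
  ~RuntimeGuard() { rt_.finalize(); }
  RuntimeGuard(const RuntimeGuard&) = delete;
  RuntimeGuard& operator=(const RuntimeGuard&) = delete;

 private:
  FakeRuntime& rt_;
};

// Models the body of main(): all work happens inside the guard's scope.
// Returns the event log so the ordering can be inspected.
std::vector<std::string> run_application(FakeRuntime& rt) {
  {
    RuntimeGuard guard(rt);
    rt.log.push_back("solve");
  }  // guard destroyed here: finalize runs while all state is still alive
  rt.log.push_back("return from main");
  // Models a late exit-time hook; it finds nothing left to finalize.
  rt.finalize();
  return rt.log;
}
```

If the application already finalizes explicitly, the question becomes why finalization is still being entered from the exit handlers in the trace above.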