Kokkos: segfault in Kokkos::Impl::CudaInternal::finalize() #9645

Closed
shardest opened this issue Sep 1, 2021 · 3 comments
Labels
MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. type: bug The primary issue is a bug in Trilinos code or tests


shardest commented Sep 1, 2021

Bug Report

@trilinos/kokkos

Description

This happens in adagio with the GDSW solver running on the device via Kokkos. The program appears to execute correctly, then segfaults while exiting (see the backtrace from the core file):

(gdb) bt
#0 0x0000000011b961e0 in Kokkos::Impl::CudaInternal::finalize() ()
#1 0x0000000011b963c8 in Kokkos::Impl::CudaInternal::finalize() [clone .part.52] ()
#2 0x0000000011b7f415 in Kokkos::Impl::ExecSpaceManager::finalize_spaces(bool) ()
#3 0x0000000011b8178e in Kokkos::Impl::(anonymous namespace)::finalize_internal(bool) ()
#4 0x00007feabf294cd9 in __run_exit_handlers () from /lib64/libc.so.6
#5 0x00007feabf294d27 in exit () from /lib64/libc.so.6
#6 0x00007feabf27d54c in __libc_start_main () from /lib64/libc.so.6
#7 0x0000000005182e1d in _start ()

Steps to Reproduce

This is quite difficult to reproduce: it fails only on one particular test, with a particular number of MPI ranks, with a release or release-symbols build, and only after the recent Sierra-Trilinos integration. We have tried various tools and still cannot get line numbers. We are adding print statements, but progress is extremely slow because of the way Trilinos is now built in Sierra via spack. At this point, we are looking for ideas and general debugging help.

The only Trilinos commit touching the files in the above trace that falls (temporally) between the versions integrated into Sierra is 8482d7. The test likely passed prior to that commit, but building and running it to confirm that hypothesis is quite difficult. It is also possible that we (adagio, or more generally) are using Kokkos incorrectly, but even in that case it seems there is still something to be fixed.

@shardest shardest added the type: bug The primary issue is a bug in Trilinos code or tests label Sep 1, 2021

crtrott commented Sep 2, 2021

This looks like finalization happening after exit from main, i.e. triggered by a static object or something similar. That is not generally safe, because the order in which static objects are destroyed can depend on things like link order. A change in the build system, the addition of a new object file, etc. may therefore suddenly break this.
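
For illustration, here is a minimal sketch of the alternative: tying finalization to a scope inside main, so that it runs before main returns and before any exit handlers or static destructors. Kokkos provides `Kokkos::ScopeGuard` (or an explicit `Kokkos::initialize` / `Kokkos::finalize` pair) for this. To keep the example self-contained, it uses a hypothetical `MockRuntime` in place of Kokkos; `MockRuntime`, `RuntimeGuard`, and `run_application` are made-up names, not part of Kokkos or Trilinos, and this is not necessarily how adagio should be restructured.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a device runtime (e.g. the Kokkos CUDA backend).
// It only records events so that their order can be inspected.
struct MockRuntime {
    bool initialized = false;
    bool finalized = false;
    std::vector<std::string> log;

    void initialize() {
        initialized = true;
        log.push_back("initialize");
    }

    void finalize() {
        finalized = true;
        log.push_back("finalize");
    }
};

// RAII guard: initializes the runtime on construction and finalizes it on
// destruction, so finalization is tied to a scope, not to program exit.
class RuntimeGuard {
public:
    explicit RuntimeGuard(MockRuntime& rt) : rt_(rt) { rt_.initialize(); }
    ~RuntimeGuard() { rt_.finalize(); }

    RuntimeGuard(const RuntimeGuard&) = delete;
    RuntimeGuard& operator=(const RuntimeGuard&) = delete;

private:
    MockRuntime& rt_;
};

// Models the body of main(): all runtime work happens inside the guard's scope.
std::vector<std::string> run_application(MockRuntime& rt) {
    {
        RuntimeGuard guard(rt);
        rt.log.push_back("solve");  // placeholder for the actual device work
    }  // guard is destroyed here, so finalize runs before "main" returns
    rt.log.push_back("return from main");
    return rt.log;
}
```

With this structure the recorded order is initialize, solve, finalize, return from main. In the backtrace above, by contrast, the finalize frames sit beneath `__run_exit_handlers`, i.e. finalization ran only after main had already returned.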


shardest commented Sep 7, 2021

@crtrott - would you like to reference the issue you created and close this? We still have another failing test, but the mechanism appears to be different.

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Sep 10, 2022