When GPU cannot initialize (OOM) per-stream-info need cleanup #636

Open
abouteiller opened this issue Feb 14, 2024 · 0 comments
Labels
bug Something isn't working
@abouteiller
Contributor

  If I simulate being unable to allocate memory on the device, both for data and for streams, I get the following stack:
#5  0x00007ffff7e57995 in parsec_list_destruct (list=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_list.c:45
#6  0x00007ffff7e5bdaa in parsec_obj_run_destructors (object=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#7  0x00007ffff7e5c102 in parsec_info_destructor (obj=0x7ffff7fbd260 <parsec_per_stream_infos>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/info.c:34
#8  0x00007ffff7eb0ceb in parsec_obj_run_destructors (object=0x7ffff7fbd260 <parsec_per_stream_infos>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#9  0x00007ffff7eb35bd in parsec_mca_device_fini () at /home/bosilca/unstable/parsec/parsec/parsec/mca/device/device.c:572
#10 0x00007ffff7e764d0 in parsec_fini (pcontext=0x7fffffff49a0) at /home/bosilca/unstable/parsec/parsec/parsec/parsec.c:1235
#11 0x000000000040374f in main (argc=1, argv=0x7fffffff4b38)
    at /home/bosilca/unstable/parsec/parsec/tests/dsl/dtd/dtd_test_allreduce.c:237

The issue seems to occur during the release of parsec_per_stream_infos, because some infos are still registered inside it. The CUDA code itself seems to behave well: the devices that fail to allocate memory are removed, and the execution proceeds without them.
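
To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It does not use the PaRSEC API: info_registry_t, registry_register, registry_unregister, registry_leftover and device_init are hypothetical names invented for this example. It assumes that a device registers a per-stream info early in its initialization, and that the registry is expected to be empty when it is destroyed. If the initialization then fails (simulated OOM) without undoing the registration, an entry is left behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registry of per-stream infos (not PaRSEC code). */
typedef struct info_entry {
    struct info_entry *next;
    int id;
} info_entry_t;

typedef struct {
    info_entry_t *head;
    size_t count;
} info_registry_t;

/* Add an entry; returns 0 on success, -1 if the allocation fails. */
static int registry_register(info_registry_t *r, int id)
{
    info_entry_t *e = malloc(sizeof(*e));
    if (NULL == e) return -1;
    e->id = id;
    e->next = r->head;
    r->head = e;
    r->count++;
    return 0;
}

/* Remove the entry with the given id; returns 0 if found, -1 otherwise. */
static int registry_unregister(info_registry_t *r, int id)
{
    for (info_entry_t **p = &r->head; NULL != *p; p = &(*p)->next) {
        if ((*p)->id == id) {
            info_entry_t *dead = *p;
            *p = dead->next;
            free(dead);
            r->count--;
            return 0;
        }
    }
    return -1;
}

/* Number of entries still registered. A destructor that expects an
 * empty list (an assumption here) would fail when this is non-zero. */
static size_t registry_leftover(const info_registry_t *r)
{
    return r->count;
}

/* Simulated device initialization: registers an info, then optionally
 * fails as if the device allocation had run out of memory. */
static int device_init(info_registry_t *r, int id,
                       int simulate_oom, int cleanup_on_error)
{
    if (0 != registry_register(r, id)) return -1;
    if (simulate_oom) {
        /* The cleanup the issue title asks for: undo the registration. */
        if (cleanup_on_error) registry_unregister(r, id);
        return -1;
    }
    return 0;
}
```

With cleanup_on_error set to 0, a failed device_init leaves one entry in the registry; with it set to 1, the registry is empty again and can be torn down safely. Whether the actual fix belongs in the device error path or in the info destructor is not settled by this sketch.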

Originally posted by @bosilca in #630 (comment)

@abouteiller abouteiller added the bug Something isn't working label Feb 14, 2024
@abouteiller abouteiller added this to the v4.1 milestone Feb 14, 2024