GPU memory allocation far exceeds requested size #630

Closed
abouteiller opened this issue Feb 1, 2024 · 3 comments

@abouteiller
Contributor

Describe the bug

Rarely, the GPU memory allocator will try to obtain a very large amount of memory, far in excess of what is physically available or requested by the mca params.
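
For context, here is a minimal sketch (assumed logic, not PaRSEC's actual code) of how a fractional memory cap is typically derived from the device's total memory; the memory_use_pct parameter and the clamp are assumptions, while cudaMemGetInfo and cudaMalloc are the real CUDA runtime calls:

/* Hedged sketch: a hypothetical cap computation, not PaRSEC's real code. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_mem = 0, total_mem = 0;
    int memory_use_pct = 10;   /* the 10% limit the mca params request */

    if (cudaMemGetInfo(&free_mem, &total_mem) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }

    size_t request = (total_mem / 100) * (size_t)memory_use_pct;
    if (request > free_mem)
        request = free_mem;    /* never request more than is currently free */

    void *base = NULL;
    if (cudaMalloc(&base, request) != cudaSuccess) {
        fprintf(stderr, "Allocating %zu bytes on the GPU device failed\n", request);
        return 1;
    }
    printf("Reserved %zu of %zu total bytes (%d%%)\n",
           request, total_mem, memory_use_pct);
    cudaFree(base);
    return 0;
}

With memory_use_pct at 10 this asks for roughly 8 GB on an 80 GB device; a percentage misread as ~100, or a skipped clamp, would reproduce a request of the magnitude seen in the log below.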

To Reproduce

This is not particular to allreduce:mp; it just happened to hit this test this time.

104/111 Test #104: dsl/dtd/allreduce:mp ............................................***Failed   13.36 sec
W@00000 /tmp/parsec/parsec/parsec/mca/device/cuda/device_cuda_module.c:428 cudaStreamCreate out of memory
[line repeats multiple times]

W@00001 /tmp/parsec/parsec/parsec/mca/device/cuda/device_cuda_module.c:348 cudaMalloc out of memory
W@00001 GPU[cuda(6)] Allocating 79855878144 bytes of memory on the GPU device failed


W@00003 /tmp/parsec/parsec/parsec/mca/device/cuda/device_cuda_module.c:428 cudaStreamCreate out of memory
W@00000 /tmp/parsec/parsec/parsec/mca/device/cuda/device_cuda_module.c:428 cudaStreamCreate out of memory
Root: 0; value=780

My rank: 1, bcast recv data: 780
My rank: 2, bcast recv data: 780
My rank: 3, bcast recv data: 780
[c53340475b50:333063] *** Process received signal ***
[c53340475b50:333063] Signal: Segmentation fault (11)
[c53340475b50:333063] Signal code: Address not mapped (1)
[c53340475b50:333063] Failing at address: 0x10
[c53340475b50:333063] [ 0] /lib64/libc.so.6(+0x54df0)[0x7f929adf1df0]
[c53340475b50:333063] [ 1] /tmp/parsec/parsec/build/Release/shared_ON/profile_OFF/parsec/libparsec.so.4(+0x29d40)[0x7f929b69fd40]
[c53340475b50:333063] [ 2] /tmp/parsec/parsec/build/Release/shared_ON/profile_OFF/parsec/libparsec.so.4(parsec_cuda_module_fini+0x149)[0x7f929b6d6aa9]
[c53340475b50:333063] [ 3] /tmp/parsec/parsec/build/Release/shared_ON/profile_OFF/parsec/libparsec.so.4(+0x5f297)[0x7f929b6d5297]
[c53340475b50:333063] [ 4] /tmp/parsec/parsec/build/Release/shared_ON/profile_OFF/parsec/libparsec.so.4(parsec_mca_device_fini+0x66)[0x7f929b6d0e46]
[c53340475b50:333063] [ 5] /tmp/parsec/parsec/build/Release/shared_ON/profile_OFF/parsec/libparsec.so.4(parsec_fini+0x1e3)[0x7f929b6b2103]
[c53340475b50:333063] [ 6] dsl/dtd/dtd_test_allreduce[0x402829]
[c53340475b50:333063] [ 7] /lib64/libc.so.6(+0x3feb0)[0x7f929addceb0]
[c53340475b50:333063] [ 8] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f929addcf60]
[c53340475b50:333063] [ 9] dsl/dtd/dtd_test_allreduce[0x402b75]
[c53340475b50:333063] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node c53340475b50 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Environment (please complete the following information):

  • CI runner

Additional context

https://github.com/ICLDisco/parsec/actions/runs/7733854871/job/21086769192?pr=629

@abouteiller abouteiller added the bug Something isn't working label Feb 1, 2024
@abouteiller abouteiller added this to the v4.0 milestone Feb 1, 2024
@bosilca
Contributor

bosilca commented Feb 2, 2024

I am unable to replicate this as indicated here. However, if I simulate being unable to allocate memory on the device, both for data and for streams, I get the following stack:

#5  0x00007ffff7e57995 in parsec_list_destruct (list=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_list.c:45
#6  0x00007ffff7e5bdaa in parsec_obj_run_destructors (object=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#7  0x00007ffff7e5c102 in parsec_info_destructor (obj=0x7ffff7fbd260 <parsec_per_stream_infos>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/info.c:34
#8  0x00007ffff7eb0ceb in parsec_obj_run_destructors (object=0x7ffff7fbd260 <parsec_per_stream_infos>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#9  0x00007ffff7eb35bd in parsec_mca_device_fini () at /home/bosilca/unstable/parsec/parsec/parsec/mca/device/device.c:572
#10 0x00007ffff7e764d0 in parsec_fini (pcontext=0x7fffffff49a0) at /home/bosilca/unstable/parsec/parsec/parsec/parsec.c:1235
#11 0x000000000040374f in main (argc=1, argv=0x7fffffff4b38)
    at /home/bosilca/unstable/parsec/parsec/tests/dsl/dtd/dtd_test_allreduce.c:237

The issue seems to be during the release of parsec_per_stream_infos, because there are still infos registered inside it. The CUDA code itself actually performs well: the devices that fail to allocate memory are removed, and the execution unfolds without them.
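
A minimal mock of that failure mode (simplified types, not the real parsec_list/parsec_info API): if the device fini path frees per-stream state without unregistering the corresponding info, the list destructor later walks a stale pointer, matching the "Address not mapped" fault at a small offset:

/* Mock illustration only: these types stand in for parsec_per_stream_infos
 * and its destructor; they are not PaRSEC's actual API. */
#include <stdio.h>
#include <stdlib.h>

typedef struct info_entry {
    struct info_entry *next;
    void              *stream_state;   /* owned by the device module */
} info_entry;

typedef struct { info_entry *head; } info_list;

static void info_list_destruct(info_list *l)
{
    /* Mirrors the crash site: any entry still registered at destruction
     * time is dereferenced here; if stream_state was already freed by
     * the device fini path, this reads through a dangling pointer. */
    for (info_entry *e = l->head; e != NULL; ) {
        info_entry *next = e->next;
        printf("releasing info bound to %p\n", e->stream_state);
        free(e);
        e = next;
    }
    l->head = NULL;
}

int main(void)
{
    info_list per_stream_infos = { NULL };

    info_entry *e = malloc(sizeof *e);
    e->next = NULL;
    e->stream_state = malloc(64);      /* stands in for a stream's info */
    per_stream_infos.head = e;

    /* The fix direction: unregister the entry before the device state
     * goes away, so the list is empty when its destructor runs. */
    per_stream_infos.head = NULL;
    free(e->stream_state);
    free(e);

    info_list_destruct(&per_stream_infos);   /* now a no-op, no stale reads */
    return 0;
}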

@abouteiller
Contributor Author

The puzzling output from the log is W@00001 GPU[cuda(6)] Allocating 79855878144 bytes of memory on the GPU device failed: since we requested that at most 10% of GPU memory be allocated, we should not be requesting what looks like 100% of the 80 GB of VRAM.

However, given that #633 fixes a problem where the environment variable that sets the 10% limit would not be consistently visible and would be ignored at random, this may be a false positive. Let's reopen if we see it again.
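
A hedged illustration of how that could produce the observed request (the variable name and the 95% fallback are assumptions, not PaRSEC's actual defaults): if the limit comes from an environment variable that is not forwarded to every rank, the ranks that miss it silently fall back to reserving most of the device:

/* Assumed variable name and fallback value; shown only to illustrate how
 * an inconsistently visible environment variable yields a ~100% request. */
#include <stdio.h>
#include <stdlib.h>

static int gpu_memory_use_pct(void)
{
    const char *s = getenv("PARSEC_MCA_device_cuda_memory_use"); /* assumed */
    if (s == NULL || *s == '\0')
        return 95;                 /* fallback: reserve most of the device */
    int pct = atoi(s);
    return (pct > 0 && pct <= 100) ? pct : 95;
}

int main(void)
{
    /* With the variable set to "10" this prints 10; on a rank where the
     * launcher did not export it, it prints 95, which on an 80 GB GPU is
     * a ~76 GB request, close to the 79855878144 bytes in the log. */
    printf("memory cap: %d%%\n", gpu_memory_use_pct());
    return 0;
}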

@abouteiller
Contributor Author

Created a tracking issue for the per-stream-info cleanup problem. Closing this again.
