GPU memory allocation far exceeds requested size #630
Comments
I am unable to replicate this as indicated here. However, if I simulate being unable to allocate memory on the device, both for data and for streams, I get the following stack:
The issue seems to occur during the release of
The puzzling output from the log is
However, given that #633 fixes a problem where the environment variable that sets the 10% limit would not be consistently visible and could be ignored at random, this may be a false positive. Let's re-open if we see it again.
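For context on the comment above, a minimal sketch of how a percentage-based GPU memory cap is typically derived from an environment variable, and why the cap silently jumps back to a large default whenever that variable is not visible to the process. This is not PaRSEC's actual allocator code; the `GPU_MEM_USE_PCT` name and the 90% default are assumptions for illustration only.

```c
/* Illustrative sketch only -- not PaRSEC's actual allocator code.
 * Shows how a percentage cap on GPU memory is usually computed, and how
 * a missing/unpropagated environment variable makes the cap fall back
 * to a much larger default. GPU_MEM_USE_PCT is a hypothetical name. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static size_t gpu_memory_cap(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return 0;

    double pct = 90.0;                            /* assumed default: use most of the device */
    const char *env = getenv("GPU_MEM_USE_PCT");  /* hypothetical variable name */
    if (env != NULL)
        pct = atof(env);                          /* e.g. "10" for a 10% limit */

    /* If the variable is not visible to this rank, pct stays at the
     * default and the allocator requests far more memory than intended. */
    return (size_t)(total_bytes * (pct / 100.0));
}

int main(void)
{
    printf("GPU memory cap: %zu bytes\n", gpu_memory_cap());
    return 0;
}
```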
Created a tracking issue for the per-stream-info cleanup problem. Closing this again.
Describe the bug
Rarely, the GPU memory allocator will try to obtain a very large amount of memory, far in excess of what is physically available or requested by the mca params.
To Reproduce
This is not particular to allreduce:mp; it just happened on this one this time.
Environment (please complete the following information):
Additional context
https://github.com/ICLDisco/parsec/actions/runs/7733854871/job/21086769192?pr=629