HOTFIX: make the default number of devices be all the devices seen by… #613

Merged
1 commit merged into ICLDisco:master on Jan 19, 2024

Conversation

therault
Contributor

… cudaGetDeviceCount/hipGetDeviceCount/zeDeviceGet

The current code tests whether parsec_device_cuda_enabled_index < 0 (or the equivalent counter for LZ), but parsec_device_cuda_enabled_index is the value returned by parsec_mca_param_reg_int_name(), which is never negative: it is the position of that parameter in the parameter array.

The code meant to test whether parsec_device_cuda_enabled < 0, so that if the user passes -1, or omits the MCA parameter entirely (the default value is -1), PaRSEC uses the number of devices discovered by CUDA/HIP/LZ, as the documentation of the MCA parameter claims.
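For illustration, here is a minimal sketch of the pattern being fixed. The exact prototype and registration arguments of parsec_mca_param_reg_int_name() below are assumptions based on the description above, not the actual PaRSEC source:

```c
#include <stdbool.h>
#include <cuda_runtime.h>

/* Assumed prototype: returns the parameter's index in the MCA parameter
 * array and writes the parameter's current value through the last pointer. */
extern int parsec_mca_param_reg_int_name(const char *type, const char *name,
                                         const char *help, bool internal,
                                         bool read_only, int default_value,
                                         int *current_value);

static int parsec_device_cuda_enabled = -1;  /* MCA parameter value, default -1 */
static int parsec_device_cuda_enabled_index; /* index in the parameter array    */

static int cuda_devices_to_enable(void)
{
    parsec_device_cuda_enabled_index =
        parsec_mca_param_reg_int_name("device_cuda", "enabled",
                                      "Number of CUDA devices to enable (-1: all)",
                                      false, false, -1,
                                      &parsec_device_cuda_enabled);

    /* Buggy check: the index is never negative once the parameter is
     * registered, so the fallback below was unreachable:
     *   if (parsec_device_cuda_enabled_index < 0) ...
     * Intended check: use the parameter value, so -1 (passed explicitly or
     * left as the default) falls back to the device count reported by CUDA. */
    if (parsec_device_cuda_enabled < 0) {
        int ndevices = 0;
        cudaGetDeviceCount(&ndevices);
        return ndevices;
    }
    return parsec_device_cuda_enabled;
}
```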

@therault therault requested a review from a team as a code owner January 18, 2024 21:09
Contributor

@devreal devreal left a comment

LGTM

@therault
Contributor Author

Looking at the failures: they occur because we now try to initialize CUDA for all tests if a GPU is available. On 'mp' runs, and more generally on any run with more than one rank, when executed on 'guyot' we oversubscribe the node, so of course all processes try to grab all the memory on the same GPU, and some fail...

I can add --mca device_cuda_enabled 0 to all multiranks tests...

And I guess we should deactivate the multi-rank GPU test when there is only one GPU to share...

Is there a way to detect that we only have 1 node to do the test at the beginning of the CI, and to propagate this information to the testers so they don't try something that cannot succeed?

@abouteiller abouteiller added this to the v4.0 milestone Jan 19, 2024
@abouteiller abouteiller merged commit 97bd126 into ICLDisco:master Jan 19, 2024
3 of 4 checks passed
therault added a commit to therault/parsec that referenced this pull request Jan 22, 2024
PR ICLDisco#613 made all CI tests initialize the GPU if there is a GPU available.
When running in oversubscribed mode, this can lead to spurious test failures: tests that fail not because of a software issue, but because of a deployment issue (multiple processes trying to allocate 90% of the GPU memory at the same time).

In general, since we don't know if the GPU will be used or not, we should not preemptively allocate all the memory on it.
This PR makes memory allocation lazy: it is delayed until we do try to use some GPU memory.

The drawback is that the first GPU task will also pay the cost of a large cuda_malloc / zmalloc etc...
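As a rough sketch of the lazy-allocation idea described above (the pool variables, function name, and error handling are illustrative; only the "delay the big allocation until first use" pattern and the 90% figure come from the commit message):

```c
#include <stddef.h>
#include <cuda_runtime.h>

/* Illustrative only: not the actual PaRSEC zone allocator. */
static void  *gpu_pool      = NULL;
static size_t gpu_pool_size = 0;

/* Called from the first GPU task that needs device memory; the large
 * cudaMalloc is paid here instead of at initialization time. */
static void *gpu_pool_reserve(size_t requested)
{
    if (NULL == gpu_pool) {
        size_t free_mem = 0, total_mem = 0;
        if (cudaSuccess != cudaMemGetInfo(&free_mem, &total_mem))
            return NULL;
        /* Grab ~90% of the currently free device memory, once. */
        gpu_pool_size = (size_t)(0.9 * (double)free_mem);
        if (cudaSuccess != cudaMalloc(&gpu_pool, gpu_pool_size))
            return NULL;
    }
    /* A real allocator would carve 'requested' bytes out of the pool;
     * omitted here for brevity. */
    return (requested <= gpu_pool_size) ? gpu_pool : NULL;
}
```

With this pattern, processes that never launch a GPU task never touch device memory, which avoids the oversubscription failures discussed above.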
Labels: none yet. Projects: none yet. 4 participants.