HOTFIX: make the default number of devices be all the devices seen by… #613

Merged
1 commit merged into ICLDisco:master on Jan 19, 2024

Conversation

therault
Contributor

… cudaGetDeviceCount/hipGetDeviceCount/zeDeviceGet

The current code tests whether parsec_device_cuda_enabled_index < 0 (or the equivalent counter for LZ), but parsec_device_cuda_enabled_index is the value returned by parsec_mca_param_reg_int_name(), which is never negative: it is the position of that parameter in the parameter array.

The code meant to test whether parsec_device_cuda_enabled < 0, so that if the user passes -1, or omits the MCA parameter entirely (the default value is -1), PaRSEC uses the number of devices discovered by CUDA/HIP/LZ, as the documentation of the MCA parameter claims.
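For illustration, here is a minimal sketch of the pattern being fixed. The exact prototype and registration arguments of parsec_mca_param_reg_int_name() below are assumptions based on the description above, not the actual PaRSEC source:

```c
#include <stdbool.h>
#include <cuda_runtime.h>

/* Assumed prototype: returns the parameter's index in the MCA parameter
 * array and writes the parameter's current value through the last pointer. */
extern int parsec_mca_param_reg_int_name(const char *type, const char *name,
                                         const char *help, bool internal,
                                         bool read_only, int default_value,
                                         int *current_value);

static int parsec_device_cuda_enabled = -1;  /* MCA parameter value, default -1 */
static int parsec_device_cuda_enabled_index; /* index in the parameter array    */

static int cuda_devices_to_enable(void)
{
    parsec_device_cuda_enabled_index =
        parsec_mca_param_reg_int_name("device_cuda", "enabled",
                                      "Number of CUDA devices to enable (-1: all)",
                                      false, false, -1,
                                      &parsec_device_cuda_enabled);

    /* Buggy check: the index is never negative once the parameter is
     * registered, so the fallback below was unreachable:
     *   if (parsec_device_cuda_enabled_index < 0) ...
     * Intended check: use the parameter value, so -1 (passed explicitly or
     * left as the default) falls back to the device count reported by CUDA. */
    if (parsec_device_cuda_enabled < 0) {
        int ndevices = 0;
        cudaGetDeviceCount(&ndevices);
        return ndevices;
    }
    return parsec_device_cuda_enabled;
}
```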

@therault therault requested a review from a team as a code owner January 18, 2024 21:09
Contributor

@devreal devreal left a comment

LGTM

@therault
Contributor Author

Looking at the failures: they occur because we now try to initialize CUDA for all tests if a GPU is available. On 'mp' runs, and more generally on any run with more than one rank, when executed on 'guyot' we oversubscribe the node, so of course all processes try to grab all the memory on the same GPU, and some fail...

I can add --mca device_cuda_enabled 0 to all multiranks tests...

And I guess we should deactivate the multi-rank GPU test when there is only one GPU to share...

Is there a way to detect that we only have 1 node to do the test at the beginning of the CI, and to propagate this information to the testers so they don't try something that cannot succeed?

@abouteiller abouteiller added this to the v4.0 milestone Jan 19, 2024
@abouteiller abouteiller merged commit 97bd126 into ICLDisco:master Jan 19, 2024
3 of 4 checks passed
therault added a commit to therault/parsec that referenced this pull request Jan 22, 2024
PR ICLDisco#613 made all CI tests initialize the GPU if there is a GPU available.
When running in oversubscribed mode, this can lead to spurious test failures: tests that fail not because of a software issue, but because of a deployment issue (multiple processes trying to allocate 90% of the GPU memory at the same time).

In general, since we don't know if the GPU will be used or not, we should not preemptively allocate all the memory on it.
This PR makes memory allocation lazy: it is delayed until we do try to use some GPU memory.

The drawback is that the first GPU task will also pay the cost of a large cuda_malloc / zmalloc etc...
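As a rough sketch of the lazy-allocation idea described above (the pool variables, function name, and error handling are illustrative; only the "delay the big allocation until first use" pattern and the 90% figure come from the commit message):

```c
#include <stddef.h>
#include <cuda_runtime.h>

/* Illustrative only: not the actual PaRSEC zone allocator. */
static void  *gpu_pool      = NULL;
static size_t gpu_pool_size = 0;

/* Called from the first GPU task that needs device memory; the large
 * cudaMalloc is paid here instead of at initialization time. */
static void *gpu_pool_reserve(size_t requested)
{
    if (NULL == gpu_pool) {
        size_t free_mem = 0, total_mem = 0;
        if (cudaSuccess != cudaMemGetInfo(&free_mem, &total_mem))
            return NULL;
        /* Grab ~90% of the currently free device memory, once. */
        gpu_pool_size = (size_t)(0.9 * (double)free_mem);
        if (cudaSuccess != cudaMalloc(&gpu_pool, gpu_pool_size))
            return NULL;
    }
    /* A real allocator would carve 'requested' bytes out of the pool;
     * omitted here for brevity. */
    return (requested <= gpu_pool_size) ? gpu_pool : NULL;
}
```

With this pattern, processes that never launch a GPU task never touch device memory, which avoids the oversubscription failures discussed above.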
Labels: none yet. Projects: none yet. 4 participants.