
Updating GPU (accelerator) support in MLCube. #351

Merged

Conversation

sergey-serebryakov
Contributor

This commit changes how the MLCube runtime works with GPUs. Users have two options for providing information on the required accelerators (a minimal sketch follows this list):

- In the MLCube configuration file, in the `platform` section (`platform.accelerator_count`). This value is optional and, if present, may be an empty / unset string or an integer. It is the number of required accelerators and is semantically equivalent to docker's `--gpus=N` CLI argument.
- On the command line, using the `--gpus` argument, e.g., `mlcube run ... --gpus=4`. This parameter accepts the same values as docker's `--gpus` CLI argument. Concretely:
  - When not set, MLCube uses the `platform.accelerator_count` value if present.
  - When set, it overrides any value assigned to `platform.accelerator_count`. It can be empty (`--gpus=`) to disable GPUs, `all` (`--gpus=all`) to use all available GPUs, a GPU count (`--gpus=N`), or a list of concrete GPUs (`--gpus="device=0,2"`), where GPU indices or UUIDs can be used.
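
As a concrete illustration (only `platform.accelerator_count` and the `--gpus` values above come from this change; the file name, other keys, and the elided run arguments are assumptions):

```yaml
# mlcube.yaml -- illustrative sketch; only the platform section below is relevant here
platform:
  accelerator_count: 2   # optional: empty/unset string, or an integer number of GPUs
```

```shell
# CLI override; "..." stands for the remaining run arguments
mlcube run ... --gpus=               # disable GPUs
mlcube run ... --gpus=all            # use all available GPUs
mlcube run ... --gpus=2              # request two GPUs
mlcube run ... --gpus="device=0,2"   # request specific GPUs by index (UUIDs also work)
```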

The Docker runner uses this value essentially unmodified and passes it to the `docker run` command, i.e., the `--gpus` flag will be present. The Singularity runner passes the `--nv` flag when GPUs are requested. The MLCube runtime does not set `CUDA_VISIBLE_DEVICES`, `SINGULARITYENV_CUDA_VISIBLE_DEVICES`, or any other environment variable (the NVIDIA Docker runtime may set `NVIDIA_VISIBLE_DEVICES`).
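
Roughly, and only as a sketch (image names, mounts, and the remaining arguments are elided or assumed), the runners then invoke something along these lines:

```shell
# Docker runner: the --gpus value is forwarded essentially unmodified (illustrative)
docker run --gpus='"device=0,2"' ... <image> <task-args>

# Singularity runner: adds --nv when GPUs are requested (illustrative)
singularity run --nv ... <image.sif> <task-args>
```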

To debug possible issues, enable debug mode (e.g., `mlcube --log-level=debug run ...`) and search the output for log lines containing `DEBUG Device spec (...) resolved to ...` and `INFO Device params ... resolved to ...`. They provide additional information on how the MLCube runtime determines how GPUs should be used.
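
For example (the grep pattern is an assumption that simply matches the two log messages quoted above; "..." stands for the remaining run arguments):

```shell
mlcube --log-level=debug run ... --gpus="device=0,2" 2>&1 | grep -E "Device (spec|params)"
```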

@sergey-serebryakov sergey-serebryakov requested a review from a team as a code owner January 18, 2024 01:05

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@sergey-serebryakov sergey-serebryakov merged commit 8833ec0 into mlcommons:master Jan 18, 2024
2 checks passed
@sergey-serebryakov sergey-serebryakov deleted the bugfix/device-specs branch January 18, 2024 01:08
@github-actions github-actions bot locked and limited conversation to collaborators Jan 18, 2024