
Updating GPU (accelerator) support in MLCube. #351

Merged

Conversation

sergey-serebryakov
Contributor

This commit changes how the MLCube runtime works with GPUs. Users have two options for providing information on the required accelerators (a minimal sketch follows this list):

- In the MLCube configuration file, in the `platform` section (`platform.accelerator_count`). This value is optional and, if present, may be an empty / unset string or an integer. It is the number of required accelerators and is semantically equivalent to docker's `--gpus=N` CLI argument.
- On the command line, using the `--gpus` argument, e.g., `mlcube run ... --gpus=4`. This parameter accepts the same values as docker's `--gpus` CLI argument. Concretely:
  - When not set, MLCube uses the `platform.accelerator_count` value if present.
  - When set, it overrides any value assigned to `platform.accelerator_count`. It can be empty (`--gpus=`) to disable GPUs, `all` (`--gpus=all`) to use all available GPUs, a GPU count (`--gpus=N`), or a list of concrete GPUs (`--gpus="device=0,2"`), where GPU indices or UUIDs can be used.
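
As a concrete illustration (only `platform.accelerator_count` and the `--gpus` values above come from this change; the file name, other keys, and the elided run arguments are assumptions):

```yaml
# mlcube.yaml -- illustrative sketch; only the platform section below is relevant here
platform:
  accelerator_count: 2   # optional: empty/unset string, or an integer number of GPUs
```

```shell
# CLI override; "..." stands for the remaining run arguments
mlcube run ... --gpus=               # disable GPUs
mlcube run ... --gpus=all            # use all available GPUs
mlcube run ... --gpus=2              # request two GPUs
mlcube run ... --gpus="device=0,2"   # request specific GPUs by index (UUIDs also work)
```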

The Docker runner uses this value essentially unmodified and passes it to the `docker run` command, i.e., the `--gpus` flag will be present. The Singularity runner passes the `--nv` flag when GPUs are requested. The MLCube runtime does not set `CUDA_VISIBLE_DEVICES`, `SINGULARITYENV_CUDA_VISIBLE_DEVICES`, or any other environment variable (the NVIDIA Docker runtime may set `NVIDIA_VISIBLE_DEVICES`).
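
Roughly, and only as a sketch (image names, mounts, and the remaining arguments are elided or assumed), the runners then invoke something along these lines:

```shell
# Docker runner: the --gpus value is forwarded essentially unmodified (illustrative)
docker run --gpus='"device=0,2"' ... <image> <task-args>

# Singularity runner: adds --nv when GPUs are requested (illustrative)
singularity run --nv ... <image.sif> <task-args>
```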

To debug possible issues, enable debug mode (e.g., `mlcube --log-level=debug run ...`) and search the output for log lines containing `DEBUG Device spec (...) resolved to ...` and `INFO Device params ... resolved to ...`. They provide additional information on how the MLCube runtime determines how GPUs should be used.
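
For example (the grep pattern is an assumption that simply matches the two log messages quoted above; "..." stands for the remaining run arguments):

```shell
mlcube --log-level=debug run ... --gpus="device=0,2" 2>&1 | grep -E "Device (spec|params)"
```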

@sergey-serebryakov sergey-serebryakov requested a review from a team as a code owner January 18, 2024 01:05

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@sergey-serebryakov sergey-serebryakov merged commit 8833ec0 into mlcommons:master Jan 18, 2024
2 checks passed
@sergey-serebryakov sergey-serebryakov deleted the bugfix/device-specs branch January 18, 2024 01:08
@github-actions github-actions bot locked and limited conversation to collaborators Jan 18, 2024