Skip to content

Commit 66dcdcf

Browse files
cmeestersjohanneskoestercoderabbitai[bot]
authored
feat!: improved GPU job support (#173)
In the light of more and more accelerator applications (AI, base mapping, ...) the fall-back onto `slurm_extra` becomes a bit tedious to use. Hence, the resource support for `gres`. Addresses issue #52 (and to a minor extent: #18 and #104). Supersedes PR #172 . <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Updated documentation section on "GPU Jobs" to clarify how to request GPU resources with new syntax examples. - **Bug Fixes** - Improved error handling and reporting for job submission processes. - Clarified error messages in test cases for better understanding. - **Dependency Updates** - Updated `snakemake-executor-plugin-slurm-jobstep` dependency version from `^0.2.0` to `^0.3.0`. - **Tests** - Streamlined test cases by removing less relevant tests and enhancing clarity of error messages. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Johannes Köster <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
1 parent 39fcd5a commit 66dcdcf

File tree

5 files changed

+197
-6
lines changed

5 files changed

+197
-6
lines changed

docs/further.md

+30
Original file line numberDiff line numberDiff line change
@@ -88,6 +88,36 @@ $ snakemake --set-resources calc_pi:mpi="mpiexec" ...
8888

8989
To submit "ordinary" MPI jobs, submitting with `tasks` (the MPI ranks) is sufficient. Alternatively, on some clusters, it might be convenient to just configure `nodes`. Consider using a combination of `tasks` and `cpus_per_task` for hybrid applications (those that use ranks (multiprocessing) and threads). A detailed topology layout can be achieved using the `slurm_extra` parameter (see below) using further flags like `--distribution`.
9090

91+
### GPU Jobs
92+
93+
SLURM allows to specify GPU request with the `--gres` or `--gpus` flags and Snakemake takes a similar approach. Resources can be asked for with
94+
95+
- The resource `gpu` can be used, e.g. by just requesting the number of GPUs like `gpu=2`. This can be combined with the `gpu_model` resource, i.e. `gpu_model=a100` or independently. The combination will result in a flag to `sbatch` like `--gpus=a100:2`. The Snakemake `gpu` resource has to be number.
96+
- Alternatively, the resource `gres`, the syntax is `<string>:<number>` or `<string>:<model>:<number>`, i.e. `gres=gpu:1` or `gres=gpu:a100:2` (assuming GPU model).
97+
98+
.. note:: Internally, Snakemake knows the resource `gpu_manufacturer`, too. However, SLURM does not know the distinction between model and manufacturer. Essentially, the preferred way to request an accelerator will depend on your specific cluster setup.
99+
Also, to be consistent within Snakemake, the resource is called `gpu` not `gpus`.
100+
101+
Additionally, `cpus_per_gpu` can be set - Snakemakes `threads` settings will otherwise be used to set `cpus_per_gpu`. If `cpus_per_gpu` is lower or equal to zero, no CPU is requested from SLURM (and cluster defaults will kick in, if any).
102+
103+
A sample workflow profile might look like:
104+
105+
```YAML
106+
set-resources:
107+
gres_request_rule:
108+
gres: "gpu:1"
109+
110+
multi_gpu_rule:
111+
gpu: 2
112+
gpu_model: "a100"
113+
cpus_per_gpu: 4
114+
115+
no_cpu_gpu_rule:
116+
gpu: 1
117+
cpus_per_gpu: 0 # Values <= 0 indicate that NO CPU request string
118+
# will be issued.
119+
```
120+
91121
### Running Jobs locally
92122

93123
Not all Snakemake workflows are adapted for heterogeneous environments, particularly clusters. Users might want to avoid the submission of _all_ rules as cluster jobs. Non-cluster jobs should usually include _short_ jobs, e.g. internet downloads or plotting rules.

pyproject.toml

+1-1
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@ keywords = ["snakemake", "plugin", "executor", "cluster", "slurm"]
1717
python = "^3.11"
1818
snakemake-interface-common = "^1.13.0"
1919
snakemake-interface-executor-plugins = "^9.1.1"
20-
snakemake-executor-plugin-slurm-jobstep = "^0.2.0"
20+
snakemake-executor-plugin-slurm-jobstep = "^0.3.0"
2121
throttler = "^1.2.2"
2222

2323
[tool.poetry.group.dev.dependencies]

snakemake_executor_plugin_slurm/__init__.py

+12-5
Original file line numberDiff line numberDiff line change
@@ -26,9 +26,9 @@
2626
JobExecutorInterface,
2727
)
2828
from snakemake_interface_common.exceptions import WorkflowError
29-
from snakemake_executor_plugin_slurm_jobstep import get_cpus_per_task
29+
from snakemake_executor_plugin_slurm_jobstep import get_cpu_setting
3030

31-
from .utils import delete_slurm_environment, delete_empty_dirs
31+
from .utils import delete_slurm_environment, delete_empty_dirs, set_gres_string
3232

3333

3434
@dataclass
@@ -217,6 +217,8 @@ def run_job(self, job: JobExecutorInterface):
217217
if self.workflow.executor_settings.requeue:
218218
call += " --requeue"
219219

220+
call += set_gres_string(job)
221+
220222
if job.resources.get("clusters"):
221223
call += f" --clusters {job.resources.clusters}"
222224

@@ -247,7 +249,11 @@ def run_job(self, job: JobExecutorInterface):
247249

248250
# fixes #40 - set ntasks regardless of mpi, because
249251
# SLURM v22.05 will require it for all jobs
250-
call += f" --ntasks={job.resources.get('tasks', 1)}"
252+
gpu_job = job.resources.get("gpu") or "gpu" in job.resources.get("gres", "")
253+
if gpu_job:
254+
call += f" --ntasks-per-gpu={job.resources.get('tasks', 1)}"
255+
else:
256+
call += f" --ntasks={job.resources.get('tasks', 1)}"
251257
# MPI job
252258
if job.resources.get("mpi", False):
253259
if not job.resources.get("tasks_per_node") and not job.resources.get(
@@ -258,8 +264,9 @@ def run_job(self, job: JobExecutorInterface):
258264
"specified. Assuming 'tasks_per_node=1'."
259265
"Probably not what you want."
260266
)
261-
262-
call += f" --cpus-per-task={get_cpus_per_task(job)}"
267+
# we need to set cpus-per-task OR cpus-per-gpu, the function
268+
# will return a string with the corresponding value
269+
call += f" {get_cpu_setting(job, gpu_job)}"
263270

264271
if job.resources.get("slurm_extra"):
265272
self.check_slurm_extra(job)

snakemake_executor_plugin_slurm/utils.py

+62
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,14 @@
11
# utility functions for the SLURM executor plugin
22

33
import os
4+
import re
45
from pathlib import Path
56

7+
from snakemake_interface_executor_plugins.jobs import (
8+
JobExecutorInterface,
9+
)
10+
from snakemake_interface_common.exceptions import WorkflowError
11+
612

713
def delete_slurm_environment():
814
"""
@@ -40,3 +46,59 @@ def delete_empty_dirs(path: Path) -> None:
4046
except (OSError, FileNotFoundError) as e:
4147
# Provide more context in the error message
4248
raise OSError(f"Failed to remove empty directory {path}: {e}") from e
49+
50+
51+
def set_gres_string(job: JobExecutorInterface) -> str:
52+
"""
53+
Function to set the gres string for the SLURM job
54+
based on the resources requested in the job.
55+
"""
56+
# generic resources (GRES) arguments can be of type
57+
# "string:int" or "string:string:int"
58+
gres_re = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_]+)?:\d+$")
59+
# gpu model arguments can be of type "string"
60+
gpu_model_re = re.compile(r"^[a-zA-Z0-9_]+$")
61+
# The Snakemake resources can be only be of type "int",
62+
# hence no further regex is needed.
63+
64+
gpu_string = None
65+
if job.resources.get("gpu"):
66+
gpu_string = str(job.resources.get("gpu"))
67+
68+
gpu_model = None
69+
if job.resources.get("gpu_model"):
70+
gpu_model = job.resources.get("gpu_model")
71+
72+
# ensure that gres is not set, if gpu and gpu_model are set
73+
if job.resources.get("gres") and gpu_string:
74+
raise WorkflowError(
75+
"GRES and GPU are set. Please only set one of them.", rule=job.rule
76+
)
77+
elif not job.resources.get("gres") and not gpu_model and not gpu_string:
78+
return ""
79+
80+
if job.resources.get("gres"):
81+
# Validate GRES format (e.g., "gpu:1", "gpu:tesla:2")
82+
gres = job.resources.gres
83+
if not gres_re.match(gres):
84+
raise WorkflowError(
85+
f"Invalid GRES format: {gres}. Expected format: "
86+
"'<name>:<number>' or '<name>:<type>:<number>' "
87+
"(e.g., 'gpu:1' or 'gpu:tesla:2')"
88+
)
89+
return f" --gres={job.resources.gres}"
90+
91+
if gpu_model and gpu_string:
92+
# validate GPU model format
93+
if not gpu_model_re.match(gpu_model):
94+
raise WorkflowError(
95+
f"Invalid GPU model format: {gpu_model}."
96+
" Expected format: '<name>' (e.g., 'tesla')"
97+
)
98+
return f" --gpus={gpu_model}:{gpu_string}"
99+
elif gpu_model and not gpu_string:
100+
raise WorkflowError("GPU model is set, but no GPU number is given")
101+
elif gpu_string:
102+
# we assume here, that the validator ensures that the 'gpu_string'
103+
# is an integer
104+
return f" --gpus={gpu_string}"

tests/tests.py

+92
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,12 @@
11
from typing import Optional
22
import snakemake.common.tests
33
from snakemake_interface_executor_plugins.settings import ExecutorSettingsBase
4+
from unittest.mock import MagicMock
5+
import pytest
46

57
from snakemake_executor_plugin_slurm import ExecutorSettings
8+
from snakemake_executor_plugin_slurm.utils import set_gres_string
9+
from snakemake_interface_common.exceptions import WorkflowError
610

711

812
class TestWorkflows(snakemake.common.tests.TestWorkflowsLocalStorageBase):
@@ -18,3 +22,91 @@ def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
1822
class TestWorkflowsRequeue(TestWorkflows):
1923
def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
2024
return ExecutorSettings(requeue=True)
25+
26+
27+
class TestGresString:
28+
"""Test cases for the set_gres_string function."""
29+
30+
@pytest.fixture
31+
def mock_job(self):
32+
"""Create a mock job with configurable resources."""
33+
34+
def _create_job(**resources):
35+
mock_resources = MagicMock()
36+
# Configure get method to return values from resources dict
37+
mock_resources.get.side_effect = lambda key, default=None: resources.get(
38+
key, default
39+
)
40+
# Add direct attribute access for certain resources
41+
for key, value in resources.items():
42+
setattr(mock_resources, key, value)
43+
44+
mock_job = MagicMock()
45+
mock_job.resources = mock_resources
46+
return mock_job
47+
48+
return _create_job
49+
50+
def test_no_gres_or_gpu(self, mock_job):
51+
"""Test with no GPU or GRES resources specified."""
52+
job = mock_job()
53+
assert set_gres_string(job) == ""
54+
55+
def test_valid_gres_simple(self, mock_job):
56+
"""Test with valid GRES format (simple)."""
57+
job = mock_job(gres="gpu:1")
58+
assert set_gres_string(job) == " --gres=gpu:1"
59+
60+
def test_valid_gres_with_model(self, mock_job):
61+
"""Test with valid GRES format including GPU model."""
62+
job = mock_job(gres="gpu:tesla:2")
63+
assert set_gres_string(job) == " --gres=gpu:tesla:2"
64+
65+
def test_invalid_gres_format(self, mock_job):
66+
"""Test with invalid GRES format."""
67+
job = mock_job(gres="gpu")
68+
with pytest.raises(WorkflowError, match="Invalid GRES format"):
69+
set_gres_string(job)
70+
71+
def test_invalid_gres_format_missing_count(self, mock_job):
72+
"""Test with invalid GRES format (missing count)."""
73+
job = mock_job(gres="gpu:tesla:")
74+
with pytest.raises(WorkflowError, match="Invalid GRES format"):
75+
set_gres_string(job)
76+
77+
def test_valid_gpu_number(self, mock_job):
78+
"""Test with valid GPU number."""
79+
job = mock_job(gpu="2")
80+
assert set_gres_string(job) == " --gpus=2"
81+
82+
def test_valid_gpu_with_name(self, mock_job):
83+
"""Test with valid GPU name and number."""
84+
job = mock_job(gpu="tesla:2")
85+
assert set_gres_string(job) == " --gpus=tesla:2"
86+
87+
def test_gpu_with_model(self, mock_job):
88+
"""Test GPU with model specification."""
89+
job = mock_job(gpu="2", gpu_model="tesla")
90+
assert set_gres_string(job) == " --gpus=tesla:2"
91+
92+
def test_invalid_gpu_model_format(self, mock_job):
93+
"""Test with invalid GPU model format."""
94+
job = mock_job(gpu="2", gpu_model="invalid:model")
95+
with pytest.raises(WorkflowError, match="Invalid GPU model format"):
96+
set_gres_string(job)
97+
98+
def test_gpu_model_without_gpu(self, mock_job):
99+
"""Test GPU model without GPU number."""
100+
job = mock_job(gpu_model="tesla")
101+
with pytest.raises(
102+
WorkflowError, match="GPU model is set, but no GPU number is given"
103+
):
104+
set_gres_string(job)
105+
106+
def test_both_gres_and_gpu_set(self, mock_job):
107+
"""Test error case when both GRES and GPU are specified."""
108+
job = mock_job(gres="gpu:1", gpu="2")
109+
with pytest.raises(
110+
WorkflowError, match="GRES and GPU are set. Please only set one of them."
111+
):
112+
set_gres_string(job)

0 commit comments

Comments
 (0)