SmartSim Singularity Integration #204

Merged
merged 31 commits into from
Jun 9, 2022
9d28cfa
Base implementation of singularity support
Jun 2, 2022
f44e81b
test sing pull
Jun 2, 2022
b4b446e
fire off testing, actually
Jun 2, 2022
44f5d9d
singularity pull redis container
Jun 3, 2022
fdfdaa5
use custom image
Jun 3, 2022
336f68c
add https
Jun 3, 2022
1219985
https -> oras (OCI registry)
Jun 3, 2022
ad63071
switch to dockerhub because OCI not working on gh actions
Jun 3, 2022
2a7ee53
fix bind_paths bug
Jun 3, 2022
2ce6859
fix bind_paths bug further
Jun 3, 2022
8b71e1e
container test
Jun 3, 2022
3b9933b
fix slurm bug
Jun 3, 2022
a7582fe
fix serialization bug
Jun 3, 2022
a511e9a
fix bind_paths bug and docker path in test
Jun 3, 2022
2d41033
fix bind_paths typo
Jun 3, 2022
a472fc9
Shift run_command-based approach to Step-based modification
Jun 3, 2022
c5d792c
fix list exe bug
Jun 3, 2022
aa3c9c5
listify container cmds
Jun 3, 2022
766e5fd
Merge branch 'develop' of github.com:CrayLabs/SmartSim into singularity
Jun 5, 2022
0dcc2d4
Fix PBS bug
Jun 5, 2022
6ecd06d
update test to use smartredis
Jun 7, 2022
9e8f4a6
Introduce container-WLM test suite
Jun 8, 2022
3e6c1ef
Add Dockerfile used to generate container-testing image
Jun 8, 2022
ada7561
Split up WLM and non-WLM tests
Jun 8, 2022
ac378ab
Doc update + organizing tests in preparation for migration
Jun 8, 2022
4e3dfe6
Fix launcher check on WLM test:
Jun 8, 2022
8305ff7
Address some (but not all) feedback
Jun 8, 2022
c743639
Update to alrigazzi dockerhub repo and more feedback addressed
Jun 8, 2022
3b4bd97
Tests!
Jun 9, 2022
2cc1b98
Drop unused test file
Jun 9, 2022
6bee2cf
Add 1 more test
Jun 9, 2022
22 changes: 20 additions & 2 deletions .github/workflows/run_tests.yml
@@ -12,7 +12,7 @@ env:
   HOMEBREW_NO_BOTTLE_SOURCE_FALLBACK: "ON"
   HOMEBREW_NO_GITHUB_API: "ON"
   HOMEBREW_NO_INSTALL_CLEANUP: "ON"
-
+  DEBIAN_FRONTEND: "noninteractive" # Disable interactive apt install sessions

 jobs:
   run_tests:
@@ -47,11 +47,29 @@ jobs:
           sudo apt-get install -y wget

       - name: Install GNU make for MacOS and set GITHUB_PATH
-        if: "contains( matrix.os, 'macos' )"
+        if: contains( matrix.os, 'macos' )
         run: |
           brew install make || true
           echo "$(brew --prefix)/opt/make/libexec/gnubin" >> $GITHUB_PATH

+      - name: Build Singularity from source
+        if: contains( matrix.os, 'ubuntu' ) && matrix.py_v == 3.9 && matrix.rai == '1.2.5'
+        run: |
+          sudo apt-get install -y libseccomp-dev pkg-config squashfs-tools cryptsetup curl git # wget build-essential
+          echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc
+          source ~/.bashrc
+          export VERSION=1.0.0 # Apptainer (singularity) version
+          wget https://github.com/apptainer/apptainer/releases/download/v${VERSION}/apptainer-${VERSION}.tar.gz
+          tar -xzf apptainer-${VERSION}.tar.gz
+          cd apptainer-${VERSION}
+          ./mconfig
+          make -C builddir
+          sudo make -C builddir install
+
+      - name: singularity pull test container # This lets us time how long the pull takes
+        if: contains( matrix.os, 'ubuntu' ) && matrix.py_v == 3.9 && matrix.rai == '1.2.5'
+        run: singularity pull docker://alrigazzi/smartsim-testing
+
       - name: Install SmartSim (with ML backends)
         run: python -m pip install .[dev,ml,ray]

8 changes: 8 additions & 0 deletions docker/testing/Dockerfile
@@ -0,0 +1,8 @@
# syntax=docker/dockerfile:1
FROM ubuntu:21.10
ENV DEBIAN_FRONTEND noninteractive
RUN apt update && apt install -y python3 python3-pip python-is-python3 cmake git
RUN pip install torch==1.9.1
RUN git clone https://github.com/CrayLabs/SmartRedis.git
RUN cd SmartRedis && pip install . && make lib; cd ..

38 changes: 38 additions & 0 deletions docker/testing/README.md
@@ -0,0 +1,38 @@
# container-testing

This container is hosted on dockerhub to be used for SmartSim container
integration testing. Below are the commands to push an updated version of
the container.

## Building and interacting with container locally

```sh
# Build container
docker build -t container-testing .

# Start a shell on container to try things out
docker run -it container-testing bash
```

Within the container, you can verify that you can import packages like
smartredis or pytorch locally.

## Pushing container updates to DockerHub repository

Note: <version> is bumped each time an update is pushed.
Versions have no relation to SmartSim versions.

```sh
# See current versions to determine next version
docker image inspect --format '{{.RepoTags}}' alrigazzi/smartsim-testing

docker login

# Create tags for current build of container
docker image tag container-testing alrigazzi/smartsim-testing:latest
docker image tag container-testing alrigazzi/smartsim-testing:<version>

# Push current build of container with all tags created
docker image push --all-tags alrigazzi/smartsim-testing
```

3 changes: 3 additions & 0 deletions smartsim/_core/launcher/step/alpsStep.py
@@ -75,6 +75,9 @@ def get_launch_cmd(self):
            launch_script_path = self.get_colocated_launch_script()
            aprun_cmd.extend([bash, launch_script_path])

        if self.run_settings.container:
            aprun_cmd += self.run_settings.container._container_cmds()

        aprun_cmd += self._build_exe()

        # if its in a batch, redirect stdout to
3 changes: 3 additions & 0 deletions smartsim/_core/launcher/step/localStep.py
@@ -53,6 +53,9 @@ def get_launch_cmd(self):
            launch_script_path = self.get_colocated_launch_script()
            cmd.extend([bash, launch_script_path])

        if self.run_settings.container:
            cmd += self.run_settings.container._container_cmds()

        # build executable
        cmd.extend(self.run_settings.exe)
        if self.run_settings.exe_args:
3 changes: 3 additions & 0 deletions smartsim/_core/launcher/step/slurmStep.py
@@ -157,6 +157,9 @@ def get_launch_cmd(self):
            launch_script_path = self.get_colocated_launch_script()
            srun_cmd += [bash, launch_script_path]

        if self.run_settings.container:
            srun_cmd += self.run_settings.container._container_cmds()

        srun_cmd += self._build_exe()
        return srun_cmd

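In all three step classes (`alpsStep.py`, `localStep.py`, `slurmStep.py`) the container commands are appended after the launcher arguments and before the executable. The ordering can be sketched as follows. This is a minimal illustration only: `compose_launch_cmd` and its parameters are hypothetical names and are not part of SmartSim.

```python
from typing import List, Optional


def compose_launch_cmd(
    launcher_cmd: List[str],
    exe: List[str],
    exe_args: Optional[List[str]] = None,
    container_cmds: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: launcher args, then container prefix, then exe."""
    cmd = list(launcher_cmd)  # copy so the caller's list is not mutated
    if container_cmds:
        # Mirrors `cmd += self.run_settings.container._container_cmds()`
        cmd += container_cmds
    cmd += exe
    if exe_args:
        cmd += exe_args
    return cmd


if __name__ == "__main__":
    print(
        compose_launch_cmd(
            ["srun", "--nodes", "1"],
            ["python"],
            ["train.py"],
            container_cmds=["singularity", "exec", "docker://alrigazzi/smartsim-testing"],
        )
    )
```

Because the container prefix sits directly in front of the executable, the launcher (`srun`, `aprun`, or none for the local case) starts `singularity exec <image>` on each allocated resource, and the executable is then looked up inside the image.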
2 changes: 2 additions & 0 deletions smartsim/experiment.py
@@ -512,6 +512,7 @@ def create_run_settings(
        run_command="auto",
        run_args=None,
        env_vars=None,
        container=None,
        **kwargs,
    ):
        """Create a ``RunSettings`` instance.
@@ -558,6 +559,7 @@ class in SmartSim. If found, the class corresponding
                run_command=run_command,
                run_args=run_args,
                env_vars=env_vars,
                container=container,
                **kwargs,
            )
        except SmartSimError as e:
3 changes: 3 additions & 0 deletions smartsim/settings/__init__.py
@@ -5,6 +5,7 @@
from .mpirunSettings import MpiexecSettings, MpirunSettings, OrterunSettings
from .pbsSettings import QsubBatchSettings
from .slurmSettings import SbatchSettings, SrunSettings
from .containers import Container, Singularity

__all__ = [
    "AprunSettings",
@@ -18,4 +19,6 @@
    "RunSettings",
    "SbatchSettings",
    "SrunSettings",
    "Container",
    "Singularity",
]
4 changes: 4 additions & 0 deletions smartsim/settings/alpsSettings.py
@@ -67,6 +67,10 @@ def make_mpmd(self, aprun_settings):
            raise SSUnsupportedError(
                "Colocated models cannot be run as a mpmd workload"
            )
        if self.container:
            raise SSUnsupportedError(
                "Containerized MPMD workloads are not yet supported."
            )
        self.mpmd.append(aprun_settings)

    def set_cpus_per_task(self, cpus_per_task):
24 changes: 17 additions & 7 deletions smartsim/settings/base.py
@@ -1,5 +1,4 @@
-# BSD 2-Clause License
-#
+# BSD 2-Clause License #
 # Copyright (c) 2021-2022, Hewlett Packard Enterprise
 # All rights reserved.
 #
@@ -38,6 +37,7 @@ def __init__(
         run_command="",
         run_args=None,
         env_vars=None,
+        container=None,
         **kwargs,
     ):
         """Run parameters for a ``Model``
@@ -69,11 +69,19 @@ def __init__(
         :type run_args: dict[str, str], optional
         :param env_vars: environment vars to launch job with, defaults to None
         :type env_vars: dict[str, str], optional
+        :param container: container type for workload (e.g. "singularity"), defaults to None
+        :type container: Container, optional
         """
-        self.exe = [expand_exe_path(exe)]
+        # Do not expand executable if running within a container
+        if container:
+            self.exe = [exe]
+        else:
+            self.exe = [expand_exe_path(exe)]
+
         self.exe_args = self._set_exe_args(exe_args)
         self.run_args = init_default({}, run_args, dict)
         self.env_vars = init_default({}, env_vars, dict)
+        self.container = container
         self._run_command = run_command
         self.in_batch = False
         self.colocated_db_settings = None
@@ -344,13 +352,15 @@ def run_command(self):
         :returns: launch binary e.g. mpiexec
         :type: str | None
         """
-        if self._run_command:
-            if is_valid_cmd(self._run_command):
+        cmd = self._run_command
+
+        if cmd:
+            if is_valid_cmd(cmd):
                 # command is valid and will be expanded
-                return expand_exe_path(self._run_command)
+                return expand_exe_path(cmd)
             # command is not valid, so return it as is
             # it may be on the compute nodes but not local machine
-            return self._run_command
+            return cmd
         # run without run command
         return None

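The `__init__` change in `base.py` skips path expansion when a container is supplied, since the executable is expected to exist inside the image rather than on the host. A self-contained sketch of that branch is shown here. `resolve_exe` is a hypothetical name, and because `expand_exe_path` itself is not shown in this diff, the `shutil.which` lookup is only an assumption about what host-side expansion might look like.

```python
import shutil
from typing import List, Optional


def resolve_exe(exe: str, container: Optional[object] = None) -> List[str]:
    """Hypothetical stand-in for the `self.exe` assignment in `RunSettings.__init__`."""
    if container:
        # The executable lives inside the image, so the host PATH is irrelevant
        # and the name is passed through unchanged.
        return [exe]
    # Assumption: host-side expansion resolves the name against PATH.
    full_path = shutil.which(exe)
    if full_path is None:
        raise FileNotFoundError(f"Could not find executable on host: {exe}")
    return [full_path]


if __name__ == "__main__":
    # With a container, even a name that is missing on the host is accepted.
    print(resolve_exe("python", container="singularity"))
```

One consequence of this design is that a misspelled executable is no longer caught when the settings object is created; the error surfaces only when the containerized job starts.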
116 changes: 116 additions & 0 deletions smartsim/settings/containers.py
@@ -0,0 +1,116 @@
import shutil
from ..log import get_logger

logger = get_logger(__name__)

class Container():
    '''Base class for container types in SmartSim.

    Container types are used to embed all the information needed to
    launch a workload within a container into a single object.

    :param image: local or remote path to container image
    :type image: str
    :param args: arguments to container command
    :type args: str | list[str], optional
    :param mount: paths to mount (bind) from host machine into image.
    :type mount: str | list[str] | dict[str, str], optional
    '''

    def __init__(self, image, args='', mount=''):
        # Validate types
        if not isinstance(image, str):
            raise TypeError('image must be a str')
        elif not isinstance(args, (str, list)):
            raise TypeError('args must be a str | list')
        elif not isinstance(mount, (str, list, dict)):
            raise TypeError('mount must be a str | list | dict')

        self.image = image
        self.args = args
        self.mount = mount

    def _containerized_run_command(self, run_command: str):
        '''Return modified run_command with container commands prepended.

        :param run_command: run command from a RunSettings class
        :type run_command: str
        '''
        raise NotImplementedError(f"Containerized run command specification not implemented for this Container type: {type(self)}")


class Singularity(Container):
    '''Singularity (apptainer) container type.

    .. note::

        Singularity integration is currently tested with
        `Apptainer 1.0 <https://apptainer.org/docs/user/1.0/index.html>`_
        with slurm and PBS workload managers only.

        Also, note that user-defined bind paths (``mount`` argument) may be
        disabled by a
        `system administrator <https://apptainer.org/docs/admin/1.0/configfiles.html#bind-mount-management>`_


    :param image: local or remote path to container image, e.g. 'docker://sylabsio/lolcow'
    :type image: str
    :param args: arguments to 'singularity exec' command
    :type args: str | list[str], optional
    :param mount: paths to mount (bind) from host machine into image.
    :type mount: str | list[str] | dict[str, str], optional
    '''

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def _container_cmds(self):
        '''Return list of container commands to be inserted before exe.
        Container members are validated during this call.

        :raises TypeError: if object members are invalid types
        '''
        serialized_args = ''
        if self.args:
            # Serialize args into a str
            if isinstance(self.args, str):
                serialized_args = self.args
            elif isinstance(self.args, list):
                serialized_args = ' '.join(self.args)
            else:
                raise TypeError('self.args must be a str | list')

        serialized_mount = ''
        if self.mount:
            if isinstance(self.mount, str):
                serialized_mount = self.mount
            elif isinstance(self.mount, list):
                serialized_mount = ','.join(self.mount)
            elif isinstance(self.mount, dict):
                paths = []
                for host_path,img_path in self.mount.items():
                    if img_path:
                        paths.append(f'{host_path}:{img_path}')
                    else:
                        paths.append(host_path)
                serialized_mount = ','.join(paths)
            else:
                raise TypeError('self.mount must be str | list | dict')

        # Find full path to singularity
        singularity = shutil.which('singularity')

        # Some systems have singularity available on compute nodes only,
        # so warn instead of error
        if not singularity:
            logger.warning('Unable to find singularity. Continuing in case singularity is available on compute node')

        # Construct containerized launch command
        cmd_list = [singularity, 'exec']
        if serialized_args:
            cmd_list.append(serialized_args)
        if serialized_mount:
            cmd_list.extend(['--bind', serialized_mount])
        cmd_list.append(self.image)

        return cmd_list
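
To make the serialization rules in `_container_cmds` concrete, here is a standalone sketch of the same logic that runs without SmartSim installed. The function names `serialize_mount` and `singularity_exec_cmd` are hypothetical. The `singularity` parameter, which defaults to the bare command name instead of calling `shutil.which`, is an assumption made so the output does not depend on the host.

```python
from typing import Dict, List, Union

Mount = Union[str, List[str], Dict[str, str]]


def serialize_mount(mount: Mount) -> str:
    """Turn a str, list, or dict of bind paths into a single `--bind` value."""
    if isinstance(mount, str):
        return mount
    if isinstance(mount, list):
        return ",".join(mount)
    if isinstance(mount, dict):
        # An empty image path keeps only the host path in the bind spec.
        return ",".join(
            f"{host}:{img}" if img else host for host, img in mount.items()
        )
    raise TypeError("mount must be str | list | dict")


def singularity_exec_cmd(
    image: str,
    args: Union[str, List[str]] = "",
    mount: Mount = "",
    singularity: str = "singularity",  # assumption: bare name instead of shutil.which
) -> List[str]:
    """Build the command prefix that is placed in front of the executable."""
    if isinstance(args, list):
        args_str = " ".join(args)
    elif isinstance(args, str):
        args_str = args
    else:
        raise TypeError("args must be a str | list")

    cmd = [singularity, "exec"]
    if args_str:
        # As in the diff, all args end up in one list element.
        cmd.append(args_str)
    if mount:
        cmd.extend(["--bind", serialize_mount(mount)])
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    print(singularity_exec_cmd(
        "docker://alrigazzi/smartsim-testing",
        args="--contain",
        mount={"/lus/scratch": "/data", "/tmp": ""},
    ))
```

Two details of the diff are worth noting when reading this sketch. Multiple `args` are joined into a single string and appended as one list element, and when `shutil.which('singularity')` finds nothing, the diff logs a warning and still places that `None` result at the front of the returned list.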
5 changes: 3 additions & 2 deletions smartsim/settings/settings.py
@@ -95,6 +95,7 @@ def create_run_settings(
     run_command="auto",
     run_args=None,
     env_vars=None,
+    container=None,
     **kwargs,
 ):
     """Create a ``RunSettings`` instance.
@@ -163,9 +164,9 @@ def _detect_command(launcher):

     # if user specified and supported or auto detection worked
     if run_command and run_command in supported:
-        return supported[run_command](exe, exe_args, run_args, env_vars, **kwargs)
+        return supported[run_command](exe, exe_args, run_args, env_vars, container=container, **kwargs)

     # 1) user specified and not implementation in SmartSim
     # 2) user supplied run_command=None
     # 3) local launcher being used and default of "auto" was passed.
-    return RunSettings(exe, exe_args, run_command, run_args, env_vars)
+    return RunSettings(exe, exe_args, run_command, run_args, env_vars, container=container)
4 changes: 4 additions & 0 deletions smartsim/settings/slurmSettings.py
@@ -89,6 +89,10 @@ def make_mpmd(self, srun_settings):
            raise SSUnsupportedError(
                "Colocated models cannot be run as a mpmd workload"
            )
        if self.container:
            raise SSUnsupportedError(
                "Containerized MPMD workloads are not yet supported."
            )
        self.mpmd.append(srun_settings)

    def set_hostlist(self, host_list):