Integrate zeusd into zeus.device.gpu (#85)

ml-energy · May 30, 2024 · f1857d3
1 parent 9f9394f
Showing 30 changed files with 1,054 additions and 562 deletions.
diff --git a/.github/workflows/check_homepage_build.yaml b/.github/workflows/check_homepage_build.yaml
@@ -37,4 +37,4 @@ jobs:
       - name: Install other homepage dependencies
         run: pip install '.[docs]'
       - name: Build homepage
-        run: mkdocs build --verbose
+        run: mkdocs build --verbose --strict
diff --git a/.github/workflows/deploy_homepage.yaml b/.github/workflows/deploy_homepage.yaml
@@ -17,7 +17,7 @@ env:
 jobs:
   deploy:
     runs-on: ubuntu-latest
-    if: github.event.repository.fork == false
+    if: github.repository_owner == 'ml-energy'
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4

diff --git a/.github/workflows/publish_crates_io.yaml b/.github/workflows/publish_crates_io.yaml
@@ -0,0 +1,24 @@
+name: Release
+
+on:
+  push:
+    tags:
+      - zeusd-v*
+
+jobs:
+  cargo-publish:
+    if: github.repository_owner == 'ml-energy'
+    runs-on: ubuntu-latest
+    env:
+      CARGO_TERM_COLOR: always
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          sparse-checkout: zeusd
+      - name: Publish to crates.io
+        uses: katyo/publish-crates@v2
+        with:
+          path: zeusd
+          registry-token: ${{ secrets.CRATES_IO_TOKEN }}
+          check-repo: false
diff --git a/.github/workflows/publish_pypi.yaml b/.github/workflows/publish_pypi.yaml
@@ -3,12 +3,12 @@ name: Publish Python package to PyPI
 on:
   push:
     tags:
-      - v*
+      - zeus-v*
 
 jobs:
   publish:
     runs-on: ubuntu-latest
-    if: github.event.repository.fork == false
+    if: github.repository_owner == 'ml-energy'
     steps:
       - name: Checkout repository
         uses: actions/checkout@v3

diff --git a/.github/workflows/push_docker.yaml b/.github/workflows/push_docker.yaml
@@ -5,7 +5,7 @@ on:
     branches:
       - master
     tags:
-      - v*
+      - zeus-v*
     paths:
       - '.github/workflows/push_docker.yaml'
       - 'capriccio/**'
@@ -21,6 +21,7 @@ on:
 
 jobs:
   build_and_push:
+    if: github.repository_owner == 'ml-energy'
     runs-on: ubuntu-latest
     steps:
       - name: Remove unnecessary files
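
Note on the tag filters changed above: the Python package workflows now trigger on `zeus-v*` and the crate workflow on `zeusd-v*`, so the two release streams should no longer share tags. The sketch below approximates the filter with Python's `fnmatch`, which is an assumption: GitHub Actions uses its own glob rules, although they behave the same way for these simple prefix patterns. The `PATTERNS` labels and the version numbers are hypothetical.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Tag filters from the workflows in this commit, plus the old catch-all.
PATTERNS = {"old": "v*", "python": "zeus-v*", "daemon": "zeusd-v*"}


def matching_patterns(tag: str) -> list[str]:
    """Return the labels of all patterns that match the given tag."""
    return [name for name, pattern in PATTERNS.items() if fnmatchcase(tag, pattern)]


# Version numbers are made up for illustration.
print(matching_patterns("zeus-v0.9.0"))   # ['python']
print(matching_patterns("zeusd-v0.1.0"))  # ['daemon']
print(matching_patterns("v0.8.2"))        # ['old']
```

Each tag matches exactly one pattern, which is what lets the PyPI and crates.io releases be cut independently.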

diff --git a/.github/workflows/fmt_lint_test.yaml → .github/workflows/zeus_fmt_lint_test.yaml b/.github/workflows/fmt_lint_test.yaml → .github/workflows/zeus_fmt_lint_test.yaml
@@ -1,9 +1,9 @@
-name: Check format, lint, and test
+name: (Zeus) Check format, lint, and test
 
 on:
   push:
     paths:
-      - '.github/workflows/fmt_lint_test.yaml'
+      - '.github/workflows/zeus_fmt_lint_test.yaml'
       - 'zeus/**'
       - 'tests/**'
       - 'capriccio/*.py'
@@ -14,7 +14,7 @@ on:
 
 # Jobs initiated by previous pushes get cancelled by a new push.
 concurrency:
-  group: ${{ github.ref }}-lint-and-test
+  group: ${{ github.ref }}-zeus-lint-and-test
   cancel-in-progress: true
 
 jobs:

diff --git a/.github/workflows/zeusd_fmt_lint_test.yaml b/.github/workflows/zeusd_fmt_lint_test.yaml
@@ -1,4 +1,4 @@
-name: Check format, lint, and test for Zeusd
+name: (Zeusd) Check format, lint, and test
 
 on:
   push:

diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md
@@ -83,32 +83,46 @@ docker build -t mlenergy/zeus:master --build-arg TARGETARCH=amd64 -f docker/Dock
 
 ## System privileges
 
-!!! Important "Nevermind if you're just measuring"
-    No special system-level privileges are needed if you are just measuring time and energy.
+!!! Important "Never mind if you're just measuring GPU energy"
+    No special system-level privileges are needed if you are just measuring GPU time and energy.
     However, when you're looking into optimizing energy and if that method requires changing the GPU's power limit or SM frequency, special system-level privileges are required.
 
 ### When are extra system privileges needed?
 
 The Linux capability `SYS_ADMIN` is required in order to change the GPU's power limit or frequency.
 Specifically, this is needed by the [`GlobalPowerLimitOptimizer`][zeus.optimizer.power_limit.GlobalPowerLimitOptimizer] and the [`PipelineFrequencyOptimizer`][zeus.optimizer.pipeline_frequency.PipelineFrequencyOptimizer].
 
-### Obtaining privileges with Docker
+### Option 1: Running applications in a Docker container
 
 Using Docker, you can pass `--cap-add SYS_ADMIN` to `docker run`.
 Since this significantly simplifies running Zeus, we recommend users to consider this option first.
-Also, since Zeus is running inside a container, there is less potential for damage even if things go wrong.
+This is also possible for Kubernetes Pods with `securityContext.capabilities.add` in container specs ([docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container){.external}).
 
-### Obtaining privileges with `sudo`
+### Option 2: Deploying the Zeus daemon (`zeusd`)
 
-If you cannot use Docker, you can run your application with `sudo`.
-This is not recommended due to security reasons, but it will work.
+Granting `SYS_ADMIN` to the entire application just to be able to change the GPU's configuration is [granting too much](https://en.wikipedia.org/wiki/Principle_of_least_privilege){.external}.
+Instead, Zeus provides the [**Zeus daemon** or `zeusd`](https://github.com/ml-energy/zeus/tree/master/zeusd){.external}, which is a simple server/daemon process that is designed to run with admin privileges and exposes the minimal set of APIs wrapping NVML methods for changing the GPU's configuration.
+Then, an unprivileged (i.e., run normally by any user) application can ask `zeusd` via a Unix Domain Socket to change the local node's GPU configuration on its behalf.
 
-### GPU management server
+To deploy `zeusd`:
 
-It is fair to say that granting `SYS_ADMIN` to the application is itself giving too much privilege.
-We just need to be able to change the GPU's power limit or frequency, instead of giving the process privileges to administer the system.
-Thus, to reduce the attack surface, we are considering solutions such as a separate GPU management server process on a node ([tracking issue](https://github.com/ml-energy/zeus/issues/29)), which has `SYS_ADMIN`.
-Then, an unprivileged application process can ask the GPU management server via a UDS to change the GPU's configuration on its behalf.
+``` { .sh .annotate }
+# Install zeusd
+cargo install zeusd
+
+# Run zeusd with admin privileges
+sudo zeusd \
+    --socket-path /var/run/zeusd.sock \  # (1)!
+    --socket-permissions 666            # (2)!
+```
+
+1. Unix domain socket path that `zeusd` listens on.
+2. Applications need *write* access to the socket to be able to talk to `zeusd`. This string is interpreted as [UNIX file permissions](https://en.wikipedia.org/wiki/File-system_permissions#Numeric_notation).
+
+### Option 3: Running applications with `sudo`
+
+This is probably the worst option.
+However, if none of the options above work, you can run your application with `sudo`, which automatically has `SYS_ADMIN`.
 
 ## Next Steps
 

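Note on `--socket-permissions 666` in the documentation change above: the docs say the string is read as numeric UNIX file permissions. The sketch below shows what such a string would mean for a socket file, assuming it is parsed as octal. How `zeusd` parses it internally is not shown in this diff, and both helper names are hypothetical.

```python
import stat


def describe_socket_mode(permissions: str) -> str:
    """Render an octal permission string the way `ls -l` shows a socket."""
    mode = int(permissions, 8)  # e.g. "666" -> 0o666
    return stat.filemode(stat.S_IFSOCK | mode)


def others_can_write(permissions: str) -> bool:
    """Check whether users outside the owner and group get write access."""
    return bool(int(permissions, 8) & stat.S_IWOTH)


print(describe_socket_mode("666"))  # srw-rw-rw-
print(others_can_write("666"))      # True
print(others_can_write("660"))      # False
```

With `666`, any local user can write to the socket and therefore ask the daemon to change GPU settings. A tighter mode such as `660` would limit that to the socket's owner and group.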
diff --git a/examples/huggingface/run_clm.py b/examples/huggingface/run_clm.py
@@ -56,7 +56,7 @@
 from transformers.utils.versions import require_version
 
 from zeus.monitor import ZeusMonitor
-from zeus.optimizer import HFGlobalPowerLimitOptimizer
+from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer
 
 # Will error if the minimal version of Transformers is not installed. Remove at your own risks.
 check_min_version("4.37.2")

diff --git a/pyproject.toml b/pyproject.toml
@@ -27,6 +27,7 @@ dependencies = [
     "pydantic",  # The `zeus.utils.pydantic_v1` compatibility layer allows us to unpin Pydantic in most cases.
     "rich",
     "tyro",
+    "httpx"
 ]
 dynamic = ["version"]
 
@@ -37,13 +38,13 @@ Documentation = "https://ml.energy/zeus"
 
 [project.optional-dependencies]
 # One day FastAPI will drop support for Pydantic V1. Then fastapi has to be pinned as well.
-pfo = ["pydantic<2", "httpx"]
-pfo-server = ["fastapi[all]", "pydantic<2", "lowtime", "aiofiles", "httpx", "torch"]
-bso = ["pydantic<2", "httpx"]
+pfo = ["pydantic<2"]
+pfo-server = ["fastapi[all]", "pydantic<2", "lowtime", "aiofiles", "torch"]
+bso = ["pydantic<2"]
 bso-server = ["fastapi[all]", "sqlalchemy", "pydantic<2", "python-dotenv"]
 migration = ["alembic", "sqlalchemy", "pydantic<2", "python-dotenv"]
 lint = ["ruff", "black==22.6.0", "pyright", "pandas-stubs", "transformers"]
-test = ["fastapi[all]", "sqlalchemy", "pydantic<2", "httpx", "pytest==7.3.2", "pytest-mock==3.10.0", "pytest-xdist==3.3.1", "anyio==3.7.1", "aiosqlite==0.20.0"]
+test = ["fastapi[all]", "sqlalchemy", "pydantic<2", "pytest==7.3.2", "pytest-mock==3.10.0", "pytest-xdist==3.3.1", "anyio==3.7.1", "aiosqlite==0.20.0"]
 docs = ["mkdocs-material[imaging]==9.5.19", "mkdocstrings[python]==0.25.0", "mkdocs-gen-files==0.5.0", "mkdocs-literate-nav==0.6.1", "mkdocs-section-index==0.3.9", "mkdocs-redirects==1.2.1", "urllib3<2", "black"]
 # greenlet is for supporting apple mac silicon for sqlalchemy(https://docs.sqlalchemy.org/en/20/faq/installation.html)
 dev = ["zeus-ml[pfo-server,bso,bso-server,migration,lint,test]", "greenlet"]

diff --git a/zeus/device/common.py b/zeus/device/common.py
@@ -0,0 +1,68 @@
+"""Common utilities for device management."""
+
+from __future__ import annotations
+
+import os
+import ctypes
+from functools import lru_cache
+
+from zeus.utils.logging import get_logger
+
+logger = get_logger(__name__)
+
+
+@lru_cache(maxsize=1)
+def has_sys_admin() -> bool:
+    """Check if the current process has `SYS_ADMIN` capabilities."""
+    # First try to read procfs.
+    try:
+        with open("/proc/self/status") as f:
+            for line in f:
+                if line.startswith("CapEff"):
+                    bitmask = int(line.strip().split()[1], 16)
+                    has = bool(bitmask & (1 << 21))
+                    logger.info(
+                        "Read security capabilities from /proc/self/status -- SYS_ADMIN: %s",
+                        has,
+                    )
+                    return has
+    except Exception:
+        logger.info("Failed to read capabilities from /proc/self/status", exc_info=True)
+
+    # If that fails, try to use the capget syscall.
+    class CapHeader(ctypes.Structure):
+        _fields_ = [("version", ctypes.c_uint32), ("pid", ctypes.c_int)]
+
+    class CapData(ctypes.Structure):
+        _fields_ = [
+            ("effective", ctypes.c_uint32),
+            ("permitted", ctypes.c_uint32),
+            ("inheritable", ctypes.c_uint32),
+        ]
+
+    # Attempt to load libc and set up capget
+    try:
+        libc = ctypes.CDLL("libc.so.6", use_errno=True)
+        capget = libc.capget
+        capget.argtypes = [ctypes.POINTER(CapHeader), ctypes.POINTER(CapData)]
+        capget.restype = ctypes.c_int
+    except Exception:
+        logger.info("Failed to load libc.so.6", exc_info=True)
+        return False
+
+    # Initialize the header and data structures
+    header = CapHeader(version=0x20080522, pid=0)  # Use the current process
+    data = (CapData * 2)()  # Capability version 3 fills two structs.
+
+    # Call capget and check for errors
+    if capget(ctypes.byref(header), data) != 0:
+        errno = ctypes.get_errno()
+        logger.info(
+            "capget failed with error: %s (errno %s)", os.strerror(errno), errno
+        )
+        return False
+
+    bitmask = data[0].effective
+    has = bool(bitmask & (1 << 21))
+    logger.info("Read security capabilities from capget -- SYS_ADMIN: %s", has)
+    return has
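
Note on the procfs branch of `has_sys_admin` above: it parses the hexadecimal `CapEff` mask and tests bit 21, which is `CAP_SYS_ADMIN`. The sketch below isolates that parsing step so it can be run on plain text without reading `/proc`. The function name and the sample status snippets are hypothetical and not part of the commit.

```python
CAP_SYS_ADMIN_BIT = 21  # Bit index of CAP_SYS_ADMIN in the capability mask.


def cap_eff_has_sys_admin(status_text: str) -> bool:
    """Return True if the CapEff line of the status text has the SYS_ADMIN bit."""
    for line in status_text.splitlines():
        if line.startswith("CapEff"):
            bitmask = int(line.split()[1], 16)
            return bool(bitmask & (1 << CAP_SYS_ADMIN_BIT))
    return False  # No CapEff line was found.


# Sample snippets with made-up values, not captured from a real system.
all_caps = "Name:\tpython\nCapEff:\t000001ffffffffff\n"
no_caps = "Name:\tpython\nCapEff:\t0000000000000000\n"
only_sys_admin = f"CapEff:\t{1 << CAP_SYS_ADMIN_BIT:016x}\n"

print(cap_eff_has_sys_admin(all_caps))        # True
print(cap_eff_has_sys_admin(no_caps))         # False
print(cap_eff_has_sys_admin(only_sys_admin))  # True
```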
diff --git a/zeus/device/exception.py b/zeus/device/exception.py
@@ -8,3 +8,11 @@ class ZeusBaseGPUError(ZeusBaseError):
     def __init__(self, message: str) -> None:
         """Initialize Base Zeus Exception."""
         super().__init__(message)
+
+
+class ZeusdError(ZeusBaseGPUError):
+    """Exception class for Zeus daemon-related errors."""
+
+    def __init__(self, message: str) -> None:
+        """Initialize Zeusd error."""
+        super().__init__(message)
diff --git a/zeus/device/gpu/__init__.py b/zeus/device/gpu/__init__.py
@@ -1,37 +1,55 @@
-"""GPU device module for Zeus. Abstraction of GPU devices.
+"""Abstraction layer for GPU devices.
+
+The main function of this module is [`get_gpus`][zeus.device.gpu.get_gpus],
+which returns a GPU Manager object specific to the platform.
+
+!!! Important
+    In theory, any NVIDIA GPU would be supported.
+    On the other hand, for AMD GPUs, we currently only support ROCm 6.0 and later.
+
+## Getting handles to GPUs
+
+The main API exported from this module is the `get_gpus` function. It returns either
+[`NVIDIAGPUs`][zeus.device.gpu.nvidia.NVIDIAGPUs] or [`AMDGPUs`][zeus.device.gpu.amd.AMDGPUs]
+depending on the platform.
 
-The main function of this module is [`get_gpus`][zeus.device.gpu.get_gpus], which returns a GPU Manager object specific to the platform.
-To instantiate a GPU Manager object, you can do the following:
-    
 ```python
 from zeus.device import get_gpus
-gpus = get_gpus() # Returns NVIDIAGPUs() or AMDGPUs() depending on the platform.
+gpus = get_gpus()
 ```
 
-There exists a 1:1 mapping between specific library functions and methods implemented in the GPU Manager object.
-For example, for NVIDIA systems, if you wanted to do:
+## Calling GPU management APIs
+
+GPU management library APIs are mapped to methods on [`GPU`][zeus.device.gpu.common.GPU].
+
+For example, for NVIDIA GPUs (which use `pynvml`), you would have called:
 
 ```python
 handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
 constraints = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
 ```
 
-You can now do:
+With the Zeus GPU abstraction layer, you would now call:
 
 ```python
-gpus = get_gpus() # returns a NVIDIAGPUs object
-constraints =  gpus.getPowerManagementLimitConstraints(gpu_index)
+gpus = get_gpus() # returns an NVIDIAGPUs object
+constraints = gpus.getPowerManagementLimitConstraints(gpu_index)
 ```
 
-Class hierarchy:
+## Non-blocking calls
 
-- [`GPUs`][zeus.device.gpu.GPUs]: Abstract class for GPU managers.
-    - [`NVIDIAGPUs`][zeus.device.gpu.NVIDIAGPUs]: GPU manager for NVIDIA GPUs, initialize NVIDIAGPU objects.
-    - [`AMDGPUs`][zeus.device.gpu.AMDGPUs]: GPU manager for AMD GPUs, initialize AMDGPU objects.
-- [`GPU`][zeus.device.gpu.GPU]: Abstract class for GPU objects.
-    - [`NVIDIAGPU`][zeus.device.gpu.NVIDIAGPU]: GPU object for NVIDIA GPUs.
-    - [`AMDGPU`][zeus.device.gpu.AMDGPU]: GPU object for AMD GPUs.
+Some implementations of `GPU` support non-blocking calls to setters.
+If non-blocking calls are not supported, setting `block` will be ignored and the call will block.
+Check [`GPU.supports_nonblocking_setters`][zeus.device.gpu.common.GPU.supports_nonblocking_setters]
+to see if non-blocking calls are supported.
+Note that non-blocking calls will not raise exceptions even if the call fails.
 
+Currently, only [`ZeusdNVIDIAGPU`][zeus.device.gpu.nvidia.ZeusdNVIDIAGPU] supports non-blocking calls
+to methods that set the GPU's power limit, GPU frequency, memory frequency, and persistence mode.
+This is possible because the Zeus daemon supports a `block: bool` parameter in HTTP requests,
+which can be set to `False` to make the call return immediately without checking the result.
+
+## Error handling
 
 The following exceptions are defined in this module:
 
@@ -55,9 +73,8 @@
 - [`ZeusGPULibRMVersionMismatchError`][zeus.device.gpu.ZeusGPULibRMVersionMismatchError]: Error for library version mismatch.
 - [`ZeusGPUMemoryError`][zeus.device.gpu.ZeusGPUMemoryError]: Error for memory issues.
 - [`ZeusGPUUnknownError`][zeus.device.gpu.ZeusGPUUnknownError]: Error for unknown issues.
-
-
 """
+
 from __future__ import annotations
 
 from zeus.device.gpu.common import *
@@ -70,17 +87,21 @@
 
 
 def get_gpus(ensure_homogeneous: bool = False) -> GPUs:
-    """Initialize and return a singleton GPU monitoring object for NVIDIA or AMD GPUs.
+    """Initialize and return a singleton object for GPU management.
+
+    This function returns a GPU management object that aims to abstract
+    the underlying GPU vendor and their specific monitoring library
+    (pynvml for NVIDIA GPUs and amdsmi for AMD GPUs). Management APIs
+    are mapped to methods on the returned [`GPUs`][zeus.device.gpu.GPUs] object.
 
-    The function returns a GPU management object that aims to abstract the underlying GPU monitoring libraries
-    (pynvml for NVIDIA GPUs and amdsmi for AMD GPUs), and provides a 1:1 mapping between the methods in the object and related library functions.
+    GPU availability is checked in the following order:
 
-    This function attempts to initialize GPU monitoring using the pynvml library for NVIDIA GPUs
-    first. If pynvml is not available or fails to initialize, it then tries to use the amdsmi
-    library for AMD GPUs. If both attempts fail, it raises a ZeusErrorInit exception.
+    1. NVIDIA GPUs using `pynvml`
+    1. AMD GPUs using `amdsmi`
+    1. If both are unavailable, a `ZeusGPUInitError` is raised.
 
     Args:
-        ensure_homogeneous (bool, optional): If True, ensures that all tracked GPUs have the same name. False by default.
+        ensure_homogeneous (bool): If True, ensures that all tracked GPUs have the same name.
     """
     global _gpus
     if _gpus is not None: