Fail to detect GPU on Bottlerocket v1.19 within AWS g4dn instance #3937
I just tried to add the following environment variables to our Dockerfile:
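(The exact variables were not captured in this copy of the thread; the sketch below assumes the standard NVIDIA container runtime visibility variables were the ones tried.)

```dockerfile
# Assumed variables: standard NVIDIA container runtime settings that make
# all GPUs and driver capabilities visible to the container.
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=all
```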
and it didn't help.
Hi @Discipe, thanks for the issue. I'll start taking a look into this.
Confirmed the issue on my end on 1.19.5; I'll keep this separate from #3916 while we await a response from the author of that issue. Using the
I launched a simple cluster w/ g4dn.xlarge node:
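(The cluster definition was not preserved here; a minimal eksctl sketch for a single Bottlerocket g4dn.xlarge node group, with cluster name and region purely illustrative, could look like this.)

```yaml
# Hypothetical eksctl config: one-node Bottlerocket managed node group
# on a g4dn.xlarge instance, matching the setup described above.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpu-repro       # illustrative name
  region: us-west-2     # illustrative region
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
```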
and a basic pod:
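(The pod manifest was likewise not preserved; a basic sketch that mirrors the reported setup, with a CPU limit only and no `nvidia.com/gpu` request, image and names illustrative, might be:)

```yaml
# Hypothetical repro pod: requests no GPU resource, relying on the old
# implicit GPU visibility, and simply runs nvidia-smi.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          cpu: "1"
```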
Observed:
For completeness, I checked on Bottlerocket 1.19.0 (ami-00b3c0d9c67029782) and have not seen this issue:
This might be related to #3718.
I would like to clarify something in the response.
Hi @Discipe, my apologies for the confusion. The first response was using Bottlerocket 1.19.5, the latest release (I've updated the comment to reflect that). My follow-up was testing with Bottlerocket 1.19.0.
Yes, this is potentially causing the issue. We're working to expose settings that will let users avoid it.
@Discipe, have you heard of TimeSlicing? We don't support it yet, but we are planning to. I'm asking because I want to know whether this feature could fit your need to oversubscribe a GPU while letting the orchestrator (k8s) manage the resources. This feature would let you further control access to the GPUs instead of granting blanket access to all the GPUs in all the pods that have
One caveat of
Hello @arnaldo2792, we are aware of TimeSlicing, and I agree with your description of its benefits. We are currently testing time slicing outside of a K8s cluster,
but I don't think it is supported in any way by GPU drivers or containers. Let me provide some additional context :)
There are two issues with GPU slicing support: we need it supported by Bottlerocket and by AWS EKS (more specifically, by Karpenter) to enable node and pod scaling based on GPU slices. As far as I'm aware, support from AWS is not guaranteed by the end of the year, but if it does land, that should solve both of our problems. So yes, we would like to see TimeSlicing support in Bottlerocket! :)

I just noticed that the NVIDIA device plugin supports resource allocation using MPS (which is what we are using implicitly right now, if I understand correctly): https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#with-cuda-mps
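For reference, the sharing configuration format described in that README looks roughly like the sketch below (the replica count is just an example; MPS uses the same shape under `sharing.mps`):

```yaml
# NVIDIA k8s-device-plugin config: advertise each physical GPU as
# multiple nvidia.com/gpu replicas via time-slicing.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```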
A summarized update on this issue:
- As of Bottlerocket 1.23.0, the old-style GPU sharing can be re-enabled via the
- As of Bottlerocket 1.25.0, NVIDIA GPU TimeSlicing features are available in Bottlerocket as an alternative, again settable via the
We are running Bottlerocket on an AWS EKS g4dn instance. Because we are sharing a single GPU instance across multiple pods, we are specifying CPU limits only for our pods.
Example:
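(The original example block was not preserved; the pattern described, a CPU limit only with no `nvidia.com/gpu` request, would look roughly like this, values illustrative:)

```yaml
# Hypothetical container resources stanza: CPU limit only, no GPU request,
# so the GPU ends up shared implicitly across pods.
resources:
  limits:
    cpu: "2"
```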
It worked fine with Bottlerocket 1.17 and stopped working on 1.19 (we didn't test it on 1.18).
Image I'm using:
We have a minimal reproduction example that works on 1.17 and breaks on 1.19.
Python script that is used in the Dockerfile below:
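(The script itself was not preserved here; a minimal hypothetical stand-in that checks GPU visibility from inside the container, assuming `nvidia-smi` is injected by the NVIDIA container runtime, might be:)

```python
# Hypothetical check_gpu.py: exits 0 if the container can see a GPU,
# exits 1 otherwise. Stands in for the elided reproduction script.
import subprocess
import sys

try:
    out = subprocess.run(
        ["nvidia-smi", "--list-gpus"],
        capture_output=True, text=True, check=True,
    )
    print("GPU detected:")
    print(out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError) as err:
    print(f"No GPU visible in this container: {err}", file=sys.stderr)
    sys.exit(1)
```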
Dockerfile (yes, this is as short as you can get with all this CUDA stuff):
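(The Dockerfile was also not preserved; a minimal sketch under the same assumptions, with the CUDA base image tag and script name purely illustrative:)

```dockerfile
# Hypothetical reconstruction of the minimal repro image.
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY check_gpu.py /check_gpu.py
CMD ["python3", "/check_gpu.py"]
```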
What is expected to happen:
What actually happened:
How to reproduce the problem:
Run the container on Bottlerocket 1.19.
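(With the sketches above, reproduction amounts to scheduling the pod onto a 1.19 node and checking its logs; manifest and pod names are assumed from those sketches.)

```sh
# Hypothetical reproduction commands, assuming the pod manifest above.
kubectl apply -f gpu-check-pod.yaml
kubectl logs pod/gpu-check
```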
The issue looks similar to #3916