Fail to detect GPU on Bottlerocket v1.19 within AWS g4dn instance #3937
I just tried to add the following environment variables to our Dockerfile:
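(The exact variables were not captured in this copy of the thread; the sketch below assumes the standard NVIDIA container runtime visibility variables were the ones tried.)

```dockerfile
# Assumed variables: standard NVIDIA container runtime settings that make
# all GPUs and driver capabilities visible to the container.
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=all
```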
and it didn't help.
Hi @Discipe, thanks for the issue. I'll start taking a look into this.
Confirmed the issue on my end on 1.19.5; I'll keep this separate from #3916 while we await a response from the author of that issue. Using the
I launched a simple cluster w/ g4dn.xlarge node:
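(The cluster definition was not preserved here; a minimal eksctl sketch for a single Bottlerocket g4dn.xlarge node group, with cluster name and region purely illustrative, could look like this.)

```yaml
# Hypothetical eksctl config: one-node Bottlerocket managed node group
# on a g4dn.xlarge instance, matching the setup described above.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpu-repro       # illustrative name
  region: us-west-2     # illustrative region
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
```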
and a basic pod:
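(The pod manifest was likewise not preserved; a basic sketch that mirrors the reported setup, with a CPU limit only and no `nvidia.com/gpu` request, image and names illustrative, might be:)

```yaml
# Hypothetical repro pod: requests no GPU resource, relying on the old
# implicit GPU visibility, and simply runs nvidia-smi.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          cpu: "1"
```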
Observed:
For completeness, I checked on Bottlerocket 1.19.0 (ami-00b3c0d9c67029782) and have not seen this issue:
This might be related to #3718.
I would like to clarify something in the response.
Hi @Discipe, my apologies for the confusion. The first response was using Bottlerocket 1.19.5, the latest release (I've updated the comment to reflect that). My follow-up was testing with Bottlerocket 1.19.0.
Yes, this is potentially causing the issue. We're working to expose settings that will let users avoid it.
@Discipe, have you heard of TimeSlicing? We don't support it yet, but we are planning to. I'm asking because I want to know whether this feature could fit your need to oversubscribe a GPU while letting the orchestrator (k8s) manage the resources. This feature would let you further control access to the GPUs instead of granting blanket access to all the GPUs in all the pods that have
One caveat of
Hello @arnaldo2792, we are aware of TimeSlicing, and I agree with your description of its benefits. We are currently testing time slicing outside of a K8s cluster,
but I don't think it is supported in any way by GPU drivers or containers. Let me provide some additional context :)
There are two issues with GPU slicing support: we need it supported by Bottlerocket and by AWS EKS (more specifically, by Karpenter) to enable node and pod scaling based on GPU slices. As far as I'm aware, support from AWS is not guaranteed by the end of the year, but if it does land, that should solve both of our problems. So yes, we would like to see TimeSlicing support in Bottlerocket! :)

I just noticed that the NVIDIA device plugin supports resource allocation using MPS (which is what we are using implicitly right now, if I understand correctly): https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#with-cuda-mps
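For reference, the sharing configuration format described in that README looks roughly like the sketch below (the replica count is just an example; MPS uses the same shape under `sharing.mps`):

```yaml
# NVIDIA k8s-device-plugin config: advertise each physical GPU as
# multiple nvidia.com/gpu replicas via time-slicing.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```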
A summarized update on this issue:
- As of Bottlerocket 1.23.0, the old-style GPU sharing can be re-enabled via the
- As of Bottlerocket 1.25.0, NVIDIA GPU TimeSlicing features are available in Bottlerocket as an alternative, again settable via the
We are running Bottlerocket on an AWS EKS g4dn instance. Because we are sharing a single GPU instance across multiple pods, we are specifying CPU limits only for our pods.
Example:
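(The original example block was not preserved; the pattern described, a CPU limit only with no `nvidia.com/gpu` request, would look roughly like this, values illustrative:)

```yaml
# Hypothetical container resources stanza: CPU limit only, no GPU request,
# so the GPU ends up shared implicitly across pods.
resources:
  limits:
    cpu: "2"
```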
It worked fine with Bottlerocket 1.17 and stopped working on 1.19 (we didn't test it on 1.18).
Image I'm using:
We have a minimal reproduction example that works on 1.17 and breaks on 1.19.
Python script that is used in the Dockerfile below:
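(The script itself was not preserved here; a minimal hypothetical stand-in that checks GPU visibility from inside the container, assuming `nvidia-smi` is injected by the NVIDIA container runtime, might be:)

```python
# Hypothetical check_gpu.py: exits 0 if the container can see a GPU,
# exits 1 otherwise. Stands in for the elided reproduction script.
import subprocess
import sys

try:
    out = subprocess.run(
        ["nvidia-smi", "--list-gpus"],
        capture_output=True, text=True, check=True,
    )
    print("GPU detected:")
    print(out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError) as err:
    print(f"No GPU visible in this container: {err}", file=sys.stderr)
    sys.exit(1)
```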
Dockerfile (yes, this is as short as you can get with all this CUDA stuff):
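(The Dockerfile was also not preserved; a minimal sketch under the same assumptions, with the CUDA base image tag and script name purely illustrative:)

```dockerfile
# Hypothetical reconstruction of the minimal repro image.
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY check_gpu.py /check_gpu.py
CMD ["python3", "/check_gpu.py"]
```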
What is expected to happen:
What actually happened:
How to reproduce the problem:
Run the container on Bottlerocket 1.19.
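(With the sketches above, reproduction amounts to scheduling the pod onto a 1.19 node and checking its logs; manifest and pod names are assumed from those sketches.)

```sh
# Hypothetical reproduction commands, assuming the pod manifest above.
kubectl apply -f gpu-check-pod.yaml
kubectl logs pod/gpu-check
```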
The issue looks similar to #3916