[BUG] Containers can lose access to the GPU #3601

Closed
OvervCW opened this issue Apr 11, 2023 · 5 comments

Comments

OvervCW commented Apr 11, 2023

Describe the bug
There is currently a known issue where containers can lose access to the GPU after a while and the only solution is to recreate the pod.

There are several possible workarounds:

However, all of these workarounds require a level of access that is not feasible on Azure AKS (like changing container daemon settings). Would it be possible for you to integrate these workarounds in the VM images?

@OvervCW OvervCW added the bug label Apr 11, 2023
@ghost ghost added the action-required label May 6, 2023

ghost commented May 11, 2023

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label May 11, 2023

ghost commented May 27, 2023

Issue needing attention of @Azure/aks-leads

@ghost ghost removed the action-required label Jun 2, 2023
@ghost ghost removed the Needs Attention 👋 Issues needs attention/assignee/owner label Jun 2, 2023
Member

palma21 commented Jun 2, 2023

We're working on it

@alexeldeib
Contributor

#3680 (comment)

This will be fixed in the 202306.07.0 AKS VM image.

Make sure you deploy the NVIDIA device plugin with PASS_DEVICE_SPECS set to true, either via the environment variable or the CLI flag.
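For anyone unsure what that looks like in practice, here is a rough sketch that generates a minimal device plugin DaemonSet with the environment variable set. The resource names, namespace, and image reference are assumptions for illustration (the `<version>` tag is a placeholder to fill in yourself), and a real deployment will likely also need tolerations and security settings that are omitted here.

```python
import json


def device_plugin_daemonset(image: str, pass_device_specs: bool = True) -> dict:
    """Build a minimal DaemonSet manifest for the NVIDIA device plugin.

    All names below (labels, container name, namespace) are illustrative
    assumptions, not values confirmed in this thread.
    """
    labels = {"name": "nvidia-device-plugin-ds"}
    container = {
        "name": "nvidia-device-plugin-ctr",
        "image": image,
        # The setting mentioned above, passed via the environment.
        "env": [
            {
                "name": "PASS_DEVICE_SPECS",
                "value": "true" if pass_device_specs else "false",
            }
        ],
        # The plugin registers with the kubelet through this directory.
        "volumeMounts": [
            {"name": "device-plugin", "mountPath": "/var/lib/kubelet/device-plugins"}
        ],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {
            "name": "nvidia-device-plugin-daemonset",
            "namespace": "kube-system",
        },
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [container],
                    "volumes": [
                        {
                            "name": "device-plugin",
                            "hostPath": {"path": "/var/lib/kubelet/device-plugins"},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image reference: replace <version> with the tag you actually use.
    manifest = device_plugin_daemonset("nvcr.io/nvidia/k8s-device-plugin:<version>")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output can be saved to a file and applied with `kubectl apply -f` once the image tag is filled in.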

@alexeldeib
Contributor

think this is good...let me know if not

4 participants