[BUG] Containers can lose access to the GPU #3601
ghost added the action-required label May 6, 2023
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
We're working on it
This will be fixed in the 202306.07.0 AKS VM image. Ensure you deploy the NVIDIA device plugin with PASS_DEVICE_SPECS set to true, either via the environment variable or the CLI.
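For anyone unsure what setting PASS_DEVICE_SPECS "via env" looks like in practice, here is a minimal sketch in Python that adds the variable to every container of a device plugin DaemonSet manifest held as a plain dict. The helper name `with_pass_device_specs`, the sample manifest, and the `<device-plugin-image>` placeholder are assumptions for illustration and not taken from this thread; check them against the manifest you actually deploy.

```python
import copy


def with_pass_device_specs(daemonset: dict) -> dict:
    """Return a copy of a DaemonSet manifest with PASS_DEVICE_SPECS=true
    set on every container (hypothetical helper, for illustration only)."""
    patched = copy.deepcopy(daemonset)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        # Drop any existing entry so the variable is not defined twice.
        env = [e for e in container.get("env", [])
               if e.get("name") != "PASS_DEVICE_SPECS"]
        # Kubernetes env values must be strings, hence "true" and not True.
        env.append({"name": "PASS_DEVICE_SPECS", "value": "true"})
        container["env"] = env
    return patched


# Assumed, heavily trimmed manifest; names and image are placeholders.
manifest = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "nvidia-device-plugin-daemonset"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "nvidia-device-plugin-ctr",
                        "image": "<device-plugin-image>",
                        "env": [{"name": "FAIL_ON_INIT_ERROR", "value": "false"}],
                    }
                ]
            }
        }
    },
}

patched_manifest = with_pass_device_specs(manifest)
patched_env = patched_manifest["spec"]["template"]["spec"]["containers"][0]["env"]
```

The helper works on a copy, so the original manifest is left untouched and the patched dict can be serialized and applied however you normally deploy the plugin.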
ghost added the action-required label Jul 5, 2023
I think this is good... let me know if not.
Describe the bug
There is currently a known issue where containers can lose access to the GPU after a while and the only solution is to recreate the pod.
There are several possible workarounds:
However, all of these workarounds require a level of access that is not feasible on Azure AKS (like changing container daemon settings). Would it be possible for you to integrate these workarounds in the VM images?