nVidia drivers change in memory management #1285
Replies: 82 comments 223 replies
-
Noticed this on 535. Nice that it sometimes avoids CUDA OOM, but there's a lot more system stress. Was wondering how I was suddenly able to do 2048x2048 tiling...
-
This sounds very much like AMD's Smart Access Memory, but that has to be enabled in the BIOS. Any chance that this could be prevented with a BIOS change?
-
My SD is getting too slow, especially with CN. I guess that happened after Nvidia's latest update. How can I fix it, please?
-
Is there an older NVIDIA driver to downgrade to? The 535 NVIDIA driver has made my SD 50 times slower.
-
actually my gut feeling is that they did it as a follow-up to the nVidia CEO's statement defending the choice that the just-launched RTX 4060Ti only has 8GB while some of the latest game titles require up to 10GB to even load - without force-enabling use of shared memory, their latest GPU would not work with the latest games. that's ok if you can say "older cards don't support latest games", but you can't really say that for a just-launched card.
actual quote from Computex 2023: Huang defended the 8GB of VRAM and told gamers to focus more on how that VRAM is managed: “Remember the frame buffer is not the memory of the computer — it is a cache. And how you manage the cache is a big deal. It is like any other cache. And yes, the bigger the cache is, the better. However, you’re trading off against so many things.”
and on why the latest games require more than 8GB of VRAM? because they are developed for consoles and only ported to PC without optimizations - game studios are rushing release dates. and the latest generation of consoles has 16GB of shared memory, so that's about 12GB for shaders.
-
The 532.03 driver's release notes have this:
Coincidence?
-
Sort of. It’s too easy to overdevelop on PC (and remember that all of these
games are developed on PC). Tons of games are designed for the 4090s the
developer PCs are rocking, and then people run into trouble when they’re
trying to get them to run on their 1060.
Most AAA titles run great on high-end hardware and the console they’re made
for. It’s everyone else that gets shafted.
On Wed, Jun 7, 2023 at 12:14 Aptronymist wrote:
and on why latest games require more than 8GB of VRAM? because they are
developed for consoles and only ported to PC without optimizations - game
studios are rushing release dates. and latest generation of consoles have
16GB of shared memory, so thats about 12GB for shaders.
That's so spot-on, it's the most idiotic thing that these companies
develop for consoles and then port to PC. It's got to be a lot easier to
downscale graphics to a console than to take your console graphics and
upscale them so they're actually good on a PC, (not to mention the
inevitable UI/control/camera issues), but they only give a crap about
raking in the quick cash with their new 2023 edition of the same game that
came out last year, and console games are a *huge* market. Ugh.
-
Is there a difference between Game Ready Driver and Studio Driver?
-
Thanks for the heads up, I was starting to think my setup was cursed or something. Returning to 531.79. If anyone finds earlier versions work better, please post in this thread 🙏
-
Actually, I have two Linux machines on 530.30.02 and I have suddenly started experiencing those slowdowns on both in the past couple of days. Same thing on automatic1111. Something else has changed. I pull from git and upgrade torch nightly every day.
-
So my 8GB card can train DreamBooth now?
-
I am using an RTX 3060 12GB with --medvram. Apart from that, I can now do far higher upscales without OOM.
-
I was wondering why everything was feeling so slow... Effing nvidia....
-
I have started experiencing freezes where image generation stops at 100% but never returns the finished image. Not sure if this is because of the driver update or something else.
-
Everyone here should put in a support ticket with nVidia asking them to make it so this can be disabled in the driver settings. Otherwise we might be stuck with this forever and eventually need to use super slow drivers.
-
This appears to be an issue for me with an RTX 3080 Ti. It wasn't happening until a few days ago, at least not that I noticed, but then it started. I often do image gen at 768x960, which is very comfortable for the 12GB of the 3080 Ti, but maybe every third image is over 40% slower than it should be for no apparent reason, and the GPU isn't even breaking a sweat at 7-8GB VRAM usage.
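For anyone who wants to put a number on "every third image is over 40% slower", here is a minimal sketch that flags slow runs relative to the median duration. The function name `flag_slow_runs`, the 1.4 factor, and the sample timings are all made up for illustration; feed it your own per-image times.

```python
from statistics import median


def flag_slow_runs(seconds, slowdown=1.4):
    """Return indices of runs slower than `slowdown` times the median duration."""
    if not seconds:
        return []
    baseline = median(seconds)
    return [i for i, s in enumerate(seconds) if s > baseline * slowdown]


# Hypothetical timings in seconds per image, not taken from any real log.
timings = [10.0, 10.2, 14.5, 10.1, 9.9, 14.8]
print(flag_slow_runs(timings))
```

Using the median rather than the mean keeps a few slow runs from dragging the baseline up with them.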
-
i'm experimenting with something, can multiple users post a line from their console log BEFORE and AFTER slowdown occurs:
for example, two runs with low and high resolution to trigger the bad behavior.
-
I've mentioned that 16 gigs of something are loaded into VRAM every second gen in the memleak bug you previously debugged with my logs, and the recent git pull never fixed that part of it...I managed to smooth it out slightly by setting torch garbage collection to 50...its current behavior is: first gen is fine, second gen has 16GB of VRAM stuck in use...3rd gen is very slow...4th gen dumps whatever is in VRAM, starting the cycle over...setting gc to 50 has it dump basically every gen for me; it still consumes 16GB but never triggers the slowdown unless I do batches of 5 or more...
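For readers unfamiliar with a "garbage collection threshold" setting like the one mentioned above, the general idea can be sketched as follows. This is not the app's actual implementation: `over_threshold` and `maybe_collect` are hypothetical names, and it is an assumption that the value 50 means "percent of VRAM in use". `torch` is optional here; without it only Python's own GC runs.

```python
import gc

try:
    import torch  # optional; the sketch falls back to plain gc without it
except ImportError:
    torch = None


def over_threshold(used_bytes, total_bytes, threshold_pct=50):
    """True when used memory is at or above threshold_pct percent of total."""
    return total_bytes > 0 and used_bytes * 100 >= total_bytes * threshold_pct


def maybe_collect(used_bytes, total_bytes, threshold_pct=50):
    """Run Python GC and release PyTorch's cached VRAM when over the threshold."""
    if not over_threshold(used_bytes, total_bytes, threshold_pct):
        return False
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Hands unused cached blocks back to the driver; live tensors stay put.
        torch.cuda.empty_cache()
    return True
```

On a CUDA machine the inputs could come from `torch.cuda.mem_get_info()`, which returns `(free, total)` in bytes, so used memory is `total - free`.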
-
This is awesome. With driver 546.01 Nvidia introduced an option to disable shared memory for CUDA by simply ticking a box in the graphics driver application menu. https://nvidia.custhelp.com/app/answers/detail/a_id/5490 So for those who prefer crashing instead of slowing down, the option is there now. Thank you Nvidia for listening to the AI community!
-
So since Nvidia solved the problem on their side, the issue is now just documenting in the installation notes that users should change the setting in the control panel (and have the correct driver version installed).
-
v546.01 VS Prefer No Sysmem Fallback 768x512, 3x (2304x1536) hires fix, RTX 2060 12GB. It seems to be a lot faster for me
-
I’ve been experiencing the same issue with my 3090 when I’m running hires in XL. I haven’t encountered this error in a long time, even with the previous driver.
On Oct 31, 2023, at 1:55 PM, Sinan Dinç wrote:
Yes on my 3070 Ti and Prefer No Sysmem Fallback enabled, I can confirm now without medvram, I get cuda OoM error. With medvram, I can generate normally like before. But if I use hires.fix or img2img, it fails with cuda error again. Before this driver, it was working without errors. So this feature is a regression for me. Luckily I don't use them much, I upscale with Extras tab.
-
My performance seems to have dropped by about 10% with the newest driver, even with sysmem fallback disabled. Anybody else noticing anything like that?
-
Ok, update to the above post.
-
The new driver does not do what the old driver did, even with the added setting. If you are very close to the memory limit and the allocation is a certain size, it will still fall back to system memory even if you tell it not to in the control panel (they do call it a preference in the settings). In some cases this is actually good, and spike VRAM loads get absorbed for a very small throughput hit. In other cases, like training, the VAE stage, very large image generations, or LLM generations of a certain size (2.5-bit 70B LLMs), it can cause massive throughput reductions, whereas with the previous driver things would have been just fine. If you have a really big allocation, then it triggers OOM (assuming system fallback is not preferred).
So you have to be in this narrow regime, close to the VRAM limit, for the swapping/throughput loss to occur. That makes it seem a little random. It works just fine if I reboot with the older drivers in these cases. It's really annoying if your workflow takes you to that regime. Basically NVIDIA still swaps to RAM a little too aggressively compared to the prior driver. But some workflows that may have OOMed due to spike loads now work better. I think that's where we're at, and probably where we'll be stuck for a while without
Edit: If your GPU supports TCC mode, that's still an option, though they have disabled it on gaming GeForce cards (like the 4090) for a while now. It works on Quadro.
-
Thanks for explaining it. But on my side it only happens if I edit the prompt during active image generation; it never happens if I don't edit the prompt. Once I found that out, I could reproduce it every time. So I think something else is going on. Maybe an edit allocates VRAM in some way and that triggers the fallback. But this is an amateur guess, as I am not a developer. Anyway, for me the issue is solved, as I can just leave the prompt alone. But maybe others suffer from the same issue and would like to try this as well.
-
Any downsides to sticking with 525.147.05 (cu118)? I'm on Debian stable, I need the nvidia-toolkit-dev package for other projects, and this version is the only one in the official repos (up to sid). My only other option, besides changing distro, would be using the latest CUDA toolkit 12.3 from NVIDIA, but then it would be too new for the PyTorch binaries (I guess to use that I'd need to compile manually).
-
Hey guys! Came here from a reddit discussion on the best driver for Kohya. Can you recommend which 4090 driver is best for LoRA training speed on Win11? I know the best option is to go Ubuntu, but I'll leave that for later. I'm new to the party and only get 2.50s/it on a 4090 with batch size 5, xformers, gradient checkpointing on, bucketing on, default Kohya settings. Seems too slow... :/
-
OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory 7.00 GiB is allocated by PyTorch, and 257.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I am not sure, is it the same type of problem?
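The hint at the end of that error message can be checked mechanically. Below is a rough sketch that pulls the requested size and the "reserved but unallocated" size out of the message text and compares them. The helper names are hypothetical, and treating "reserved but unallocated is at least as large as the request" as a sign of fragmentation is only a heuristic assumed for illustration, not a documented PyTorch rule.

```python
import os
import re

UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

# The error text quoted in the comment above, shortened to the relevant part.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total "
    "capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory "
    "7.00 GiB is allocated by PyTorch, and 257.02 MiB is reserved by PyTorch "
    "but unallocated."
)


def _to_bytes(value, unit):
    return int(float(value) * UNITS[unit])


def parse_oom(message):
    """Extract requested and reserved-but-unallocated sizes, or None if absent."""
    req = re.search(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)", message)
    spare = re.search(
        r"([\d.]+) (KiB|MiB|GiB) is reserved by PyTorch but unallocated", message
    )
    if not req or not spare:
        return None
    return {
        "requested": _to_bytes(*req.groups()),
        "reserved_unallocated": _to_bytes(*spare.groups()),
    }


def fragmentation_suspected(message):
    """Heuristic (assumption): the request would fit in the unused reservation."""
    info = parse_oom(message)
    return bool(info) and info["reserved_unallocated"] >= info["requested"]


def enable_expandable_segments():
    """Set the allocator option named in the error; call before importing torch."""
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


print(fragmentation_suspected(MESSAGE))
```

If the check comes back true, trying the `PYTORCH_CUDA_ALLOC_CONF` setting the message suggests is cheap. If it comes back false, the card is probably just out of VRAM for that workload.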
-
it seems that nVidia changed memory management in the latest versions of drivers, specifically 532 and 535
new behavior is that once gpu vram is exhausted, it will actually use shared memory thus causing massive slowdown - easily 10x
good side (hey, have to look at it that way as well) is that OOM is far less likely - but at the cost of a 10x performance drop? no chance
also, when spillover happens, memory pretty much spins out of control and returns to normal only after app restart (stopping generate does not do anything)
this feature is deep inside device drivers and completely outside of application control
even advanced gpu tuning utilities don't seem to have the capability to turn it on/off or tune it
i've checked release notes and there is no mention of it, but there are too many reports (not just sd community) to ignore
version 531 seems to be the last unaffected version
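since the driver behavior itself is outside application control, about all an app can do is watch how close it is to the vram limit. a minimal sketch, assuming pytorch is the framework in use; `vram_headroom`, `likely_to_spill` and `query_vram` are hypothetical helper names, and the check only reports when a planned allocation exceeds free vram, it cannot tell whether the driver will actually spill to shared memory.

```python
def vram_headroom(free_bytes, total_bytes):
    """Fraction of VRAM still free, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def likely_to_spill(free_bytes, request_bytes):
    """True when a planned allocation would not fit in currently free VRAM."""
    return request_bytes > free_bytes


def query_vram():
    """Return (free, total) bytes for the current CUDA device, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.mem_get_info()


info = query_vram()
if info is not None:
    free, total = info
    print(f"free vram: {vram_headroom(free, total):.0%}")
```

logging the headroom before each generate makes it easier to correlate a slowdown with vram actually being exhausted rather than with something else.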