Multi GPU with Vulkan out of memory issue. #5848
Does GPU0 also run a desktop? It might be a VRAM fragmentation issue then. Can you upload the output of …? I guess the error message about OOM should also contain the device that failed.
Both GPUs have monitors plugged into them.
That's gonna be a little harder to debug, since we don't even know which GPU is running out of memory. If you know a little bit about C++, we could add the GPU to the OOM error message and get that information. Apart from that, you can also run the program with validation layers enabled by building with …. Let me know if you want me to provide the patch for the OOM error message or more info about the other steps.
Thank you for taking the time to help me. Here's the output from a run with …
As for the custom error message, I have no idea how I would do that.
Thank you. As a sidenote, Mixtral is not yet supported on Vulkan, but in this case it's crashing before that would be a problem. Something is not adding up with these allocation sizes. In your first q5_k_m example it manages to allocate the two compute buffers of size 113 and 139 MB, but fails a later allocation. In your debug outputs it tries to allocate a 1.3GB compute buffer, which is way too large. In my attempts to reproduce it, it only tried to allocate around 320MB per GPU. Something is going wrong there. But I also see that's an IQ4 quant, and imatrix quants are also not yet supported on Vulkan. Can you reproduce the issue with a combination of model and quant that's supported? For example, I think a Yi-34B q5_k or q6_k should be around this size and is supported.
Running Dolphin 2.2 Yi 34b 200k. I also ran one that worked, just in case this could be useful.
- Worked: Q4_K_S (19,598,649,632 bytes), dbg_Q4_K_S.txt.gz (the text file is 232MB :) )
- Q5_K_S (23,707,691,296 bytes), dbg_Q5_K_S.txt

I have done these tests with as few other processes running as possible, to try to limit VRAM usage from anything that may be using HW acceleration.
Apologies for not getting back to you sooner, I was too busy last week. Your logs show that the size of the dequant buffer is the problem here. Because I didn't have proper matmul dequant shaders for the k-quants yet (and also didn't update the buffer size logic yet), they use quite a bit of VRAM. Too much for your setup with q5_k and q6_k. Good news is that I have now implemented the k-quant matmul shaders and will update the buffer size logic to take this into account. That should save you a few hundred megabytes of VRAM and hopefully solve this issue. I'll let you know when you can test this.
@lastrosade Please check if #6155 fixes your problem.
No biggie. What (I think) should be working still does not work. Same model as before: Q6_K (28,213,926,176 bytes), dbg_Q6_K.txt
That is still the old code. The new code even had an issue that would have prevented you from building it; I fixed that now. The PR is not merged yet. Do you know how to check out the feature branch, or do you need help with that?
Sorry, didn't see that it wasn't upstream. So I built it the same way as before, by including …. Q4_K_S worked as it did before.
I think it's the 7900 XT that's running out of memory in q6_k. I added info about which device is allocating to the debug output; can you run q5 and q6 again? No need to let q5 run through, just the prompt processing is enough. We can then figure out how much memory it tried to allocate before running out.
Sorry for the wait, had IRL issues and forgot to check back. Reran Dolphin 2.2 Yi 34b 200k with a 4096-token context, on commit 1fceeb9.
- Q5_K_S with --n-predict 12: dbg_Q5_K_S_small.txt (20MB)
- Q6_K with --n-predict 12: dbg_Q6_K_small.txt
I wrote a small script to evaluate the VRAM use in those debug outputs. When I run the q6_k model on two of my GPUs (RTX 3090 24GB and Radeon Pro VII 16GB) it ends up at 18.7GiB on the 3090 and 9.7GiB on the Pro VII. During inference your q5_k_s output goes up to 16GiB and 10.1GiB, and my q6_k output goes up to 18.7GiB and 11.4GiB. I suppose this is just too close to the VRAM limits, together with some VRAM fragmentation from using the GPUs for other tasks. That's about as much information as I can get out of that. The only thing left to try is setting the environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to a smaller value, to limit the size of individual allocations.
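(The script itself isn't shown in the thread; below is only a minimal sketch of the kind of per-device tally it might do. The log lines it parses — `ggml_vk_create_buffer`/`ggml_vk_destroy_buffer` entries with `device` and `size` fields — are an assumption about the GGML_VULKAN_DEBUG output and would likely need adjusting to the real format.)

```python
import re
import sys
from collections import defaultdict

# Assumed allocation/free line shapes; adjust these to the real debug output.
ALLOC_RE = re.compile(r"ggml_vk_create_buffer.*?device (\d+).*?size (\d+)")
FREE_RE = re.compile(r"ggml_vk_destroy_buffer.*?device (\d+).*?size (\d+)")

def vram_peaks(path):
    current = defaultdict(int)  # bytes currently allocated per device
    peak = defaultdict(int)     # high-water mark per device
    with open(path, errors="replace") as f:
        for line in f:
            m = ALLOC_RE.search(line)
            if m:
                dev, size = int(m[1]), int(m[2])
                current[dev] += size
                peak[dev] = max(peak[dev], current[dev])
            else:
                m = FREE_RE.search(line)
                if m:
                    current[int(m[1])] -= int(m[2])
    return peak

if __name__ == "__main__":
    for dev, peak_bytes in sorted(vram_peaks(sys.argv[1]).items()):
        print(f"device {dev}: peak {peak_bytes / 2**30:.2f} GiB")
```

Run as e.g. `python vram_peaks.py dbg_Q6_K_small.txt` (filename from the thread).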
Welp, thank you for your time.
It was closed automatically when the PR was merged. Feel free to reopen the issue if you don't think it's resolved yet.
I cannot reopen the issue: https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github
Setting GGML_VK_FORCE_MAX_ALLOCATION_SIZE to 268435456 (256 MiB) doesn't appear to have done much, but idk, I can't really tell. I consider the issue unsolved, since I cannot run the model even though a 20/11 split should technically fit. But maybe I misunderstand how this works and there's a base level of overhead or something.
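(For reference, one way to scope that cap to a single run; this is only a sketch — the binary path and model filename are placeholders, while the flags are the ones already used in this thread:)

```python
import os
import subprocess

# Cap individual Vulkan allocations at 256 MiB for this run only.
env = dict(os.environ, GGML_VK_FORCE_MAX_ALLOCATION_SIZE="268435456")

subprocess.run(
    ["./main",                           # placeholder path to the llama.cpp binary
     "-m", "dolphin-yi-34b.Q6_K.gguf",   # placeholder model file
     "-ts", "20,11", "-c", "4096", "--n-predict", "12"],
    env=env,
    check=True,
)
```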
Sorry about that, I didn't know that depends on who closed the issue. I think it's a case of memory fragmentation, and it would work if you ran it without a GUI running on the GPUs. But depending on your setup, that might be difficult to try.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Running llama.cpp #5832 (9731134).
I'm trying to load a model on two GPUs with Vulkan. My GPUs have 20 and 11 gigs of VRAM.
Loading a Q6_K quant of size 26.27 GiB (6.56 BPW) with -ts "20,11" -c 512 yields an out-of-memory error. The math doesn't seem to add up.
A Q5_K_M quant at 22.65 GiB (5.66 BPW) works perfectly fine until I increase the context to 4096. This can't possibly be the context, right? When using HIP on smaller models I have to push it much harder to OOM; I should be fine with 31GB of VRAM.
Any idea why this happens?
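(A rough weights-only split for the numbers reported above — just a sketch that glosses over GB vs GiB and ignores the KV cache, compute/dequant buffers, and desktop usage, which the comments above identify as the missing overhead:)

```python
model_gib = 26.27   # reported Q6_K size
vram = (20, 11)     # per-GPU VRAM in "gigs"; -ts "20,11" uses the same ratio

for gpu_vram in vram:
    weights = model_gib * gpu_vram / sum(vram)
    print(f"{gpu_vram} GB GPU: ~{weights:.2f} GiB of weights, "
          f"~{gpu_vram - weights:.2f} left for KV cache, compute buffers, desktop")
```

This prints roughly 16.95 GiB of weights on the larger card and 9.32 GiB on the smaller one, leaving only about 1.7 gigs of headroom on the 11 GB card — so a compute/dequant buffer in the hundreds of megabytes plus whatever the desktop already holds can plausibly tip it over.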