Multi GPU with Vulkan out of memory issue. #5848

Closed · lastrosade opened this issue Mar 3, 2024 · 19 comments · Fixed by #6155

Assignees: 0cc4m
Labels: bug-unconfirmed, stale, Vulkan (Issues specific to the Vulkan backend)

Comments

@lastrosade

Running llama.cpp #5832 (9731134)

I'm trying to load a model on two GPUs with Vulkan.

My GPUs have 20 and 11 gigs of VRAM.

Loading a Q6_K quant of size 26.27 GiB (6.56 BPW) with -ts "20,11" -c 512 yields:

ggml ctx size =    0.62 MiB
offloading 60 repeating layers to GPU
offloading non-repeating layers to GPU
offloaded 61/61 layers to GPU
   Vulkan0 buffer size = 17458.44 MiB
   Vulkan1 buffer size =  9088.14 MiB
       CPU buffer size =   358.90 MiB

Vulkan0 KV buffer size =    80.00 MiB
Vulkan1 KV buffer size =    40.00 MiB

KV self size  =  120.00 MiB, K (f16):   60.00 MiB, V (f16):   60.00 MiB
Vulkan_Host input buffer size   =    16.01 MiB
   Vulkan0 compute buffer size =   113.00 MiB
   Vulkan1 compute buffer size =   139.00 MiB
Vulkan_Host compute buffer size =    14.00 MiB

ggml_vulkan: Device memory allocation of size 120422400 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory

The math doesn't seem to add up.

A Q5_K_M quant at 22.65 GiB (5.66 BPW) works perfectly fine until I increase the context to 4096.

This can't possibly be context, right? When using HIP on smaller models, I have to push it much harder to OOM; I should be fine with 31 GB of VRAM.
Any idea why this happens?

@0cc4m 0cc4m self-assigned this Mar 3, 2024
@0cc4m 0cc4m added the Vulkan Issues specific to the Vulkan backend label Mar 5, 2024
@0cc4m (Collaborator) commented Mar 5, 2024

Does GPU0 also run a desktop? If so, it might be an issue with VRAM fragmentation. Can you upload the output of vulkaninfo?

The OOM error message should probably also name the device that failed.

@lastrosade (Author)

Both GPUs have monitors plugged into them.

vulkaninfo.txt

@0cc4m (Collaborator) commented Mar 6, 2024

That's going to be a little harder to debug, since we don't even know which GPU is running out of memory. If you know a little bit about C++, we could add the GPU to the OOM error message and get that information.

Apart from that, you can also run the program with validation layers enabled by building with LLAMA_VULKAN_VALIDATE=1 and see if that reports any validation issues before the OOM (though I think that's unlikely to be the problem). You can also build with LLAMA_VULKAN_DEBUG=1, which will print everything the Vulkan part of the program does to your console. This is extremely verbose, so you should probably just pipe it into a file and upload it for me to take a look at.

Let me know if you want me to provide the patch for the OOM error message or more info about the other steps.
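For reference, such a patch could look roughly like the sketch below. This is only an illustration: the function, the vk_device_handle struct, and its fields are hypothetical stand-ins for whatever the real allocation path in ggml-vulkan.cpp has in scope; only the first error line mirrors the actual log output above.

```cpp
// Hypothetical sketch: report which device an allocation failed on.
// vk_device_handle and its members are illustrative, not the real ggml-vulkan types.
#include <cstddef>
#include <iostream>
#include <string>
#include <vulkan/vulkan.hpp>

struct vk_device_handle {
    vk::Device  logical; // logical device used for allocations
    std::string name;    // e.g. from vk::PhysicalDeviceProperties::deviceName
    size_t      index;   // backend device index (Vulkan0, Vulkan1, ...)
};

static vk::DeviceMemory alloc_device_memory(vk_device_handle & device,
                                            const vk::MemoryAllocateInfo & alloc_info) {
    try {
        return device.logical.allocateMemory(alloc_info);
    } catch (const vk::SystemError & e) {
        // Same message as the existing log line, plus the failing device.
        std::cerr << "ggml_vulkan: Device memory allocation of size "
                  << alloc_info.allocationSize << " failed on Vulkan" << device.index
                  << " (" << device.name << ")." << std::endl;
        std::cerr << "ggml_vulkan: " << e.what() << std::endl;
        throw;
    }
}
```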

@lastrosade (Author) commented Mar 7, 2024

Thank you for taking the time to help me.

Here's the output from a run with LLAMA_VULKAN_DEBUG=1
dbg.txt
And one with both LLAMA_VULKAN_DEBUG=1 and LLAMA_VULKAN_VALIDATE=1
dbg.txt
The model is an 8x7b of size 23.6 GB.

It fails to allocate 1.3 gigs of VRAM, even though I can easily fill my 1080 Ti to about 11.8 gigs with cuBLAS, or my 7900 XT up to 20 gigs with HIP, without issues.
Or is that additive? idk.

As for the custom error message, I have no idea how I would do that.

@0cc4m (Collaborator) commented Mar 9, 2024

Thank you. As a side note, Mixtral is not yet supported on Vulkan, but in this case it's crashing before that would be a problem.

Something is not adding up with these allocation sizes. In your first q5_k_m example, it manages to allocate the two compute buffers of 113 and 139 MB, but fails a later allocation.

In your debug outputs it tries to allocate a 1.3GB compute buffer, which is way too large. In my attempts to reproduce it, it only tried to allocate around 320MB per GPU. Something is going wrong there.

But I also see that's an IQ4 quant, and imatrix quants are also not yet supported on Vulkan. Can you reproduce the issue with a combination of model and quant that is supported? For example, I think a Yi-34B q5k or q6k should be around this size and is supported.

@lastrosade (Author)

Running Dolphin 2.2 Yi 34b 200k

I ran one that worked, just in case this could be useful.

Worked: Q4_K_S, (19,598,649,632 bytes), dbg_Q4_K_S.txt.gz (The text file is 232MB :) )

Q5_K_S, (23,707,691,296 bytes) dbg_Q5_K_S.txt
Q6_K, (28,213,926,176 bytes) dbg_Q6_K.txt

I ran these tests with as few other processes running as possible, to limit VRAM usage from anything that might be using HW acceleration.

@0cc4m (Collaborator) commented Mar 19, 2024

Apologies for not getting back to you sooner; I was too busy last week. Your logs show that the size of the dequant buffer is the problem here. Because I didn't have proper matmul dequant shaders for the k-quants yet (and also hadn't updated the buffer size logic), they use quite a bit of VRAM, too much for your setup with q5_k and q6_k.

The good news is that I have now implemented the k-quant matmul shaders and will update the buffer size logic to take this into account. That should save you a few hundred megabytes of VRAM and hopefully solve this issue. I'll let you know when you can test this.

@0cc4m (Collaborator) commented Mar 19, 2024

@lastrosade Please check if #6155 fixes your problem.

@lastrosade (Author)

> Apologies for not getting back to you sooner

No biggie.

What (I think) should be working still does not work, same model as before:

Q6_K, (28,213,926,176 bytes) dbg_Q6_K.txt

@0cc4m (Collaborator) commented Mar 19, 2024

> Apologies for not getting back to you sooner
>
> No biggie.
>
> What (I think) should be working still does not work, same model as before:
>
> Q6_K, (28,213,926,176 bytes) dbg_Q6_K.txt

That is still the old code. The new code even had an issue that would have prevented you from building it. I fixed that now.

The PR is not merged yet; do you know how to check out the feature branch, or do you need help with that?

@lastrosade (Author) commented Mar 19, 2024

Sorry, didn't see that it wasn't upstream.

So I built it the same way as before, including -DLLAMA_VULKAN_DEBUG=1 -DLLAMA_VULKAN_VALIDATE=1, this time from the 0cc4m/vulkan-improvements branch.

Q4_K_S worked as it did before.
Q5_K_S now works! dbg_Q5_K_S.txt (I prematurely stopped it mid-generation.)
Q6_K does not work: dbg_Q6_K.txt
I should definitely have enough memory left for Q6_K to work, but I can't really tell how much space 4K of context actually takes.
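For a rough sense of scale: the f16 KV cache grows linearly with context, so the 120 MiB reported at -c 512 in the opening log implies roughly 960 MiB at 4096 over the same 60 offloaded layers. A small sketch of that arithmetic, with the per-layer KV width inferred from those reported numbers rather than read from the model file:

```cpp
// Rough KV-cache arithmetic, assuming f16 K and V as in the logs above.
// n_embd_kv is inferred from the reported 120 MiB at n_ctx = 512 over 60 layers,
// not taken from the model file.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t n_layer   = 60;   // offloaded repeating layers (from the log)
    const uint64_t n_embd_kv = 1024; // per-layer K/V width implied by 120 MiB @ 512 ctx
    const uint64_t f16_bytes = 2;

    for (uint64_t n_ctx : {uint64_t(512), uint64_t(4096)}) {
        const uint64_t kv_bytes = 2 /* K and V */ * n_layer * n_ctx * n_embd_kv * f16_bytes;
        std::printf("n_ctx = %4llu -> KV cache ~ %.0f MiB\n",
                    (unsigned long long) n_ctx, kv_bytes / (1024.0 * 1024.0));
    }
    return 0; // prints ~120 MiB for 512 and ~960 MiB for 4096
}
```

So the 4096-token context itself only accounts for roughly 1 GiB, split across the two GPUs in proportion to the layer split; the compute and dequant buffers grow on top of that.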

@0cc4m (Collaborator) commented Mar 20, 2024

I think it's the 7900 XT that's running out of memory with q6_k. I added information about which device is allocating to the debug output; can you run q5 and q6 again? No need to let q5 run through, just the prompt processing is enough. We can then figure out how much memory it tried to allocate before running out.

@lastrosade (Author) commented Mar 23, 2024

Sorry for the wait, had IRL issues and forgot to check back.

Reran Dolphin 2.2 Yi 34b 200k, 4096 token context. With commit 1fceeb9.

Q5_K_S With --n-predict 12: dbg_Q5_K_S_small.txt (20MB)

Q6_K With --n-predict 12: dbg_Q6_K_small.txt

@0cc4m (Collaborator) commented Mar 29, 2024

> Sorry for the wait, had IRL issues and forgot to check back.
>
> Reran Dolphin 2.2 Yi 34b 200k, 4096 token context. With commit 1fceeb9.
>
> Q5_K_S With --n-predict 12: dbg_Q5_K_S_small.txt (20MB)
>
> Q6_K With --n-predict 12: dbg_Q6_K_small.txt

I wrote a small script to evaluate the VRAM use in those debug outputs:
The q5_k_s example used approximately 16GiB on the 7900 XT and 8.3GiB on the 1080 Ti when it began prompt processing.
The q6_k example used 17.7GiB on the 7900 XT and 9.2GiB on the 1080 Ti when it crashed trying to allocate 512MiB on the 7900 XT. It would then have tried to allocate another 512MiB buffer on the 1080 Ti.

When I run the q6_k model on two of my GPUs (RTX 3090 24GB and Radeon Pro VII 16GB) it ends up at 18.7GiB on the 3090 and 9.7GiB on the Pro VII.

During inference, your q5_k_s output goes up to 16GiB and 10.1GiB; my q6_k output goes up to 18.7GiB and 11.4GiB.

I suppose this is just too close to the VRAM limitations, together with some VRAM fragmentation from using the GPUs for other tasks. That's about as much information as I can get out of that.

The only thing left to try is setting the environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to something smaller (like 536870912 for 512MiB or 268435456 for 256MiB) and seeing if that lets you squeeze a little more into the VRAM by allocating smaller blocks.
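The idea behind that variable is to cap the size of any single device allocation so that large buffers are split into several smaller blocks, which are easier to place in fragmented VRAM. A minimal sketch of that mechanism (illustrative only, not the actual ggml-vulkan implementation):

```cpp
// Illustrative sketch of an allocation-size cap taken from the environment.
// The effective limit is the smaller of the device's reported maximum and the
// value forced via GGML_VK_FORCE_MAX_ALLOCATION_SIZE (in bytes).
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <string>

static uint64_t effective_max_allocation(uint64_t device_max_allocation) {
    uint64_t limit = device_max_allocation;
    if (const char * env = std::getenv("GGML_VK_FORCE_MAX_ALLOCATION_SIZE")) {
        limit = std::min<uint64_t>(limit, std::stoull(env));
    }
    return limit;
}

// A buffer larger than the limit is then carved into ceil(size / limit) blocks,
// e.g. a 1.3 GiB allocation becomes six blocks with a 268435456 (256 MiB) cap.
static uint64_t num_blocks(uint64_t buffer_size, uint64_t limit) {
    return (buffer_size + limit - 1) / limit;
}
```

Setting it smaller trades a few extra allocation calls for a better chance of fitting into already fragmented memory.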

@lastrosade (Author)

Welp, thank you for your time.

@0cc4m (Collaborator) commented Mar 30, 2024

> Welp, thank you for your time.

It was automatically closed by merging the PR. Feel free to reopen the issue if you don't think it's resolved yet.

@lastrosade (Author) commented Mar 31, 2024

I cannot reopen the issue. https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github

Setting GGML_VK_FORCE_MAX_ALLOCATION_SIZE to 268435456 doesn't appear to have done much, but I can't really tell.

I consider the issue unsolved, since I cannot run the model even though a 20/11 split should technically fit. But maybe I misunderstand how this works and there's a base level of overhead or something.

@0cc4m 0cc4m reopened this Mar 31, 2024
@0cc4m (Collaborator) commented Mar 31, 2024

> I cannot reopen the issue. https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github
>
> Setting GGML_VK_FORCE_MAX_ALLOCATION_SIZE to 268435456 doesn't appear to have done much, but I can't really tell.
>
> I consider the issue unsolved, since I cannot run the model even though a 20/11 split should technically fit. But maybe I misunderstand how this works and there's a base level of overhead or something.

Sorry about that, I didn't know that depends on who closed the issue.

I think it's a case of memory fragmentation, and it would work if you ran it without a GUI running on the GPUs. But depending on your setup, that might be difficult to try.

@github-actions github-actions bot added the stale label May 1, 2024
@github-actions (Contributor)

This issue was closed because it has been inactive for 14 days since being marked as stale.
