Misc. bug: SYCL out of memory error #11044
Comments
Hi. Does running with a reduced context length work?
@qnixsynapse No it does not because the prompt contains ~40k tokens.
Try with -nkvo.
@qnixsynapse It works with -nkvo, although much slower (factor 3-4) than VULKAN. The main question remains: Why does it work with VULKAN, but not with SYCL*? *(unless switching off KV offloading, which makes it very slow)
@BenPortner My guess is that at >40,000 tokens the KV cache size + model size exceeds the memory that has been reserved for your iGPU (probably more than 6 GB), which is causing the OOM. I think you can check the available GPU memory with a program called GPU-Z on Windows and compare buffer usages (including model sizes) for both the SYCL and Vulkan backends from the logs.
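For reference, a rough back-of-the-envelope estimate of the KV cache footprint follows from the layer count, KV head count, head dimension, context length and cache element size. The sketch below uses placeholder values for a Llama-3-8B-style model with an f16 cache; these numbers are assumptions, not values taken from the logs in this issue.

```cpp
// Rough KV-cache size estimate (illustrative sketch; the model parameters below
// are assumed for a Llama-3-8B-style model, not read from this issue's logs).
#include <cstdio>

int main() {
    const long long n_layer    = 32;     // transformer layers (assumed)
    const long long n_kv_head  = 8;      // KV heads with GQA (assumed)
    const long long head_dim   = 128;    // per-head dimension (assumed)
    const long long n_ctx      = 40000;  // ~40k-token prompt mentioned in this thread
    const long long bytes_elem = 2;      // f16 cache entries

    // K and V caches: 2 * layers * context * kv_heads * head_dim * element size
    const long long kv_bytes = 2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_elem;
    std::printf("KV cache ~ %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```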
@qnixsynapse Thanks for the tip about GPU-Z. I will check it out. I still don't understand why the out of memory error occurs, though. Model + KV buffer amount to ~6.5 GB. The available iGPU memory is ~15 GB as per the log output. Both model and KV buffer should easily fit. Also, both model and KV buffer are the same size when running VULKAN. So why does it work with VULKAN but not with SYCL? It seems that something memory-inefficient happens within the SYCL backend, which causes the error.
@BenPortner The kernels are definitely unoptimized in SYCL. I have plans to optimize them this year if Intel is not interested in doing so.
Hi @qnixsynapse, It would be amazing to see more optimizations in the SYCL backend! It is the fastest backend on Intel iGPUs after all (with small contexts / when using KV buffer offloading). I'm not a C/C++ programmer, but let me know if I can help somehow. Perhaps I can open an issue in one of their repos? For this, we would have to be sure that the issue is not within llama.cpp/ggml though.
llama3-8b-int4 with a big context will take more than 16 GB of memory. Additionally, @qnixsynapse, I also plan to optimize the SYCL backend on Intel GPUs this year. Please go ahead if you want to. Thank you!
@BenPortner Thank you!
Sounds great. Would love to collaborate. Our first priority is to implement flash attention, which can reduce memory usage. The Vulkan backend currently has it with coopmat support.
Although I now better understand why this error occurs, I wouldn't call it resolved. Perhaps it is useful to keep this issue open for further discussion and coordination of the development tasks? I'll leave it up to you, though. You're the devs :)
Perhaps this could be relevant: https://forums.developer.nvidia.com/t/why-is-cl-device-max-mem-alloc-size-never-larger-than-25-of-cl-device-global-mem-size-only-on-nvidia/47745/10 TL;DR: The OpenCL standard somewhat arbitrarily states that CL_DEVICE_MAX_MEM_ALLOC_SIZE can never be larger than 1/4 of the actual GPU memory. "Developers can try to allocate more memory than CL_DEVICE_MAX_MEM_ALLOC_SIZE, but the successful allocation is not guaranteed (this is same for any allocation call). The developers should check for error returned by clCreateBuffer and use the allocation only if the call returns CL_SUCCESS". I know SYCL is not the same as OpenCL, but since both are defined by the Khronos Group, perhaps the underlying limitation is the same? If yes, it might be worth ignoring this artificial limitation?
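For anyone who wants to see what their device actually reports, a minimal check of the relevant limits might look like the sketch below. It only uses the standard SYCL 2020 device info queries and is not part of llama.cpp.

```cpp
// Minimal sketch: query the per-allocation limit vs. total device memory with SYCL 2020.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    auto dev = q.get_device();

    const auto max_alloc  = dev.get_info<sycl::info::device::max_mem_alloc_size>();
    const auto global_mem = dev.get_info<sycl::info::device::global_mem_size>();

    std::cout << "device:        " << dev.get_info<sycl::info::device::name>() << "\n"
              << "max_mem_alloc: " << max_alloc  / (1024.0 * 1024.0) << " MiB\n"
              << "global_mem:    " << global_mem / (1024.0 * 1024.0) << " MiB\n";
    return 0;
}
```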
That is not the problem here. In your case the model weights are successfully loaded into memory. The problem happens when the gemm_batch kernel tries to calculate the batched matrix multiplication. It ends up using too much memory, falls 568 MB short, and crashes. Normally, I would prefer half of its job to be dedicated to an optimised flash attention kernel. You can test my theory by passing …
Hi @qnixsynapse, I kept investigating and it seems that the 4 GB allocation limit is a problem for Intel+SYCL after all: intel/llvm#10946. If I understand the issue correctly, then even if you fix the batched matrix multiplication, I will eventually run into OOM if llama.cpp at any point tries to allocate a >4 GB buffer. Unless there are any safeguards implemented against this on your side?
@BenPortner Hi, that is my understanding as well: inside SYCL you cannot set the special allocation flags that the memory allocation calls need in order to pass such allocations down to the OpenCL/Level Zero backends, even if you can set the appropriate compiler flags for those backends. The problem only affects older Intel GPUs, integrated and discrete, based on the original Xe architecture; Ponte Vecchio seems to be the only exception. The issue appears to have been fixed with Battlemage/Lunar Lake Xe2 GPUs.
I'm still trying to wrap my head around things here. The fact that llama.cpp+VULKAN backend manages to allocate >4 GB buffers on my Tiger Lake iGPU just fine makes me think that this is not a hardware limitation. @simonlui Thanks for chiming in! You mention that the >4 GB buffer problem does not occur on newer Intel GPUs. Do you know why? Would they still require the "special allocation flags" to handle buffers >4 GB? @qnixsynapse Does the llama.cpp VULKAN backend somehow split buffers internally into <4 GB chunks before allocating them? If not, then it seems to me that the limitation is not imposed by the hardware but by the drivers or APIs.
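To illustrate what such splitting could look like in principle, here is a hedged sketch using SYCL USM. This is not how the llama.cpp Vulkan or SYCL backend is actually implemented; it only shows the idea of keeping each individual device allocation below the reported max_mem_alloc_size.

```cpp
// Hypothetical sketch: split one large logical buffer into several device
// allocations that each stay below the per-allocation limit.
// NOT taken from llama.cpp; purely illustrative.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <vector>

std::vector<void *> alloc_chunked(sycl::queue & q, size_t total_bytes, size_t max_chunk) {
    std::vector<void *> chunks;
    for (size_t off = 0; off < total_bytes; off += max_chunk) {
        const size_t sz = std::min(max_chunk, total_bytes - off);
        void * p = sycl::malloc_device(sz, q);  // each call stays under the limit
        if (p == nullptr) {
            for (void * c : chunks) sycl::free(c, q);  // roll back on failure
            return {};
        }
        chunks.push_back(p);
    }
    return chunks;
}
```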
@BenPortner From what I understand of what Intel engineers have said about this issue, it has to do with having native int64 functionality so that memory operations like addressing stay fast, and not wanting to take a performance hit from translation; this is a restriction inside their compute runtime/drivers. The int64 functionality seems to correspond to FP64 functionality, which was missing from Intel Xe except in HPC with Ponte Vecchio (hence why it seems unaffected), and I think some iGPUs after Alchemist had FP64 too. They have now re-implemented FP64 in hardware, which sidesteps this issue entirely.
The answers in this issue will help other users with the same problem. So, if we provide a workaround and it works in your case, please close the issue when possible. For further requirements, like wanting the SYCL backend to reach the same memory usage via flash attention, please create a feature issue to track it. Thank you!
Hello @NeoZhangJianyu I would like to turn this into a feature issue but unfortunately this is very hard for me as a user. I do not know the llama.cpp code well enough to locate problems. Furthermore, I don't know enough about LLM engines to propose improvements like the flash attention mechanism you mention. For me, llama.cpp is kind of a black box: I can throw inputs at it and compare the outputs. This is enough to report problems but not enough to create a feature ticket. That being said, I'll be happy if you or any of the involved devs turn this issue into a feature ticket :)
Name and Version
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
version: 4404 (0827b2c)
built with MSVC 19.42.34435.0
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)
Problem description & steps to reproduce
Problem
I run into memory errors when using the SYCL backend. No error appears when running the same setup with the VULKAN backend (same model, prompt, context length, batch size, etc.). In the example below, the error says that 568 MB could not be allocated. This is strange because I have 16 GB of GPU memory (shared system memory, not dedicated). It seems the error is not specific to llama-cli because it also occurs when I use the Python bindings (llama-cpp-python). The error also occurs in earlier versions (I tried b4311).
Hardware
Dell Latitude 5420
Windows 10 Enterprise
CPU: 11th Gen Intel i7-1185G7 @ 3.00GHz, 4 Cores, 8 Logical Processors x86_64
RAM: 2x16GB Hynix 3200MHz DDR4 PC4-25600
GPU: Intel Iris Xe iGPU
Storage: Western Digital PC SN530 NVMe WDC 512GB M.2 SSD
Minimal error example
First Bad Commit
No response
Relevant log output