Misc. bug: SYCL out of memory error #11044
Comments
Hi. Does running with a reduced context length work?
@qnixsynapse No it does not because the prompt contains ~40k tokens.
Try with -nkvo.
@qnixsynapse It works with -nkvo, although much slower (factor 3-4) than VULKAN. The main question remains: Why does it work with VULKAN, but not with SYCL*? *(unless switching off KV offloading, which makes it very slow)
@BenPortner My guess is that at >40,000 tokens the KV cache size + model size exceeds the memory that has been reserved for your iGPU (probably more than 6 GB), which is causing the OOM. I think you can check the available GPU memory with a program called GPU-Z on Windows and compare buffer usages (including model sizes) for both the SYCL and Vulkan backends from the logs.
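For reference, a rough back-of-the-envelope estimate of the KV cache footprint follows from the layer count, KV head count, head dimension, context length and cache element size. The sketch below uses placeholder values for a Llama-3-8B-style model with an f16 cache; these numbers are assumptions, not values taken from the logs in this issue.

```cpp
// Rough KV-cache size estimate (illustrative sketch; the model parameters below
// are assumed for a Llama-3-8B-style model, not read from this issue's logs).
#include <cstdio>

int main() {
    const long long n_layer    = 32;     // transformer layers (assumed)
    const long long n_kv_head  = 8;      // KV heads with GQA (assumed)
    const long long head_dim   = 128;    // per-head dimension (assumed)
    const long long n_ctx      = 40000;  // ~40k-token prompt mentioned in this thread
    const long long bytes_elem = 2;      // f16 cache entries

    // K and V caches: 2 * layers * context * kv_heads * head_dim * element size
    const long long kv_bytes = 2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_elem;
    std::printf("KV cache ~ %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```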
@qnixsynapse Thanks for the tip about GPU-Z. I will check it out. I still don't understand why the out of memory error occurs, though. Model + KV buffer amount to ~6.5 GB. The available iGPU memory is ~15 GB as per the log output. Both model and KV buffer should easily fit. Also, both model and KV buffer are the same size when running VULKAN. So why does it work with VULKAN but not with SYCL? It seems that something memory-inefficient happens within the SYCL backend, which causes the error.
@BenPortner The kernels are definitely unoptimized in SYCL. I have plans to optimize them this year if Intel is not interested in doing so.
Hi @qnixsynapse, It would be amazing to see more optimizations in the SYCL backend! It is the fastest backend on Intel iGPUs after all (with small contexts / when using KV buffer offloading). I'm not a C/C++ programmer, but let me know if I can help somehow. Perhaps I can open an issue in one of their repos? For this, we would have to be sure that the issue is not within llama.cpp/ggml though.
llama3-8b-int4 with a big context will take more than 16 GB of memory. Additionally, @qnixsynapse, I also plan to optimize the SYCL backend on Intel GPUs this year. Please go ahead if you want to. Thank you!
@BenPortner Thank you!
Sounds great. Would love to collaborate. Our first priority is to implement flash attention, which can reduce memory usage. The Vulkan backend currently has it with coopmat support.
Although I now better understand why this error occurs, I wouldn't call it resolved. Perhaps it is useful to keep this issue open for further discussion and coordination of the development tasks? I'll leave it up to you, though. You're the devs :)
Perhaps this could be relevant: https://forums.developer.nvidia.com/t/why-is-cl-device-max-mem-alloc-size-never-larger-than-25-of-cl-device-global-mem-size-only-on-nvidia/47745/10 TL;DR: The OpenCL standard somewhat arbitrarily states that CL_DEVICE_MAX_MEM_ALLOC_SIZE can never be larger than 1/4 of the actual GPU memory. "Developers can try to allocate more memory than CL_DEVICE_MAX_MEM_ALLOC_SIZE, but the successful allocation is not guaranteed (this is same for any allocation call). The developers should check for error returned by clCreateBuffer and use the allocation only if the call returns CL_SUCCESS". I know SYCL is not the same as OpenCL, but since both are defined by the Khronos Group, perhaps the underlying limitation is the same? If yes, it might be worth ignoring this artificial limitation?
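For anyone who wants to see what their device actually reports, a minimal check of the relevant limits might look like the sketch below. It only uses the standard SYCL 2020 device info queries and is not part of llama.cpp.

```cpp
// Minimal sketch: query the per-allocation limit vs. total device memory with SYCL 2020.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    auto dev = q.get_device();

    const auto max_alloc  = dev.get_info<sycl::info::device::max_mem_alloc_size>();
    const auto global_mem = dev.get_info<sycl::info::device::global_mem_size>();

    std::cout << "device:        " << dev.get_info<sycl::info::device::name>() << "\n"
              << "max_mem_alloc: " << max_alloc  / (1024.0 * 1024.0) << " MiB\n"
              << "global_mem:    " << global_mem / (1024.0 * 1024.0) << " MiB\n";
    return 0;
}
```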
That is not the problem here. In your case the model weights are successfully loaded into memory. The problem happens when the gemm_batch kernel tries to calculate the batched matrix multiplication. It ends up using too much memory, falls 568 MB short, and crashes. Normally, I would prefer half of its job to be dedicated to an optimised flash attention kernel. You can test my theory by passing …
Hi @qnixsynapse, I kept investigating and it seems that the 4 GB allocation limit is a problem for Intel+SYCL after all: intel/llvm#10946. If I understand the issue correctly, then even if you fix the batched matrix multiplication, I will eventually run into OOM if llama.cpp at any point tries to allocate a >4 GB buffer. Unless there are any safeguards implemented against this on your side?
@BenPortner Hi, that is my understanding as well: inside SYCL you cannot set the special allocation flags that the memory allocation calls need in order to pass such allocations down to the OpenCL/Level Zero backends, even if you can set the appropriate compiler flags for those backends. The problem only affects older Intel GPUs, integrated and discrete, based on the original Xe architecture; Ponte Vecchio seems to be the only exception. The issue appears to have been fixed with Battlemage/Lunar Lake Xe2 GPUs.
I'm still trying to wrap my head around things here. The fact that llama.cpp+VULKAN backend manages to allocate >4 GB buffers on my Tiger Lake iGPU just fine makes me think that this is not a hardware limitation. @simonlui Thanks for chiming in! You mention that the >4 GB buffer problem does not occur on newer Intel GPUs. Do you know why? Would they still require the "special allocation flags" to handle buffers >4 GB? @qnixsynapse Does the llama.cpp VULKAN backend somehow split buffers internally into <4 GB chunks before allocating them? If not, then it seems to me that the limitation is not imposed by the hardware but by the drivers or APIs.
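To illustrate what such splitting could look like in principle, here is a hedged sketch using SYCL USM. This is not how the llama.cpp Vulkan or SYCL backend is actually implemented; it only shows the idea of keeping each individual device allocation below the reported max_mem_alloc_size.

```cpp
// Hypothetical sketch: split one large logical buffer into several device
// allocations that each stay below the per-allocation limit.
// NOT taken from llama.cpp; purely illustrative.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <vector>

std::vector<void *> alloc_chunked(sycl::queue & q, size_t total_bytes, size_t max_chunk) {
    std::vector<void *> chunks;
    for (size_t off = 0; off < total_bytes; off += max_chunk) {
        const size_t sz = std::min(max_chunk, total_bytes - off);
        void * p = sycl::malloc_device(sz, q);  // each call stays under the limit
        if (p == nullptr) {
            for (void * c : chunks) sycl::free(c, q);  // roll back on failure
            return {};
        }
        chunks.push_back(p);
    }
    return chunks;
}
```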
@BenPortner From what I understand of what Intel engineers have said about this issue, it has to do with having native int64 functionality so that memory operations like addressing stay fast, and not wanting to take a performance hit from translation; this is a restriction inside their compute runtime/drivers. The int64 functionality seems to correspond to FP64 functionality, which was missing from Intel Xe except in HPC with Ponte Vecchio (hence why it seems unaffected), and I think some iGPUs after Alchemist had FP64 too. They have now re-implemented FP64 in hardware, which sidesteps this issue entirely.
The answers in this issue will help other users with the same problem. So, if we provide a workaround and it works in your case, please close the issue when possible. For further requirements, like wanting the SYCL backend to reach the same memory usage via flash attention, please create a feature issue to track it. Thank you!
Hello @NeoZhangJianyu I would like to turn this into a feature issue but unfortunately this is very hard for me as a user. I do not know the llama.cpp code well enough to locate problems. Furthermore, I don't know enough about LLM engines to propose improvements like the flash attention mechanism you mention. For me, llama.cpp is kind of a black box: I can throw inputs at it and compare the outputs. This is enough to report problems but not enough to create a feature ticket. That being said, I'll be happy if you or any of the involved devs turn this issue into a feature ticket :)
Name and Version
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
version: 4404 (0827b2c)
built with MSVC 19.42.34435.0
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)
Problem description & steps to reproduce
Problem
I run into memory errors when using the SYCL backend. No error appears when running the same setup with the VULKAN backend (same model, prompt, context length, batch size, etc.). In the example below, the error says that 568 MB could not be allocated. This is strange because I have 16 GB of GPU memory (shared system memory, not dedicated). It seems the error is not specific to llama-cli because it also occurs when I use the Python bindings (llama-cpp-python). The error also occurs in earlier versions (I tried b4311).
Hardware
Dell Latitude 5420
Windows 10 Enterprise
CPU: 11th Gen Intel i7-1185G7 @ 3.00GHz, 4 Cores, 8 Logical Processors x86_64
RAM: 2x16GB Hynix 3200MHz DDR4 PC4-25600
GPU: Intel Iris Xe iGPU
Storage: Western Digital PC SN530 NVMe WDC 512GB M.2 SSD
Minimal error example
First Bad Commit
No response
Relevant log output