
Misc. bug: llama-server with rpc oom's allocation even though plenty left on devices #11435

Open
lucyknada opened this issue Jan 26, 2025 · 8 comments

@lucyknada

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
version: 4561 (6f53d8a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc localhost:50052,localhost:50053,192.168.0.21:50052

Problem description & steps to reproduce

localhost: 24+12GB vram
192.168.0.21: 12GB vram

localhost: whether I run 2 RPC servers or 1 doesn't matter, it always OOMs; with 1 RPC server serving both local devices the allocation is even worse, only around 10% allocated

loading the 70B, it allocates:

9.414Gi/12.0Gi
17.162Gi/24.0Gi
--
4.745Gi/12.0Gi

then OOM-crashes after a bit:

llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1
llama_kv_cache_init: RPC[localhost:50052] KV buffer size =   384.00 MiB
llama_kv_cache_init: RPC[localhost:50053] KV buffer size =   736.00 MiB
llama_kv_cache_init: RPC[192.168.0.21:50052] KV buffer size =   384.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   736.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   320.00 MiB
llama_init_from_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
llama_init_from_model: failed to allocate compute buffers

even though there is more free space on the 24 GB device, and especially on the third 12 GB device

First Bad Commit

No response

Relevant log output

@rgerganov
Collaborator

You don't need to run RPC servers for local devices. Start an RPC server only on 192.168.0.21 and paste the logs from the main host and from the rpc-server on 192.168.0.21.
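
A sketch of that setup (same model path and port as in the command line above; the rpc-server flags are the ones from the RPC example README, so double-check them against your build):

# on 192.168.0.21 only: start the RPC server, listening on all interfaces
./rpc-server -H 0.0.0.0 -p 50052

# on the main host: the local CUDA devices are used directly, only the
# remote machine is added via --rpc
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc 192.168.0.21:50052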

@lucyknada
Author

that was indeed the issue, thanks! Possibly worth adding as a note to the RPC example README? I'm also still noticing a very weird usage pattern:

9.880/12.0Gi
17.393/24.0Gi
--
10.195/12.0Gi

is there a way to influence that, since the 24 GB card on the same node is not being used fully?

@rgerganov
Collaborator

is there a way to influence that, since the 24 GB card on the same node is not being used fully?

You can use --tensor-split to control how memory is split across devices.
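
For example (a minimal sketch; the three values are purely illustrative relative proportions, which llama.cpp normalizes and assigns to devices in the order it enumerates them, discussed further below):

# --tensor-split takes comma-separated proportions, one per device
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 \
    --rpc 192.168.0.21:50052 --tensor-split 3,2,1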

@lucyknada
Author

lucyknada commented Jan 28, 2025

indeed that works, thanks! Though with RPC the order seems unpredictable, and --list-devices does not show the RPC devices and their indices.

for example:

dev0: 24GB
dev1: 12GB
---
rpc-dev0: 12GB

setting it e.g. to 0.7,0.15,0.15 (per device, including RPC?) or 0.8,0.2 (maybe local then remote?) OOMs rpc-dev0 with what should have gone to dev0. I have tried many other values too and landed on a rather odd order:

rpc0, dev1, dev0

is that a bug or expected?

@rgerganov
Collaborator

You should put --rpc before --list-devices and --tensor-split, see #10609 (comment)
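
Concretely (a sketch; the point is the argument order):

# --rpc goes before --list-devices, so the RPC devices are already registered
# when the device list is printed; the same ordering applies with --tensor-split
./llama-server --rpc 192.168.0.21:50052 --list-devices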

@lucyknada
Author

thanks! Though the order displayed by --list-devices is still different from what --tensor-split expects. --list-devices output:

dev0
dev1
rpc0

but --tensor-split (after --rpc) maps them now to:

rpc0
dev0
dev1

is there any way, without tactically OOMing / brute-forcing, to get the device order llama.cpp actually uses with RPC for -ts?

--

the tensor-split ratios are also somewhat confusing to manage: what if I wanted to fill both dev0 and dev1 to their VRAM max and put only the rest onto rpc0? I've resorted to brute-forcing that too and ended up with subpar splits, because 12 GB of course cannot hold as large a share of the ratio as 24 GB

e.g. as opposed to something like a percentage syntax: 0.95, 0.95, 0.95 (use 95% of each device's VRAM)

that's supposed to be the default behavior of llama.cpp with just -ngl, but in reality it looks like it only uses 14 GB of the 24 GB

@rgerganov
Collaborator

is there any way, without tactically OOMing / brute-forcing, to get the device order llama.cpp actually uses with RPC for -ts?

We reorder the available devices before evaluating the model and put RPC devices first. This is a performance optimization, see #9296 for details. --list-devices doesn't respect that and lists devices in the order they are registered. There was an attempt at changing the device registration so that RPC devices come first in the registry, but it was rejected by @slaren here.

So I am not sure how to fix this.
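
Applied to the setup above (rpc0 = 12 GB, dev0 = 24 GB, dev1 = 12 GB) and assuming the rpc0, dev0, dev1 order reported earlier in this thread, a split proportional to raw VRAM would look roughly like this; treat the numbers as a starting point only, since the KV cache and compute buffers still need headroom on each device:

# proportions follow the effective order with the RPC device first:
# rpc0 (12 GB), dev0 (24 GB), dev1 (12 GB)
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 \
    --rpc 192.168.0.21:50052 --tensor-split 12,24,12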

@lucyknada
Author

lucyknada commented Jan 30, 2025

I see, makes sense. If this can't be properly implemented, maybe we could settle for at least a note in the RPC README, I suppose, but I've left a comment on the PR too to see if slaren has an idea.

also, would there be a way with RPC to fill the local devices first before offloading to the remote ones? Something akin to 95% usage on the local devices and only the overflow going to RPC.

rgerganov added a commit to rgerganov/llama.cpp that referenced this issue Feb 4, 2025
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref ggml-org#11435
rgerganov added a commit that referenced this issue Feb 4, 2025
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
tinglou pushed a commit to tinglou/llama.cpp that referenced this issue Feb 13, 2025
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref ggml-org#11435