
Misc. bug: llama-server with rpc oom's allocation even though plenty left on devices #11435

Open
lucyknada opened this issue Jan 26, 2025 · 8 comments

@lucyknada

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
version: 4561 (6f53d8a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc localhost:50052,localhost:50053,192.168.0.21:50052

Problem description & steps to reproduce

localhost: 24+12GB vram
192.168.0.21: 12GB vram

localhost: whether I run 2 RPC servers or 1 doesn't matter, it always OOMs; with 1 RPC server serving both local devices the allocation is even worse, only around 10% allocated

loading the 70B, it allocates:

9.414Gi/12.0Gi
17.162Gi/24.0Gi
--
4.745Gi/12.0Gi

then OOM-crashes after a bit:

llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1
llama_kv_cache_init: RPC[localhost:50052] KV buffer size =   384.00 MiB
llama_kv_cache_init: RPC[localhost:50053] KV buffer size =   736.00 MiB
llama_kv_cache_init: RPC[192.168.0.21:50052] KV buffer size =   384.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   736.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   320.00 MiB
llama_init_from_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1104.00 MiB on device 1: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 1157632000
llama_init_from_model: failed to allocate compute buffers

even though there is more free space on the 24 GB device, and especially on the third 12 GB device

First Bad Commit

No response

Relevant log output

@rgerganov
Collaborator

You don't need to run RPC servers for local devices. Start an RPC server only on 192.168.0.21 and paste the logs from the main host and from the rpc-server on 192.168.0.21.
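
A sketch of that setup (same model path and port as in the command line above; the rpc-server flags are the ones from the RPC example README, so double-check them against your build):

# on 192.168.0.21 only: start the RPC server, listening on all interfaces
./rpc-server -H 0.0.0.0 -p 50052

# on the main host: the local CUDA devices are used directly, only the
# remote machine is added via --rpc
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 --rpc 192.168.0.21:50052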

@lucyknada
Author

that was indeed the issue, thanks! Possibly worth adding as a note to the RPC example README? I'm also still noticing a very weird usage pattern:

9.880/12.0Gi
17.393/24.0Gi
--
10.195/12.0Gi

is there a way to influence that, since the 24 GB card on the same node is not being used fully?

@rgerganov
Collaborator

is there a way to influence that, since the 24 GB card on the same node is not being used fully?

You can use --tensor-split to control how memory is split across devices.
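
For example (a minimal sketch; the three values are purely illustrative relative proportions, which llama.cpp normalizes and assigns to devices in the order it enumerates them, discussed further below):

# --tensor-split takes comma-separated proportions, one per device
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 \
    --rpc 192.168.0.21:50052 --tensor-split 3,2,1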

@lucyknada
Author

lucyknada commented Jan 28, 2025

indeed that works, thanks! Though with RPC the order seems unpredictable, and --list-devices does not show the RPC devices and their indices.

for example:

dev0: 24GB
dev1: 12GB
---
rpc-dev0: 12GB

setting it e.g. to 0.7,0.15,0.15 (per device, including RPC?) or 0.8,0.2 (maybe local then remote?) OOMs rpc-dev0 with what should have gone to dev0. I have tried many other values too and landed on a rather odd order:

rpc0, dev1, dev0

is that a bug or expected?

@rgerganov
Collaborator

You should put --rpc before --list-devices and --tensor-split, see #10609 (comment)
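
Concretely (a sketch; the point is the argument order):

# --rpc goes before --list-devices, so the RPC devices are already registered
# when the device list is printed; the same ordering applies with --tensor-split
./llama-server --rpc 192.168.0.21:50052 --list-devices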

@lucyknada
Author

thanks! Though the order displayed by --list-devices is still different from what --tensor-split expects. --list-devices output:

dev0
dev1
rpc0

but --tensor-split (after --rpc) maps them now to:

rpc0
dev0
dev1

is there any way, without tactically OOMing / brute-forcing, to get the device order llama.cpp actually uses with RPC for -ts?

--

the tensor-split ratios are also somewhat confusing to manage: what if I wanted to fill both dev0 and dev1 to their VRAM max and put only the rest onto rpc0? I've resorted to brute-forcing that too and ended up with subpar splits, because 12 GB of course cannot hold as large a share of the ratio as 24 GB

e.g. as opposed to something like a percentage syntax: 0.95, 0.95, 0.95 (use 95% of each device's VRAM)

that's supposed to be the default behavior of llama.cpp with just -ngl, but in reality it looks like it only uses 14 GB of the 24 GB

@rgerganov
Collaborator

is there any way, without tactically OOMing / brute-forcing, to get the device order llama.cpp actually uses with RPC for -ts?

We reorder the available devices before evaluating the model and put RPC devices first. This is a performance optimization, see #9296 for details. --list-devices doesn't respect that and lists devices in the order they are registered. There was an attempt at changing the device registration so that RPC devices come first in the registry, but it was rejected by @slaren here.

So I am not sure how to fix this.
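
Applied to the setup above (rpc0 = 12 GB, dev0 = 24 GB, dev1 = 12 GB) and assuming the rpc0, dev0, dev1 order reported earlier in this thread, a split proportional to raw VRAM would look roughly like this; treat the numbers as a starting point only, since the KV cache and compute buffers still need headroom on each device:

# proportions follow the effective order with the RPC device first:
# rpc0 (12 GB), dev0 (24 GB), dev1 (12 GB)
./llama-server -m ../../models/70B-IQ3_M.gguf -c 8192 -ngl 999 \
    --rpc 192.168.0.21:50052 --tensor-split 12,24,12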

@lucyknada
Author

lucyknada commented Jan 30, 2025

I see, makes sense. If this can't be properly implemented, maybe we could settle for at least a note in the RPC README, I suppose, but I've left a comment on the PR too to see if slaren has an idea.

also, would there be a way with RPC to fill the local devices first before offloading to the remote ones? Something akin to 95% usage on the local devices and only the overflow going to RPC.

rgerganov added a commit to rgerganov/llama.cpp that referenced this issue Feb 4, 2025
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref ggml-org#11435
rgerganov added a commit that referenced this issue Feb 4, 2025
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
tinglou pushed a commit to tinglou/llama.cpp that referenced this issue Feb 13, 2025
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref ggml-org#11435