Misc. bug: llama-server with rpc oom's allocation even though plenty left on devices #11435
Comments
You don't need to run RPC servers for local devices. Start RPC servers only for the remote devices.
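For illustration, a minimal sketch of that setup, assuming a single remote host at 192.168.1.20, the default RPC port 50052, and a placeholder model path (all of these are assumptions, not taken from this issue); local GPUs are used directly by llama-server without any local rpc-server:

```sh
# On the remote machine only: expose its GPU(s) over RPC
rpc-server -H 0.0.0.0 -p 50052

# On the local machine: local CUDA devices are used directly;
# only the remote endpoint is listed in --rpc
llama-server -m ./models/model-70b-q4_k_m.gguf -ngl 99 \
  --rpc 192.168.1.20:50052
```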
That was indeed the issue, thanks! Possibly worth adding as a note to the RPC example README? Also, I'm still noticing a very weird usage pattern:
Is there a way to affect that, since the 24 GB device is not being used properly on the same node?
You can use --tensor-split to control how memory is split across devices.
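As a hedged example of what that could look like (the ratios, model path, and remote endpoint below are placeholders; --tensor-split takes proportions assigned per device in the order llama.cpp enumerates them):

```sh
# Roughly 70% of the layers on one device, 15% on each of the other two
# (hypothetical ratios; adjust to your VRAM sizes)
llama-server -m ./models/model-70b-q4_k_m.gguf -ngl 99 \
  --rpc 192.168.1.20:50052 \
  --tensor-split 0.7,0.15,0.15
```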
Indeed that works, thanks! Though with RPC the order seems unpredictable, and --list-devices does not show RPC devices and their indices. For example:
Putting it e.g. as 0.7,0.15,0.15 (per device, including RPC?) or 0.8,0.2 (maybe local then remote?) OOMs rpc-dev0 with what should have been on dev0. I have tried many other values too and landed on a rather odd order: rpc0, dev1, dev0. Is that a bug or expected?
You should put the RPC devices first in the --tensor-split values.
Thanks! Though the order displayed by --list-devices is still different from what --tensor-split wants. --list-devices output:
but --tensor-split (after --rpc) maps them now to:
Is there any way, without tactically OOMing / brute-forcing, to get the llama.cpp-mapped order of devices with RPC for -ts? The tensor-split ratios are also somewhat confusing to manage: what if I wanted to fill both dev0 and dev1 to their VRAM max and then put the rest onto rpc0? I've resorted to brute-forcing that too and ended up on subpar splits, because 12 GB of course cannot fit as much of the ratio as 24 GB. Compare something like a percentage syntax: 0.95,0.95,0.95 (use 95% of each device's VRAM). That's supposed to be the default behavior of llama.cpp with just -ngl, but in reality it looks like it uses only 14 GB of the 24 GB.
We reorder the available devices before evaluating the model and put RPC devices first. This is a performance optimization, see #9296 for details. So I am not sure how to fix this. |
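To make the resulting order concrete, a sketch assuming one remote RPC device plus two local CUDA devices (model path, endpoint, and ratios are placeholders): because RPC devices are moved to the front, the first --tensor-split value applies to the RPC device, and the remaining values to the local devices in the order the backend enumerates them, which may differ from the --list-devices output.

```sh
# With the reordering described above, the split is consumed as:
#   rpc0 (remote) first, then the local CUDA devices
llama-server -m ./models/model-70b-q4_k_m.gguf -ngl 99 \
  --rpc 192.168.1.20:50052 \
  --tensor-split 0.2,0.5,0.3   # 0.2 -> rpc0, 0.5 and 0.3 -> local GPUs
```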
I see, makes sense. If this can't be properly implemented, maybe we could settle for at least a note inside the RPC README, I suppose, but I've left a comment on the PR too to see if slaren has an idea. Also, would there be a way with RPC to overflow the local devices first before offloading to remote? Something akin to 95% usage on local devices and then just the overflow to RPC.
List devices in the same order as they appear when evaluating the model and splitting tensors across devices, i.e. RPC devices come first in the list. ref #11435
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
version: 4561 (6f53d8a)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
localhost: 2 RPC servers or 1, doesn't matter, it always OOMs. With 1 RPC server running for two devices the allocation is even worse, at around 10% allocated.
Loading the 70B model it allocates:
Then it OOM-crashes after a bit:
Even though there is more space on the 24 GB device, and especially on the third device with 12 GB.
First Bad Commit
No response
Relevant log output