GPU memory not cleaned up after off-loading layers to GPU using n_gpu_layers
#223
Comments
I am encountering this too. Windows, CuBLAS, AMD CPU, GTX 1080. When the LLM model is destroyed you get the RAM back, but the VRAM stays occupied until the whole python script using it quits. That means you run out of memory when doing a second inference this way, which makes GPU acceleration unusable for me, currently. Kind of a big deal, I think.
Is this a bug then in how llama-cpp-python uses llama.cpp, or a bug in llama.cpp itself? I suspect the latter, which means a bug needs to be logged with llama.cpp that reproduces the issue as simply as possible. Does someone have a few lines of code that reproduce it?
EDIT: Sorry, I see that the OP did provide some code. What happens if you do a `del llm` before returning? My python-fu isn't that strong, but I suspect you need to explicitly destroy the object to destroy the reference to the code running on the GPU. Of course this looks horribly inefficient, as every time the method is called the model needs to be reloaded onto the GPU.
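In code, that suggestion presumably amounts to something like the following sketch (here `llm` stands for the already-created model object):

```python
import gc

del llm        # drop the last Python reference to the model
gc.collect()   # force a collection so any cleanup hooks run promptly
```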
I’ll make a minimal example and update you.
By now I am doing the explicit delete suggested above. Changes nothing: RAM goes down, VRAM stays up.
This snippet should do the job:

```python
from llama_cpp import Llama
import gc
import os


def measure_resources(func):
    # Print the process's RAM usage (via pmap) and GPU memory usage (via nvidia-smi)
    # before and after calling the wrapped function.
    def get_ram_usage(pid):
        ram = os.popen(f'pmap {pid} | tail -1').read().strip()
        return ram.split(' ')[-1]

    def get_gpu_usage(pid):
        gpu = os.popen(f'nvidia-smi --query-compute-apps=pid,used_memory --format=csv | grep {pid}').read().strip()
        return gpu.split(', ')[-1] if gpu else '0 MiB'

    def wrapper():
        pid = os.getpid()
        print('pid:', pid)
        pre_ram, pre_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('pre_ram:', pre_ram, 'pre_gpu:', pre_gpu)
        func()
        post_ram, post_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('post_ram:', post_ram, 'post_gpu:', post_gpu)

    return wrapper


@measure_resources
def generate_text():
    llm = Llama(model_path='./weights/oasst-30b.bin', n_gpu_layers=40)
    del llm
    gc.collect()


if __name__ == '__main__':
    generate_text()
```

Output:
Interestingly, RAM usage also doesn't go down until the process is terminated? Perhaps I'm missing something.
Good repro! I patched it to use a model I have locally and see the same behaviour. You probably want to append it to the bug llama.cpp/issues/1456, but they may ask which llama.cpp version you are on.
oobabooga/text-generation-webui#2087 was able to fix the RAM not being released. Have we already integrated this change?
It seems to me they only fixed the RAM, not the VRAM? RAM was already freed when I tried this. Anyway, it clearly seems to be a llama.cpp problem, and I don't know how this can be open for a week or so; the fix must be a single line or something, over there. As a workaround, I could imagine wrapping the model usage entirely into a thread and killing that after use, to force it to free everything like it does when the python script exits, but I have not tested it. I doubt it would be a fix llama-cpp-python could actually implement though.
Yeah, they fixed the RAM issue. I have been following the thread on llama.cpp, and it seems like the author of the GPU implementation was able to fix it after reproducing it with my snippet. Not sure when the fix will be pushed, though.
But it is a VRAM issue.
Yes, the author was able to clean up VRAM. Check the thread in issue #1456.
The last thing I see there doesn't look like a fix to me. Edit: It is possible I misinterpreted the most recent comment there; I don't know what they tested. The guy above indeed says the issue is fixed in his branch. Hopefully there will just be a fix in llama.cpp soon?
Fix looks to be available in an upstream PR, which also adds a new --tensor-split argument.
I've tested save-load-state with that PR applied.
I tried manually installing llama-cpp-python with the llama.cpp from the PR mentioned here, both with and without the --tensor-split arg, but it resulted in a segmentation fault while loading the model. Until llama-cpp-python gets updated, the best strategy when you need to reload multiple models right now might be to use the subprocess package to execute a separate python script that loads the llama.cpp model and outputs the results. This successfully releases the GPU VRAM.
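For illustration, a minimal sketch of that subprocess workaround; the helper script name run_llama_once.py and its arguments are hypothetical, the point is only that the child process owns the model and all of its VRAM is freed when it exits:

```python
import subprocess

# Run the model in a throwaway Python process; its VRAM is released when it exits.
# "run_llama_once.py" stands for a small script that loads the model with
# llama-cpp-python, runs one completion, and prints the result to stdout.
result = subprocess.run(
    ["python", "run_llama_once.py", "--model", "./weights/oasst-30b.bin", "--prompt", "Hello"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```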
Yeah, this is what I was doing as a workaround.
This may be somewhat fixed in the latest llama-cpp-python version. The VRAM goes down when the model is unloaded. However, the dedicated GPU memory usage does not return to the same level it was at before first loading, and it still goes down further when terminating the python script. But when loading the model again, at least it now returns to the same usage it had before, so it should not run out of VRAM anymore, as far as I can tell. But really, the VRAM usage should completely go away when unloading; the reason for unloading is that you want to make that VRAM available to something else.
The llama.cpp CUDA code allocates static memory buffers for holding temporary results. This is done to avoid having to allocate memory during the computation, which would be much slower. So that's most likely the reason VRAM is not completely freed until the process exits. The static buffers currently scale with batch size, so that parameter can be lowered to reduce VRAM usage.
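If that is right, then in llama-cpp-python terms this would mean passing a smaller n_batch when constructing the model; a rough sketch (the chosen value is only an assumption about the speed/VRAM trade-off):

```python
from llama_cpp import Llama

# Assumption: a smaller n_batch shrinks the static CUDA scratch buffers,
# trading some prompt-processing speed for lower VRAM usage.
llm = Llama(
    model_path='./weights/oasst-30b.bin',
    n_gpu_layers=40,
    n_batch=128,  # smaller than the usual default
)
```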
I don't understand how it can even keep any buffers if I delete the model, and even if it's possible, it should not be allowed to do that. I realize it keeps its memory while I have the model created, but when I do not, there should not be any trace of me even using llama-cpp-python. So, maybe a use case helps: my AI server runs all the time, but I kick the model out of memory if I haven't used it for 10 minutes. If it keeps stuff in memory (RAM or VRAM), this is a problem when I want to play Diablo 4.
It has to do with how the CUDA code works. The memory is not tied to a specific model object but rather it is tied to global static variables.
I see. I hope it can be forced to release its memory without relying on the process quitting; otherwise that sounds pretty incompetent on NVIDIA's side. I really don't want to wrap it in its own process just to work around what I would consider to be a serious memory leak. Or maybe there is something llama-cpp-python can stop holding on to, to trigger full destruction, idk.
It doesn't have anything to do with what NVIDIA did; it's a consequence of the llama.cpp code. There is a global memory buffer pool and a global scratch buffer that are not tied to a specific model.
That's great, that makes it solvable. May I suggest a "cleanup" call in the API or something?
Right now using multiple models in the same process won't work correctly anyways. I'll include a fix that just frees the buffers upon model deletion the next time I make a PR.
Thank you, sounds great! <3
Still basically a memory leak issue for 1 1/2 months now.
Hey guys, in case you have CPU memory issues, check out this issue ggml-org/llama.cpp#2145.
It is a GPU memory issue. VRAM rises just from importing llama-cpp-python. It is not a lot, but in my book that's a no-go already. Then when I load a model with BLAS (CUDA) and a few layers and do inference, VRAM goes to 5 GB. Fine. Then I delete/unload the model: it goes down to 2.5 GB VRAM usage. Terminate the python process: it goes down to 1.1 GB VRAM usage. Nothing should go down at that point, because the model was already deleted. Does it really take 2 months to add some function that frees some kind of sloppy shared resources? By now I am getting the impression this is deemed perfectly cool behavior by llama.cpp, and that llama-cpp-python can't do anything about it, probably due to how some bindings magic works.
That is 100% a llama-cpp-python issue, most likely from the eager initialization of the CUDA backend.
That is very likely due to the buffer pool for prompt processing, see ggml-org/llama.cpp#1935; it will be fixed by ggml-org/llama.cpp#2160.
I'm doing this as a hobby and I don't particularly care about the use cases of other people. I personally only use llama.cpp from the command line or the native server. So I'm not going to spend my time on a temporary fix that manages the deallocation of the buffer pool when the proper fix would be to implement kernels that don't need temporary buffers in the first place. If someone else does care they can make a PR for it.
I mean, if there is a global buffer pool, is it even sloppy to give it something like a "flush" function that llama-cpp-python could call?
Feel free to implement it. As I said, I'm not going to spend my time on a temporary solution.
Yeah, I understand that completely. I didn't mean to sound ungrateful either, and I know I can't demand anything if I'm not going to do it myself and all that. But there are also people in a better position to do it with less effort, and I guess I just don't understand why such a well-managed project doesn't just prioritize fixing something like that. I know I can't technically call it a leak on llama.cpp's end, but apparently the bindings can't fix this either, and in combination it's pretty much a GPU memory leak. Idk. Also, thanks for your GPU inference work, it's pretty cool.
It's just a matter of manpower. I just work on the things that I want for myself when I feel like it, and the only significant CUDA infrastructure contributor other than me is slaren.
I see. I mean, I have never made a pull request in my life, but maybe I will actually look into it.
If you do, you should invoke clearing the buffer pool at the same time that the VRAM scratch buffer gets deallocated.
I am sorry to report that I did in fact opt not to go for a temporary fix, since who knows what the next tool I use decides to keep in a global buffer. So I wrapped all my llama-cpp-python stuff in a process wrapper. Here is the code in case a second person wants to use the VRAM for Stable Diffusion or something:
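A minimal sketch of such a process wrapper, assuming the standard multiprocessing module (function names and the prompt are placeholders, and the real code may differ):

```python
import multiprocessing as mp


def _generate(model_path, prompt, queue):
    # Import inside the child process so CUDA is only initialized there;
    # all of its RAM/VRAM is released when the process exits.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_gpu_layers=40)
    out = llm(prompt, max_tokens=256)
    queue.put(out["choices"][0]["text"])


def generate_in_subprocess(model_path, prompt):
    queue = mp.Queue()
    proc = mp.Process(target=_generate, args=(model_path, prompt, queue))
    proc.start()
    result = queue.get()  # read before join() to avoid blocking on a full pipe
    proc.join()           # once the child exits, the GPU memory is gone
    return result


if __name__ == "__main__":
    print(generate_in_subprocess("./weights/oasst-30b.bin", "Q: What is VRAM? A:"))
```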
I guess it should work for streaming updates too once that works correctly. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected `llama-cpp-python` to do.

The `llm` object should clean up after itself and clear GPU memory.

Current Behavior
Please provide a detailed written description of what `llama-cpp-python` did, instead.

After calling this function, the `llm` object still occupies memory on the GPU. The GPU memory is only released after terminating the python process.

Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Linux name 6.2.13-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 26 Apr 2023 20:50:14 +0000 x86_64 GNU/Linux