CUDA/OpenCL error, out of memory when reload. #1456
Comments
It seems that …
I found that every GPU malloc call has a matching cudaFree except …
For some reason, I was having this problem, but I solved it by killing the task TabNine-deep-local.exe. That might have been specific to my computer, but if your GPU is holding onto the memory, try closing some of the processes.
@bfrasure What is TabNine? If you mean the code-assistant application, I don't use it.
It's an extension I loaded with VSCode. Looking further, I don't think it's related.
I could deallocate the GPU-offloaded parts by modifying llama_free().
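For illustration only, here is a minimal sketch of the kind of change meant here, assuming a hypothetical list of device pointers tracked per context; the names are invented and are not the actual llama.cpp internals:

```cpp
#include <cuda_runtime.h>

#include <vector>

// Hypothetical bookkeeping: device pointers that were cudaMalloc'ed
// when layers were offloaded to the GPU.
struct gpu_offload_state {
    std::vector<void *> device_buffers;
};

// Sketch of the cleanup a llama_free-style teardown would need:
// release every device buffer owned by the context.
void free_gpu_offloads(gpu_offload_state & state) {
    for (void * buf : state.device_buffers) {
        cudaFree(buf); // returns the VRAM to the driver
    }
    state.device_buffers.clear();
}
```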
I wanted to bring more attention to this issue, @JohannesGaessler, as downstream packages are being affected by offloaded layers not being cleared from GPU VRAM.
I can't reproduce the issue. In any case, if I had to guess, the problem is not that the CUDA buffers for the model weights aren't being deallocated, but rather that they are getting allocated multiple times. I will soon make a PR that overhauls the CUDA code to make it more scalable, and I'll try to include a fix then.
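To make the distinction concrete, here is a hedged sketch of the "allocated multiple times" failure mode; the struct and function are invented for illustration and are not taken from the llama.cpp source:

```cpp
#include <cuda_runtime.h>

#include <cstddef>

// Invented stand-in for a weight tensor that may already live on the GPU.
struct fake_tensor {
    void * device_data = nullptr;
    size_t n_bytes     = 0;
};

// Suspected failure mode: the offload path always allocates, so loading
// the same weights again orphans the previous buffer instead of reusing
// or freeing it, and VRAM usage grows with every (re)load.
void offload_tensor(fake_tensor & t, size_t n_bytes) {
    // missing guard: if (t.device_data != nullptr) cudaFree(t.device_data);
    cudaMalloc(&t.device_data, n_bytes);
    t.n_bytes = n_bytes;
}
```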
Using the python bindings on Linux, this snippet was able to reproduce the issue:

```python
from llama_cpp import Llama
import gc
import os


def measure_resources(func):
    def get_ram_usage(pid):
        ram = os.popen(f'pmap {pid} | tail -1').read().strip()
        return ram.split(' ')[-1]

    def get_gpu_usage(pid):
        gpu = os.popen(f'nvidia-smi --query-compute-apps=pid,used_memory --format=csv | grep {pid}').read().strip()
        return gpu.split(', ')[-1] if gpu else '0 MiB'

    def wrapper():
        pid = os.getpid()
        print('pid:', pid)
        pre_ram, pre_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('pre_ram:', pre_ram, 'pre_gpu:', pre_gpu)
        func()
        post_ram, post_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('post_ram:', post_ram, 'post_gpu:', post_gpu)

    return wrapper


@measure_resources
def generate_text():
    llm = Llama(model_path=os.environ.get("MODEL"), n_gpu_layers=40)
    del llm
    gc.collect()


if __name__ == '__main__':
    generate_text()
```

Output:
More info here.
Thanks for the code snippet, I can reproduce the issue now. I think I'll be able to fix it by adding a destructor to …
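As a general illustration of the destructor approach (class name invented; the actual fix lives in the llama.cpp CUDA code), here is a minimal RAII sketch where the device allocation is tied to the owning object's lifetime, so destroying a context cannot leave VRAM behind:

```cpp
#include <cuda_runtime.h>

#include <cstddef>

// Minimal RAII sketch: cudaFree runs on every destruction path, so the
// buffer cannot outlive its owner.
class device_buffer {
public:
    explicit device_buffer(size_t n_bytes) {
        cudaMalloc(&data_, n_bytes);
    }
    ~device_buffer() {
        cudaFree(data_);
    }

    // non-copyable: copying would lead to a double free
    device_buffer(const device_buffer &) = delete;
    device_buffer & operator=(const device_buffer &) = delete;

    void * data() const { return data_; }

private:
    void * data_ = nullptr;
};
```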
I was able to fix this issue on my branch where I'm refactoring CUDA code. |
#include "common.h"
#include "llama.h"
#include "build-info.h"
#include <vector>
#include <cstdio>
#include <chrono>
int main(int argc, char ** argv) {
gpt_params params;
params.seed = 42;
params.n_threads = 4;
params.repeat_last_n = 64;
params.prompt = "The quick brown fox";
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);
if (params.n_predict < 0) {
params.n_predict = 16;
}
auto lparams = llama_context_default_params();
lparams.n_ctx = params.n_ctx;
lparams.n_gpu_layers = params.n_gpu_layers; /** Here, I modified for gpu offload enabling */
lparams.seed = params.seed;
lparams.f16_kv = params.memory_f16;
lparams.use_mmap = params.use_mmap;
lparams.use_mlock = params.use_mlock;
auto n_past = 0;
auto last_n_tokens_data = std::vector<llama_token>(params.repeat_last_n, 0);
// init
auto ctx = llama_init_from_file(params.model.c_str(), lparams);
auto tokens = std::vector<llama_token>(params.n_ctx);
auto n_prompt_tokens = llama_tokenize(ctx, params.prompt.c_str(), tokens.data(), tokens.size(), true);
if (n_prompt_tokens < 1) {
fprintf(stderr, "%s : failed to tokenize prompt\n", __func__);
return 1;
}
// evaluate prompt
llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, params.n_threads);
last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
n_past += n_prompt_tokens;
const size_t state_size = llama_get_state_size(ctx);
uint8_t * state_mem = new uint8_t[state_size];
// Save state (rng, logits, embedding and kv_cache) to file
{
FILE *fp_write = fopen("dump_state.bin", "wb");
llama_copy_state_data(ctx, state_mem); // could also copy directly to memory mapped file
fwrite(state_mem, 1, state_size, fp_write);
fclose(fp_write);
}
// save state (last tokens)
const auto last_n_tokens_data_saved = std::vector<llama_token>(last_n_tokens_data);
const auto n_past_saved = n_past;
// first run
printf("\n%s", params.prompt.c_str());
for (auto i = 0; i < params.n_predict; i++) {
auto logits = llama_get_logits(ctx);
auto n_vocab = llama_n_vocab(ctx);
std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
}
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
auto next_token = llama_sample_token(ctx, &candidates_p);
auto next_token_str = llama_token_to_str(ctx, next_token);
last_n_tokens_data.push_back(next_token);
printf("%s", next_token_str);
if (llama_eval(ctx, &next_token, 1, n_past, params.n_threads)) {
fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
return 1;
}
n_past += 1;
}
printf("\n\n");
// free old model
llama_free(ctx);
// load new model
auto ctx2 = llama_init_from_file(params.model.c_str(), lparams);
// Load state (rng, logits, embedding and kv_cache) from file
{
FILE *fp_read = fopen("dump_state.bin", "rb");
if (state_size != llama_get_state_size(ctx2)) {
fprintf(stderr, "\n%s : failed to validate state size\n", __func__);
return 1;
}
const size_t ret = fread(state_mem, 1, state_size, fp_read);
if (ret != state_size) {
fprintf(stderr, "\n%s : failed to read state\n", __func__);
return 1;
}
llama_set_state_data(ctx2, state_mem); // could also read directly from memory mapped file
fclose(fp_read);
}
delete[] state_mem;
// restore state (last tokens)
last_n_tokens_data = last_n_tokens_data_saved;
n_past = n_past_saved;
// second run
for (auto i = 0; i < params.n_predict; i++) {
auto logits = llama_get_logits(ctx2);
auto n_vocab = llama_n_vocab(ctx2);
std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
}
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
auto next_token = llama_sample_token(ctx2, &candidates_p);
auto next_token_str = llama_token_to_str(ctx2, next_token);
last_n_tokens_data.push_back(next_token);
printf("%s", next_token_str);
if (llama_eval(ctx2, &next_token, 1, n_past, params.n_threads)) {
fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
return 1;
}
n_past += 1;
}
printf("\n\n");
return 0;
} Here is code I used. D:\llama.cpp_test>save-load-state.exe -m vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin -ngl 32
main: build = 589 (1fcdcc2)
llama.cpp: loading model from vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB
..................................................................................................
llama_init_from_file: kv self size = 256.00 MB
The quick brown fox jumps over the lazy dog.
<!-- InstanceEnd -->Visible transl
llama.cpp: loading model from vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB
.........................................................................................CUDA error 2 at D:\dev\pcbangstudio\workspace\llama.cpp\ggml-cuda.cu:935: out of memory
D:\llama.cpp_test>
```

I tried the Vicuna 7B model, which consumes about 4 GB of VRAM on a 3060 Ti (8 GB), also with #1530. For VRAM, it still does not work.
I added a fix in this PR #1607, where I'm refactoring the CUDA code. However, I added a new CLI argument …
Can I easily fix this on my end, or will llama-cpp-python need to be updated?
I think llama-cpp-python needs to be updated. I briefly looked at the code that's causing the error. It seems we will need to update the default parameters being passed to llama.cpp during initialization. What do you think, @gjmulder?
@JohannesGaessler It seems to work for CUDA. Does it also affect OpenCL?
@nidhishs @JohannesGaessler, I believe @abetlen's policy is to expose all parameters that … It is certainly required when doing apples-to-apples tests, as we seem to be getting a number of "llama-cpp-python is slower than llama.cpp" issues.
I have not made any changes to OpenCL. |
Disregard my previous post, I was using the correct repository. |
Currently, CUDA releases VRAM correctly. Thank you @JohannesGaessler.
Go ahead, sure. |
Ok, I will do it. |
Done. @0cc4m, thank you for accepting it.
Oh this is closed? That probably explains why I'm still waiting for the memory leak fix in llama-cpp-python 2 months later. |
@iactix Yes, the leakage issues, at least the ones I had met, were solved. If you still have a problem, you can open another issue.
Hello folks,
When I try the `save-load-state` example with CUDA, an error occurs. It seems necessary to add something to the `llama_free` function. The `n_gpu_layers` variable is appended in the main function as in the code shown earlier, and I tried to run it as shown there.