CUDA/OpenCL error, out of memory when reload. #1456

edp1096 · 2023-05-14T17:56:02Z

Hello folks,

When try save-load-state example with CUDA, error occured.
It seems to necessary to add something toward llama_free function.

n_gpu_layers variable is appended at main function like below.

int main(int argc, char ** argv) {
    ...
    auto lparams = llama_context_default_params();

    lparams.n_ctx     = params.n_ctx;
    lparams.n_parts   = params.n_parts;
    lparams.n_gpu_layers = params.n_gpu_layers; // Add gpu layers count
    lparams.seed      = params.seed;
    ...
}

And tried to run as below.

D:\dev\pcbangstudio\workspace\my-llama\bin>save-load-state.exe -m ggml-vic7b-q4_0.bin -ngl 32
main: build = 548 (60f8c36)
llama.cpp: loading model from ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3860 MB
llama_init_from_file: kv self size  =  256.00 MB

The quick brown fox jumps over the lazy dog.

<!-- InstanceEnd -->Visible transl

llama.cpp: loading model from ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
CUDA error 2 at D:\dev\pcbangstudio\workspace\my-llama\llama.cpp\ggml-cuda.cu:462: out of memory

D:\dev\pcbangstudio\workspace\my-llama\bin>

The text was updated successfully, but these errors were encountered:

FSSRepo · 2023-05-15T03:14:27Z

It seems that llama_free is not releasing the memory used by the previously used weights.

edp1096 · 2023-05-17T07:31:54Z

I found all gpu malloc call cudaFree except ggml_cuda_transform_tensor in ggml_cuda.cu
Is there reason to leave qkv layers in state of allocated?

bfrasure · 2023-05-21T04:48:24Z

For some reason, I was having this problem but I solved it by killing the task TabNine-deep-local.exe. That might have been local to my computer, but if your GPU is holding onto the memory, try closing some of the processes.

edp1096 · 2023-05-21T05:22:47Z

@bfrasure What is TabNine? If it means code assistant application which you said, I don't use it.

bfrasure · 2023-05-21T06:28:35Z

It's an extension I loaded with VSCode. Looking further, I don't think it's related.

edp1096 · 2023-05-22T14:48:22Z

I could deallocate the gpu offloaded parts by llama_free() modifying.
#1459 for clblast, a pr which is not accepted yet is working but #1412 for cuda is not woking.

nidhishs · 2023-05-25T14:34:10Z

I wanted to bring more attention to this issue, @JohannesGaessler, as downstream packages are being affected by offloaded layers not being cleaned from GPU VRAM.

JohannesGaessler · 2023-05-25T21:20:51Z

I can't reproduce the issue. In any case. if I had to guess the problem is not that the cuda buffers for the model weights aren't being deallocated but rather that they are getting allocated multiple times. I will soon make a PR that overhauls the CUDA code to make it more scalable and I'll try to include a fix then.

nidhishs · 2023-05-25T21:25:44Z

Using the python bindings on Linux, this snippet was able to reproduce the issue:

from llama_cpp import Llama
import gc
import os

def measure_resources(func):
    def get_ram_usage(pid):
        ram = os.popen(f'pmap {pid} | tail -1').read().strip()
        return ram.split(' ')[-1]
    
    def get_gpu_usage(pid):
        gpu = os.popen(f'nvidia-smi --query-compute-apps=pid,used_memory --format=csv | grep {pid}').read().strip()
        return gpu.split(', ')[-1] if gpu else '0 MiB'

    def wrapper():
        pid = os.getpid()
        print('pid:', pid)
        pre_ram, pre_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('pre_ram:', pre_ram, 'pre_gpu:', pre_gpu)
        func()
        post_ram, post_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('post_ram:', post_ram, 'post_gpu:', post_gpu)

    return wrapper

@measure_resources
def generate_text():
    llm = Llama(model_path=os.environ.get("MODEL"), n_gpu_layers=40)
    del llm
    gc.collect()

if __name__ == '__main__':
    generate_text()

Output:

pid: 13121
pre_ram: 720676K pre_gpu: 0 MiB
llama.cpp: loading model from ./weights/oasst-30b.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32016
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 135.75 KB
llama_model_load_internal: mem required  = 25573.29 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 15307 MB
llama_init_from_file: kv self size  =  780.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
post_ram: 25209048K post_gpu: 16074 MiB

More info here.

JohannesGaessler · 2023-05-25T22:08:22Z

Thanks for the code snippet, I can reproduce the issue now. I think I'll be able to fix it by adding a destructor to ggml_tensor although deleting and then recreating LLama Python objects will still require you to load up VRAM every time.

JohannesGaessler · 2023-05-25T22:52:05Z

I was able to fix this issue on my branch where I'm refactoring CUDA code.

edp1096 · 2023-05-26T00:33:50Z

#include "common.h"
#include "llama.h"
#include "build-info.h"

#include <vector>
#include <cstdio>
#include <chrono>

int main(int argc, char ** argv) {
    gpt_params params;
    params.seed = 42;
    params.n_threads = 4;
    params.repeat_last_n = 64;
    params.prompt = "The quick brown fox";

    if (gpt_params_parse(argc, argv, params) == false) {
        return 1;
    }

    fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);

    if (params.n_predict < 0) {
        params.n_predict = 16;
    }

    auto lparams = llama_context_default_params();

    lparams.n_ctx     = params.n_ctx;
    lparams.n_gpu_layers = params.n_gpu_layers;  /** Here, I modified for gpu offload enabling */
    lparams.seed      = params.seed;
    lparams.f16_kv    = params.memory_f16;
    lparams.use_mmap  = params.use_mmap;
    lparams.use_mlock = params.use_mlock;

    auto n_past = 0;
    auto last_n_tokens_data = std::vector<llama_token>(params.repeat_last_n, 0);

    // init
    auto ctx = llama_init_from_file(params.model.c_str(), lparams);
    auto tokens = std::vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, params.prompt.c_str(), tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        fprintf(stderr, "%s : failed to tokenize prompt\n", __func__);
        return 1;
    }

    // evaluate prompt
    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, params.n_threads);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    const size_t state_size = llama_get_state_size(ctx);
    uint8_t * state_mem = new uint8_t[state_size];

    // Save state (rng, logits, embedding and kv_cache) to file
    {
        FILE *fp_write = fopen("dump_state.bin", "wb");
        llama_copy_state_data(ctx, state_mem); // could also copy directly to memory mapped file
        fwrite(state_mem, 1, state_size, fp_write);
        fclose(fp_write);
    }

    // save state (last tokens)
    const auto last_n_tokens_data_saved = std::vector<llama_token>(last_n_tokens_data);
    const auto n_past_saved = n_past;

    // first run
    printf("\n%s", params.prompt.c_str());

    for (auto i = 0; i < params.n_predict; i++) {
        auto logits = llama_get_logits(ctx);
        auto n_vocab = llama_n_vocab(ctx);
        std::vector<llama_token_data> candidates;
        candidates.reserve(n_vocab);
        for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
            candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
        }
        llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
        auto next_token = llama_sample_token(ctx, &candidates_p);
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);

        printf("%s", next_token_str);
        if (llama_eval(ctx, &next_token, 1, n_past, params.n_threads)) {
            fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
            return 1;
        }
        n_past += 1;
    }

    printf("\n\n");

    // free old model
    llama_free(ctx);

    // load new model
    auto ctx2 = llama_init_from_file(params.model.c_str(), lparams);

    // Load state (rng, logits, embedding and kv_cache) from file
    {
        FILE *fp_read = fopen("dump_state.bin", "rb");
        if (state_size != llama_get_state_size(ctx2)) {
            fprintf(stderr, "\n%s : failed to validate state size\n", __func__);
            return 1;
        }

        const size_t ret = fread(state_mem, 1, state_size, fp_read);
        if (ret != state_size) {
            fprintf(stderr, "\n%s : failed to read state\n", __func__);
            return 1;
        }

        llama_set_state_data(ctx2, state_mem);  // could also read directly from memory mapped file
        fclose(fp_read);
    }

    delete[] state_mem;

    // restore state (last tokens)
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // second run
    for (auto i = 0; i < params.n_predict; i++) {
        auto logits = llama_get_logits(ctx2);
        auto n_vocab = llama_n_vocab(ctx2);
        std::vector<llama_token_data> candidates;
        candidates.reserve(n_vocab);
        for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
            candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
        }
        llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
        auto next_token = llama_sample_token(ctx2, &candidates_p);
        auto next_token_str = llama_token_to_str(ctx2, next_token);
        last_n_tokens_data.push_back(next_token);

        printf("%s", next_token_str);
        if (llama_eval(ctx2, &next_token, 1, n_past, params.n_threads)) {
            fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
            return 1;
        }
        n_past += 1;
    }

    printf("\n\n");

    return 0;
}

Here is code I used.

D:\llama.cpp_test>save-load-state.exe -m vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin -ngl 32
main: build = 589 (1fcdcc2)
llama.cpp: loading model from vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB
..................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

The quick brown fox jumps over the lazy dog.

<!-- InstanceEnd -->Visible transl

llama.cpp: loading model from vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB
.........................................................................................CUDA error 2 at D:\dev\pcbangstudio\workspace\llama.cpp\ggml-cuda.cu:935: out of memory

D:\llama.cpp_test>

I tried vicuna 7b model which consume about 4gb vram on 3060ti 8gb also with #1530 .

llama_free function works well for cpu ram.

For vram, still not work.

JohannesGaessler · 2023-05-28T10:06:37Z

I added a fix in this PR #1607 where I'm refactoring the CUDA code. However, I added a new CLI argument --tensor-split and because of that the Python script that I used to reproduce the memory leak seems to now be broken:

ggml_init_cublas: found 1 CUDA devices:
  1. NVIDIA GeForce RTX 3090
pid: 2070536
pre_ram: 8135808K pre_gpu: 628 MiB
Fatal Python error: PyEval_RestoreThread: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: initialized

Current thread 0x00007fdd557c3740 (most recent call first):
  File "/home/johannesg/Projects/llama-cpp-python/llama_cpp/llama_cpp.py", line 207 in llama_context_default_params
  File "/home/johannesg/Projects/llama-cpp-python/llama_cpp/llama.py", line 128 in __init__
  File "/home/johannesg/Projects/llama.cpp/oom.py", line 29 in generate_text
  File "/home/johannesg/Projects/llama.cpp/oom.py", line 21 in wrapper
  File "/home/johannesg/Projects/llama.cpp/oom.py", line 34 in <module>
[1]    2070536 IOT instruction (core dumped)  python3 oom.py

Can I easily fix this on my end or will llamacpp-python need to be updated?

nidhishs · 2023-05-28T10:26:19Z

I think llama-cpp-python needs to be updated. I briefly looked at the code that’s causing the error. Seems like we will need to update the default parameters being passed during initialization to llama.cpp. What do you think @gjmulder?

edp1096 · 2023-05-28T11:59:26Z

@JohannesGaessler It seems work for cuda. Does it also affect to opencl?

gjmulder · 2023-05-28T12:00:11Z

@nidhishs @JohannesGaessler, I believe @abetlen's policy is to expose all parameters that llama.cpp exposes so they can be configured within python.

It is certainly required when doing apples-to-apples tests as we seem to be getting a number of "llama-cpp-python is slower than llama.cpp" issues.

JohannesGaessler · 2023-05-28T14:57:53Z

Wait, I think I may have been using the wrong Python bindings. I was using this repository which worked for me to reproduce the bug. Can someone give me a quick rundown for the difference between this and abetlen's repository?

JohannesGaessler · 2023-05-28T14:58:54Z

It seems work for cuda. Does it also affect to opencl?

I have not made any changes to OpenCL.

JohannesGaessler · 2023-05-28T15:03:06Z

Disregard my previous post, I was using the correct repository.

edp1096 · 2023-06-07T13:06:41Z

Currently CUDA release vram usage well. Thank you @JohannesGaessler .
@0cc4m I'm still looking forward for opencl but if you're busy, can I post a PR for this?

0cc4m · 2023-06-07T13:16:52Z

Go ahead, sure.

edp1096 · 2023-06-07T13:45:06Z

Ok, I will do it.

edp1096 · 2023-06-09T23:16:05Z

Done. @0cc4m thank you for acceptance.

iactix · 2023-07-19T18:38:17Z

Oh this is closed? That probably explains why I'm still waiting for the memory leak fix in llama-cpp-python 2 months later.

edp1096 · 2023-07-20T03:39:31Z

@iactix Yes, the leakage issues which I'v met atleast , were solved. If you still have problem, you can open another issue.

strnad mentioned this issue May 15, 2023

Add settings UI for llama.cpp and fixed reloading of llama.cpp models oobabooga/text-generation-webui#2087

Merged

iactix mentioned this issue May 21, 2023

GPU memory not cleaned up after off-loading layers to GPU using n_gpu_layers abetlen/llama-cpp-python#223

Open

4 tasks

gjmulder added bug Something isn't working high priority Very important issue hardware Hardware related labels May 22, 2023

edp1096 changed the title ~~CUDA error, out of memory when reload.~~ CUDA/OpenCL error, out of memory when reload. May 28, 2023

edp1096 mentioned this issue Jun 7, 2023

OpenCL: Add release memory #1741

Merged

edp1096 closed this as completed Jun 9, 2023

vuongvmu mentioned this issue Aug 9, 2023

Out of GPU memory when creating multiple sessions SciSharp/LLamaSharp#94

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA/OpenCL error, out of memory when reload. #1456

CUDA/OpenCL error, out of memory when reload. #1456

edp1096 commented May 14, 2023 •

edited

Loading

FSSRepo commented May 15, 2023

edp1096 commented May 17, 2023 •

edited

Loading

bfrasure commented May 21, 2023 •

edited

Loading

edp1096 commented May 21, 2023

bfrasure commented May 21, 2023

edp1096 commented May 22, 2023

nidhishs commented May 25, 2023

JohannesGaessler commented May 25, 2023

nidhishs commented May 25, 2023

JohannesGaessler commented May 25, 2023

JohannesGaessler commented May 25, 2023

edp1096 commented May 26, 2023

JohannesGaessler commented May 28, 2023

nidhishs commented May 28, 2023

edp1096 commented May 28, 2023

gjmulder commented May 28, 2023

JohannesGaessler commented May 28, 2023 •

edited

Loading

JohannesGaessler commented May 28, 2023

JohannesGaessler commented May 28, 2023

edp1096 commented Jun 7, 2023

0cc4m commented Jun 7, 2023

edp1096 commented Jun 7, 2023

edp1096 commented Jun 9, 2023

iactix commented Jul 19, 2023 •

edited

Loading

edp1096 commented Jul 20, 2023

CUDA/OpenCL error, out of memory when reload. #1456

CUDA/OpenCL error, out of memory when reload. #1456

Comments

edp1096 commented May 14, 2023 • edited Loading

FSSRepo commented May 15, 2023

edp1096 commented May 17, 2023 • edited Loading

bfrasure commented May 21, 2023 • edited Loading

edp1096 commented May 21, 2023

bfrasure commented May 21, 2023

edp1096 commented May 22, 2023

nidhishs commented May 25, 2023

JohannesGaessler commented May 25, 2023

nidhishs commented May 25, 2023

JohannesGaessler commented May 25, 2023

JohannesGaessler commented May 25, 2023

edp1096 commented May 26, 2023

JohannesGaessler commented May 28, 2023

nidhishs commented May 28, 2023

edp1096 commented May 28, 2023

gjmulder commented May 28, 2023

JohannesGaessler commented May 28, 2023 • edited Loading

JohannesGaessler commented May 28, 2023

JohannesGaessler commented May 28, 2023

edp1096 commented Jun 7, 2023

0cc4m commented Jun 7, 2023

edp1096 commented Jun 7, 2023

edp1096 commented Jun 9, 2023

iactix commented Jul 19, 2023 • edited Loading

edp1096 commented Jul 20, 2023

edp1096 commented May 14, 2023 •

edited

Loading

edp1096 commented May 17, 2023 •

edited

Loading

bfrasure commented May 21, 2023 •

edited

Loading

JohannesGaessler commented May 28, 2023 •

edited

Loading

iactix commented Jul 19, 2023 •

edited

Loading