@lockon-n The current weight-loading logic loads each shard into CPU memory, splits the weights, and then transfers them to the GPU. Therefore, when a single shard is too large, it can cause a CPU OOM. Please use the sliced version (a checkpoint split into many smaller shards) to reduce the memory pressure.
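Conceptually, the loading pattern is something like the sketch below (illustrative only; the shard names are hypothetical, and the real loader also splits each tensor across tensor-parallel ranks). Peak CPU usage scales with the largest single shard, which is why many small shards are easier on CPU RAM:

```python
import torch

# Hypothetical shard file names, for illustration only.
shard_files = [
    "pytorch_model-00001-of-00003.bin",
    "pytorch_model-00002-of-00003.bin",
    "pytorch_model-00003-of-00003.bin",
]

state_dict = {}
for shard in shard_files:
    # Each shard is first materialized in CPU RAM, so peak CPU usage is
    # roughly the size of the largest single shard.
    cpu_tensors = torch.load(shard, map_location="cpu")
    for name, tensor in cpu_tensors.items():
        # The (split) weights are then copied over to the GPU.
        state_dict[name] = tensor.to("cuda")
    # Free the CPU copy before loading the next shard.
    del cpu_tensors
```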
-
@jibowang It seems like you have other processes running on the same GPU as vLLM. vLLM is designed to occupy nearly all of the GPU memory for storing KV cache blocks. You can pass in the gpu_memory_utilization argument to reduce the fraction of GPU memory that vLLM reserves.
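For example (a minimal sketch; the model name and the 0.5 fraction are just placeholders):

```python
from vllm import LLM

# Reserve only ~50% of the GPU memory for vLLM (weights + KV cache),
# leaving the rest for other processes on the same GPU.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", gpu_memory_utilization=0.5)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```

The same knob is exposed as --gpu-memory-utilization when launching the API server.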
-
I was wondering about this. I noticed that for meta-llama/Llama-2-7b-chat-hf, loading the model with transformers leaves a GPU RAM footprint of 13.8GB for inference work, while when loaded with vLLM it goes up to 24GB. Is this the reason? I thought it was a bug or a problem.
-
I was trying to use vLLM with a fine-tuned LLaMA 65B model. At first the model was a single complete fp32 bin file, i.e., pytorch_model.bin (more than 200GB), and I found this caused an OOM error (CPU, not CUDA memory) even though the available memory was more than 1TB. This was fixed after I split the big file into many small shards, e.g., 00001-00100.bin. So, do we have to use the sliced version when loading super large models?
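In case it is useful, the split can be done with transformers' save_pretrained, roughly along these lines (the paths, dtype, and shard size are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the single-file checkpoint (placeholder path).
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/finetuned-llama-65b",
    torch_dtype=torch.float16,   # optional: roughly halves the footprint vs. fp32
    low_cpu_mem_usage=True,
)

# save_pretrained writes pytorch_model-000xx-of-000yy.bin shards
# plus an index file, which can then be loaded shard by shard.
model.save_pretrained("/path/to/llama-65b-sharded", max_shard_size="10GB")
```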
Another observation: I run free -h from time to time while loading the model, and the memory usage keeps going up and down, although the general trend is up. Is that a normal phenomenon?