@lockon-n The current weight-loading logic loads each shard into CPU memory, splits the weights, and then transfers them to the GPU. Therefore, when a single shard is too large, it can cause a CPU OOM. Please use the sliced version (a checkpoint split into many smaller shards) to reduce the memory pressure.
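Conceptually, the loading pattern is something like the sketch below (illustrative only; the shard names are hypothetical, and the real loader also splits each tensor across tensor-parallel ranks). Peak CPU usage scales with the largest single shard, which is why many small shards are easier on CPU RAM:

```python
import torch

# Hypothetical shard file names, for illustration only.
shard_files = [
    "pytorch_model-00001-of-00003.bin",
    "pytorch_model-00002-of-00003.bin",
    "pytorch_model-00003-of-00003.bin",
]

state_dict = {}
for shard in shard_files:
    # Each shard is first materialized in CPU RAM, so peak CPU usage is
    # roughly the size of the largest single shard.
    cpu_tensors = torch.load(shard, map_location="cpu")
    for name, tensor in cpu_tensors.items():
        # The (split) weights are then copied over to the GPU.
        state_dict[name] = tensor.to("cuda")
    # Free the CPU copy before loading the next shard.
    del cpu_tensors
```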
-
@jibowang It seems like you have other processes running on the same GPU as vLLM. vLLM is designed to occupy nearly all of the GPU memory for storing KV cache blocks. You can pass in the gpu_memory_utilization argument to reduce the fraction of GPU memory that vLLM reserves.
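For example (a minimal sketch; the model name and the 0.5 fraction are just placeholders):

```python
from vllm import LLM

# Reserve only ~50% of the GPU memory for vLLM (weights + KV cache),
# leaving the rest for other processes on the same GPU.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", gpu_memory_utilization=0.5)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```

The same knob is exposed as --gpu-memory-utilization when launching the API server.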
-
I was wondering about this. I noticed that for meta-llama/Llama-2-7b-chat-hf, loading the model with transformers leaves a GPU RAM footprint of 13.8GB for inference work, while when loaded with vLLM it goes up to 24GB. Is this the reason? I thought it was a bug or a problem.
-
I was trying to use vLLM with a fine-tuned LLaMA 65B model. At first the model was a single complete fp32 bin file, i.e., pytorch_model.bin (more than 200GB), and I found this caused an OOM error (CPU, not CUDA memory) even though the available memory was more than 1TB. This was fixed after I split the big file into many small shards, e.g., 00001-00100.bin. So, do we have to use the sliced version when loading super large models?
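In case it is useful, the split can be done with transformers' save_pretrained, roughly along these lines (the paths, dtype, and shard size are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the single-file checkpoint (placeholder path).
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/finetuned-llama-65b",
    torch_dtype=torch.float16,   # optional: roughly halves the footprint vs. fp32
    low_cpu_mem_usage=True,
)

# save_pretrained writes pytorch_model-000xx-of-000yy.bin shards
# plus an index file, which can then be loaded shard by shard.
model.save_pretrained("/path/to/llama-65b-sharded", max_shard_size="10GB")
```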
Another observation: I run free -h from time to time while loading the model, and the memory usage keeps going up and down, although the general trend is up. Is that a normal phenomenon?