[Core] Support LoRA on quantized models #4012
Conversation
@Yard1 All checks have passed. Could I trouble you to review it again?
LGTM
Very nice -- ty for closing out the last chores 🎉
output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

del llm_tp1
cleanup()
@Yard1 @jeejeelee I'm not able to release GPU memory using cleanup() or del llm_tp1. Are you sure this would work (if CI had multiple GPUs and we could enable this test)?
On version 0.4.2, once I start vllm.LLM I can't find any way to release the GPU memory again; I have to kill the process. Do you know any other way?
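For reference, the teardown I'd expect cleanup() to perform looks roughly like the sketch below. This is an assumption on my part rather than the exact helper in the test suite, and the import path for destroy_model_parallel has moved between vLLM versions.

```python
# Hedged sketch of an engine teardown helper; the destroy_model_parallel
# import path is an assumption and differs across vLLM versions.
import contextlib
import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel  # assumed path


def cleanup():
    destroy_model_parallel()
    # destroy_process_group complains if no group was ever initialized; ignore that case.
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
    torch.cuda.empty_cache()
```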
My memory is a bit fuzzy, but I'm quite sure test_baichuan.py can be tested with tp=2. I've tested it myself.
I have tried it with the following code and the GPU memory is released normally. However, I didn't use the released 0.4.2 version; I used f12c3b5 instead.
@pytest.mark.parametrize("model", MODELS)
# @pytest.mark.skip("Requires multiple GPUs")
def test_quant_model_tp_equality(tinyllama_lora_files, model):
    # Cannot use as it will initialize torch.cuda too early...
    # if torch.cuda.device_count() < 2:
    #     pytest.skip(f"Not enough GPUs for tensor parallelism {2}")
    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

    del llm_tp1
    cleanup()

    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

    del llm_tp1
    cleanup()
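One rough way to confirm the release (a sketch, not part of the test; the helper name is mine) is to compare torch.cuda.memory_allocated() before and after the teardown:

```python
import torch


def gpu_allocated_gib() -> float:
    # Bytes currently held by live tensors on the default CUDA device, in GiB.
    # Note: this misses memory held by other processes (e.g. tensor-parallel workers).
    return torch.cuda.memory_allocated() / 2**30


# Inside the test body, around the teardown:
#   print(f"before: {gpu_allocated_gib():.2f} GiB")
#   del llm_tp1
#   cleanup()
#   print(f"after:  {gpu_allocated_gib():.2f} GiB")
```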
I don't have any multi-GPU machines available right now; I will test again with tp=2 later.
Great, thanks for checking that. I believe tensor_parallel_size=2 uses multiprocessing, so the behavior is different from what I experimented with. I'll post a new issue.
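For anyone hitting the same problem, a workaround that side-steps in-process teardown entirely is to run each engine in its own subprocess, so the GPU memory is reclaimed when the child exits. This is only a hedged sketch, not the approach used in this PR, and generate_in_subprocess is a name I made up:

```python
import multiprocessing as mp


def _generate(model_path, prompts, queue):
    # Import inside the child so CUDA is only ever initialized in this process.
    import vllm

    llm = vllm.LLM(model=model_path, gpu_memory_utilization=0.8)
    outputs = llm.generate(prompts, vllm.SamplingParams(max_tokens=32))
    queue.put([o.outputs[0].text for o in outputs])
    # Everything the engine allocated is reclaimed when this child process exits.


def generate_in_subprocess(model_path, prompts):
    ctx = mp.get_context("spawn")  # spawn, not fork: the parent may have touched CUDA already
    queue = ctx.Queue()
    proc = ctx.Process(target=_generate, args=(model_path, prompts, queue))
    proc.start()
    result = queue.get()
    proc.join()
    return result
```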
Building upon the excellent work done in #2828. Since there hasn't been much progress on #2828, I'd like to continue and complete this feature. Compared to #2828, the main improvement is the addition of tensor parallelism support. @fmmoret Please feel free to contact me if anything here seems off.
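For illustration, here is a minimal usage sketch of the combination this PR enables: a quantized base model plus a LoRA adapter under tensor parallelism. The model name and adapter path are placeholders for illustration, not taken from this PR:

```python
import vllm
from vllm.lora.request import LoRARequest

llm = vllm.LLM(
    model="TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    enable_lora=True,
    max_loras=4,
    tensor_parallel_size=2,
)
outputs = llm.generate(
    ["What does tensor parallelism do?"],
    vllm.SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),  # placeholder adapter path
)
print(outputs[0].outputs[0].text)
```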