[Model] Add support for the multi-modal Llama 3.2 model #8811
Merged

Changes from all commits (82 commits)
566d57f heheda12345: add llamav tokeninizer and redirect loader to it
218145a heheda12345: start to load shape
1c57f26 heheda12345: copy original model
5233e2d heheda12345: add LlamaVLConfig
72b9a8a heheda12345: can load weight, attention is ignored
2dd36f5 heheda12345: skip profile run by hardcode, can start model execution
ba9507d heheda12345: Merge branch 'main' of github.com:vllm-project/vllm
affa9ba heheda12345: can run text tokenizer now
f633de5 heheda12345: finish image preprocessor
de8bbad heheda12345: can run vision encoder now
30239ad heheda12345: run prefill self attention
6972cbf heheda12345: run prefill crossattention
4e1344b heheda12345: can generate the first token :)
f3d869d heheda12345: can perform offline e2e run without decode crossattn, but wrong answer
6f26a3b heheda12345: pass mm data in encoder-decoder
fa0912e heheda12345: prefill result matches now. Model is speaking human words.
46634ff heheda12345: generate correct result for single image
6b73f4d heheda12345: can support arbitary number of image, need better mask for image_cnt<>1
fb10a70 heheda12345: temp save for profile run
718f879 heheda12345: can run tp, but wrong answer
2644349 heheda12345: can run tp for small model with correct result
ec4cb9c heheda12345: tp for vision encoder
fc01266 heheda12345: update image preprocessor
3e1d249 heheda12345: support text-only input
c5ba3cf heheda12345: Merge tag 'v0.6.1.post2' into llamavl
cac19d5 heheda12345: enable profile run
7e5eadd heheda12345: copy mllama from transformer
7e3fb1e heheda12345: can init model from vllm
49b05d6 heheda12345: weight loader
2e66a5d heheda12345: run image encoder now
9770d84 simon-mo: Add API Server Support
c9d612b heheda12345: run single image reqeusts correctly
2f54ae3 heheda12345: single image match huggingface result
9e2d4ea heheda12345: Merge remote-tracking branch 'origin/meta-ckpt-early-api-server' into…
8f3989e heheda12345: small fix
01621a5 heheda12345: remove old code
65a470b heheda12345: hardcode some config to read huggingface's config.json without modify…
2146716 heheda12345: move prompt to encoder prompt
062534b heheda12345: hardcode to match tokenizer result
23f04b4 heheda12345: update test script
4ed4e6e heheda12345: update test script
c140258 heheda12345: support text-only input
f662fdd heheda12345: fix bug in text only prompt
6cf166a heheda12345: add unit test
b7124e5 heheda12345: add complex tests, but cannot run single-gpu and multi-gpu at the sam…
e69f127 heheda12345: seperate encoder/decoder dummy input, support max_image=1
e0e297c heheda12345: add mllamaconfig to override some params, simplying the model code (WIP)
f6732cf heheda12345: upd
228b66b heheda12345: code cleanup
f30319c heheda12345: remove image processing from input processor
471e79f heheda12345: fix precision issue of RMSNorm
2a0cb7e heheda12345: only keep usefull vision encoder layer
f4a7e1e heheda12345: Merge remote-tracking branch 'public/main' into llamavl
efbd9b8 heheda12345: merge main
a596997 heheda12345: format code
70b6bb3 heheda12345: try formater again
31000d0 heheda12345: try formater again
5be8a65 heheda12345: try formater again again again
8505a8f heheda12345: try formater again again again again
a32c3ab heheda12345: update example
10d1736 heheda12345: fix bug in openai api -> chat template
0aa61b0 heheda12345: change model based on new hf
b993988 heheda12345: make formater happy
9065770 heheda12345: update model name in example
bc34aa4 heheda12345: remove mllama chat template, use HF's instead
a25e383 CatherineSue: [Bugfix] Include encoder_prompt_tokens in num_prompt_tokensin UsageInfo
9b931bf heheda12345: Merge pull request #6 from vllm-project/chang/num_prompt_tokens
1eefdc7 heheda12345: update config based on HF update
ccebf14 heheda12345: Merge branch 'main' of github.com:vllm-project/vllm
d7750d3 heheda12345: update doc and hf model id
1ebd6dc heheda12345: update hf model id again
3b6fb2b heheda12345: Merge branch 'main' of github.com:vllm-project/vllm
c857735 heheda12345: fix format problem
e4bf803 heheda12345: Apply suggestions from code review
4d7fe0a heheda12345: Update vllm/worker/enc_dec_model_runner.py
4cdc6b5 heheda12345: Update vllm/worker/worker.py
a6ad79f heheda12345: Update vllm/worker/worker.py
8364093 heheda12345: upgrade huggingface
a12c8d3 heheda12345: Update vllm/transformers_utils/configs/__init__.py
4065047 heheda12345: update code based on code review
293f07f ywang96: add note
3db294b ywang96: format
@@ -38,7 +38,7 @@
         "content": [
             {
                 "type": "text",
-                "text": "What’s in this image?"
+                "text": "What's in this image?"
             },
             {
                 "type": "image_url",

@@ -75,7 +75,7 @@ def encode_image_base64_from_url(image_url: str) -> str:
         "content": [
             {
                 "type": "text",
-                "text": "What’s in this image?"
+                "text": "What's in this image?"
             },
             {
                 "type": "image_url",
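The hunks above edit an OpenAI-style vision request in the example client. As a minimal sketch of how such a message could be sent to a vLLM OpenAI-compatible server (not part of this PR's diff; the port, API key, and image URL are placeholders):

# Sketch: send a text + image_url message to a locally running
# vLLM OpenAI-compatible server. Assumes the server was started with
# the Llama 3.2 vision model; URL and key below are placeholders.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                # Placeholder image URL.
                "image_url": {"url": "https://example.com/image.jpg"},
            },
        ],
    }],
)
print(chat_response.choices[0].message.content)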
tests/models/encoder_decoder/vision_language/test_mllama.py (283 additions, 0 deletions)
@@ -0,0 +1,283 @@
from typing import List, Optional, Tuple, Type, overload

import pytest
from transformers import (AutoConfig, AutoModelForVision2Seq, AutoTokenizer,
                          BatchEncoding)

from vllm.multimodal.utils import rescale_image_size
from vllm.sequence import SampleLogprobs

from ....conftest import (IMAGE_ASSETS, HfRunner, PromptImageInput, VllmRunner,
                          _ImageAssets)
from ....utils import multi_gpu_test
from ...utils import check_logprobs_close

_LIMIT_IMAGE_PER_PROMPT = 1

HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    "stop_sign":
    "<|image|><|begin_of_text|>The meaning of the image is",
    "cherry_blossom":
    "<|image|><|begin_of_text|>The city is",
})

text_only_prompts = [
    "The color of the sky is blue but sometimes it can also be",
]

models = [
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
]


def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
                                         Optional[SampleLogprobs]],
                      model: str):
    """Sanitize vllm output to be comparable with hf output."""
    output_ids, output_str, out_logprobs = vllm_output

    config = AutoConfig.from_pretrained(model)
    image_token_id = config.image_token_index

    tokenizer = AutoTokenizer.from_pretrained(model)
    eos_token_id = tokenizer.eos_token_id

    hf_output_ids = [
        token_id for idx, token_id in enumerate(output_ids)
        if token_id != image_token_id or output_ids[idx - 1] != image_token_id
    ]

    assert output_str[0] == " "
    hf_output_str = output_str[1:]
    if hf_output_ids[-1] == eos_token_id:
        hf_output_str = hf_output_str + tokenizer.decode(eos_token_id)

    return hf_output_ids, hf_output_str, out_logprobs


@overload
def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    image_assets: _ImageAssets,
    model: str,
    *,
    size_factors: List[float],
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    ...


@overload
def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    image_assets: _ImageAssets,
    model: str,
    *,
    sizes: List[Tuple[int, int]],
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    ...


def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    image_assets: _ImageAssets,
    model: str,
    *,
    size_factors: Optional[List[float]] = None,
    sizes: Optional[List[Tuple[int, int]]] = None,
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    images = [asset.pil_image for asset in image_assets]

    if size_factors is not None:
        inputs_per_image = [(
            [prompt for _ in size_factors],
            [rescale_image_size(image, factor) for factor in size_factors],
        ) for image, prompt in zip(images, HF_IMAGE_PROMPTS)]
    elif sizes is not None:
        inputs_per_image = [(
            [
                prompt if size is not None else text_only_prompts[0]
                for size in sizes
            ],
            [
                image.resize(size) if size is not None else None
                for size in sizes
            ],
        ) for image, prompt in zip(images, HF_IMAGE_PROMPTS)]
        if len(sizes) == 0:
            inputs_per_image.append(
                (text_only_prompts, [None] * len(text_only_prompts)))
    else:
        raise ValueError("You must provide either `size_factors` or `sizes`")

    _run_test(hf_runner,
              vllm_runner,
              inputs_per_image,
              model,
              dtype=dtype,
              max_tokens=max_tokens,
              num_logprobs=num_logprobs,
              tensor_parallel_size=tensor_parallel_size,
              distributed_executor_backend=distributed_executor_backend)


def _run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    inputs: List[Tuple[List[str], PromptImageInput]],
    model: str,
    *,
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    """Inference result should be the same between hf and vllm.

    All the image fixtures for the test are from IMAGE_ASSETS.
    For huggingface runner, we provide the PIL images as input.
    For vllm runner, we provide MultiModalDataDict objects
    and corresponding MultiModalConfig as input.
    Note, the text input is also adjusted to abide by vllm contract.
    The text output is sanitized to be able to compare with hf.
    """
    # NOTE: take care of the order. run vLLM first, and then run HF.
    # vLLM needs a fresh new process without cuda initialization.
    # if we run HF first, the cuda initialization will be done and it
    # will hurt multiprocessing backend with fork method (the default method).

    # max_model_len should be greater than image_feature_size
    with vllm_runner(model,
                     dtype=dtype,
                     max_num_seqs=16,
                     max_model_len=4096,
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
                     enforce_eager=True,
                     limit_mm_per_prompt={"image": _LIMIT_IMAGE_PER_PROMPT
                                          }) as vllm_model:
        vllm_outputs_per_image = [
            vllm_model.generate_greedy_logprobs(prompts,
                                                max_tokens,
                                                num_logprobs=num_logprobs,
                                                images=images)
            for prompts, images in inputs
        ]

    def process(hf_inputs: BatchEncoding):
        return hf_inputs

    from transformers import AutoConfig
    from transformers.models.mllama import MllamaConfig as MllamaConfigHf

    # use transformer's MllamaConfig for hf_runner
    # and vllm's MllamaConfig for vllm_runner
    AutoConfig.register("mllama", MllamaConfigHf, exist_ok=True)
    with hf_runner(model,
                   dtype=dtype,
                   postprocess_inputs=process,
                   auto_cls=AutoModelForVision2Seq) as hf_model:
        hf_outputs_per_image = [
            hf_model.generate_greedy_logprobs_limit(prompts,
                                                    max_tokens,
                                                    num_logprobs=num_logprobs,
                                                    images=images)
            for prompts, images in inputs
        ]

    from vllm.transformers_utils.configs.mllama import MllamaConfig
    AutoConfig.register("mllama", MllamaConfig, exist_ok=True)
    for hf_outputs, vllm_outputs in zip(hf_outputs_per_image,
                                        vllm_outputs_per_image):
        check_logprobs_close(
            outputs_0_lst=hf_outputs,
            outputs_1_lst=[
                vllm_to_hf_output(vllm_output, model)
                for vllm_output in vllm_outputs
            ],
            name_0="hf",
            name_1="vllm",
        )


@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
    "sizes",
    [
        # Text only
        [],
        # Single-size
        [(512, 512)],
        # Single-size, batched
        [(512, 512), (512, 512), (512, 512)],
        # Multi-size, batched
        [(512, 512), (1024, 512), (1536, 512), (2048, 512), (512, 1024),
         (1024, 1024), (512, 1536), (512, 2028)],
        # Multi-size, batched, including text only
        [(512, 512), (1024, 512), (1536, 512), (2048, 512), (512, 1024),
         (1024, 1024), (512, 1536), (512, 2028), None],
        # mllama has 8 possible aspect ratios, carefully set the sizes
        # to cover all of them
    ],
)
@pytest.mark.parametrize("dtype", ["bfloat16"])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [5])
def test_models(hf_runner, vllm_runner, image_assets, model, sizes, dtype,
                max_tokens, num_logprobs) -> None:
    run_test(
        hf_runner,
        vllm_runner,
        image_assets,
        model,
        sizes=sizes,
        dtype=dtype,
        max_tokens=max_tokens,
        num_logprobs=num_logprobs,
        tensor_parallel_size=1,
    )


@multi_gpu_test(num_gpus=2)
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
    "sizes",
    [
        [(512, 512), (1024, 512), (1536, 512), (2048, 512), (512, 1024),
         (1024, 1024), (512, 1536), (512, 2028), None],
    ],
)
@pytest.mark.parametrize("dtype", ["bfloat16"])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [5])
def test_models_distributed(hf_runner, vllm_runner, image_assets, model, sizes,
                            dtype, max_tokens, num_logprobs) -> None:
    run_test(
        hf_runner,
        vllm_runner,
        image_assets,
        model,
        sizes=sizes,
        dtype=dtype,
        max_tokens=max_tokens,
        num_logprobs=num_logprobs,
        tensor_parallel_size=2,
    )
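The test above drives the model through vLLM's HF and vLLM test runners and checks that their log-probabilities agree. As a rough illustration of what this PR enables outside the test harness, here is a minimal offline-inference sketch (not taken from the PR itself): the prompt format and engine arguments mirror the test configuration above, while the image path and sampling settings are placeholders.

# Sketch: offline inference with the multi-modal Llama 3.2 model in vLLM.
# Assumes a local image file "example.jpg"; adjust as needed.
from PIL import Image
from vllm import LLM, SamplingParams

image = Image.open("example.jpg").convert("RGB")

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=4096,              # mirrors the test configuration above
    max_num_seqs=16,
    enforce_eager=True,
    limit_mm_per_prompt={"image": 1},  # one image per prompt, as in the test
)

outputs = llm.generate(
    {
        # Same prompt style as HF_IMAGE_PROMPTS in the test.
        "prompt": "<|image|><|begin_of_text|>The meaning of the image is",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)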
Review comment on the prompt-text change above: the non-ASCII right single quotation mark (’) is replaced with the plain ASCII apostrophe (') to avoid encoding errors.
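A trivial sketch (not from the PR) of the same normalization applied programmatically, in case similar prompts are generated from text that may contain curly quotes:

# Replace U+2019 (right single quotation mark) with the ASCII apostrophe.
text = "What’s in this image?"
ascii_text = text.replace("\u2019", "'")  # -> "What's in this image?"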