Qwen2-VL runtime error fix when prompted with multiple images #2840

Merged (2 commits, Dec 17, 2024)

Conversation

janne-alatalo
Contributor

What does this PR do?

Fixes #2839

See commit messages for more detailed descriptions of the fix.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Fix the runtime error that occurred when the Qwen2-VL model was prompted with more
than one image. The runtime error was:

 File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
    text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign

The error was caused by the text_length variable becoming negative when a prompt
with multiple images triggered multiple iterations of the main loop in
get_position_ids.

The root cause is a simple logic mistake: next_image_pos is computed as an offset
relative to current_pos, but was used as if it were an absolute position from zero.
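
To make the mix-up concrete, here is a minimal, self-contained sketch of the loop structure the commit message describes. It is not the actual get_position_ids code: the real function builds Qwen2-VL's 3D (temporal/height/width) position ids, whereas this sketch uses plain 1-D positions, assumes every image occupies a fixed number of image_token_id placeholders, and uses hypothetical names (position_ids_sketch, image_len).

    import torch

    def position_ids_sketch(input_ids: torch.Tensor, image_token_id: int, image_len: int) -> torch.Tensor:
        # Simplified 1-D illustration of the relative-vs-absolute offset bug.
        d = input_ids.device
        chunks = []
        current_pos = 0
        total = input_ids.numel()
        while current_pos < total:
            # Positions of image tokens *relative to current_pos*.
            rel = (input_ids[current_pos:] == image_token_id).nonzero()
            if rel.numel() == 0:
                # Only text remains after the last image.
                chunks.append(torch.arange(total - current_pos, device=d))
                break
            next_image_pos = int(rel[0])  # relative offset, not an absolute index
            # Buggy version: text_length = next_image_pos - current_pos
            # (negative as soon as current_pos > 0, i.e. from the second image on).
            text_length = next_image_pos  # fixed: the offset already is the text length
            chunks.append(torch.arange(text_length, device=d))
            chunks.append(torch.arange(image_len, device=d))  # placeholder image positions
            current_pos += next_image_pos + image_len
        return torch.cat(chunks)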
Fix a second runtime error that occurred when the Qwen2-VL model was prompted with
more than one image. The runtime error was:

File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
    inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]

(The shape numbers in the error message can differ depending on the input image
resolutions.)

The error was caused by adding the wrong number of <|image_pad|> tokens to the
tokenized input in the image_text_replacement function.

The root cause is a simple logic mistake: the number of image pad tokens was taken
from the length of the pixel_value_shape tensor's first dimension. However,
pixel_value_shape contains the patches of all images combined, so the code inserted
the total number of image pad tokens required for the whole input at every image
location. This resulted in extra image pad tokens in the tokenized input.

The fix is to compute the number of required tokens from the image_grid_thw tensor,
which contains the grid_t, grid_h, and grid_w values for each image.
grid_t * grid_h * grid_w gives the total number of patches for that image [1], and
the number of required image pad tokens is number_of_patches // 4.

[1] https://github.com/huggingface/transformers/blob/31f9a289a6207be6cae746e009d8e0db523be203/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L311
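
A similarly hedged sketch of the per-image token count described above (not the actual image_text_replacement code; num_image_pad_tokens and image_index are illustrative names). It contrasts the buggy global patch count with the per-image count taken from image_grid_thw.

    import torch

    def num_image_pad_tokens(image_grid_thw: torch.Tensor, image_index: int) -> int:
        # image_grid_thw holds one (grid_t, grid_h, grid_w) row per image.
        grid_t, grid_h, grid_w = (int(v) for v in image_grid_thw[image_index])
        num_patches = grid_t * grid_h * grid_w  # patches of *this* image only
        # Qwen2-VL merges 2x2 patches into one image token, hence the // 4.
        return num_patches // 4

    # Buggy behaviour, for contrast: pixel_values.shape[0] counts the patches of
    # all images combined, so every image location received the pad-token count
    # for the whole input:
    #   num_pads_buggy = pixel_values.shape[0] // 4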
@drbh (Collaborator) left a comment

LGTM! @janne-alatalo thank you for the contribution and fix! I've tested the changes locally and CI is passing on #2849 (the only failing tests are unrelated).

Will merge this branch and close #2849.

@drbh drbh merged commit 7eeefa3 into huggingface:main Dec 17, 2024
10 of 14 checks passed

AHEADer commented Jan 9, 2025

Hi, may I know if you have tried compiling tgi locally for dev? I compiled tgi locally and it took endless time until I got an OOM error. I have 1400G of memory on my local machine. I just ran BUILD_EXTENSIONS=True make install and it seems to be stuck building ext_module forever...

@janne-alatalo
Contributor Author

Hi, may I know if you have tried compiling tgi locally for dev? I compiled tgi locally and it took endless time until I got an OOM error. I have 1400G of memory on my local machine. I just ran BUILD_EXTENSIONS=True make install and it seems to be stuck building ext_module forever...

You probably want to open a new issue about this problem if there isn't one already. I assume your question doesn't have anything to do with this pull request, which was already merged upstream and only contained some small fixes to the Qwen2-VL model.

Anyway, your problem is probably related to the flash-attn pip installation, or to one of the flash-attn submodule installs; I do not remember which it was. It took a very long time for me too, like several hours. If you google "pip install flash-attn takes a long time" you can see that many others have the same problem. I think that setting the export MAX_JOBS=4 env var might have helped a little.

Successfully merging this pull request may close these issues.

RuntimeError when prompting Qwen2-VL with multiple images