Qwen2-VL runtime error fix when prompted with multiple images #2840

Merged (2 commits, Dec 17, 2024)

Conversation

janne-alatalo
Contributor

What does this PR do?

Fixes #2839

See commit messages for more detailed descriptions of the fix.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Fix the runtime error that occurred when the Qwen2-VL model was prompted with more
than one image. The runtime error was:

 File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
    text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign

The error was caused by the text_length variable becoming negative when a prompt
with multiple images triggered multiple iterations of the main loop in
get_position_ids.

The root cause is a simple logic mistake: next_image_pos is computed as an offset
relative to current_pos, but was used as if it were an absolute position from zero.
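
To make the mix-up concrete, here is a minimal, self-contained sketch of the loop structure the commit message describes. It is not the actual get_position_ids code: the real function builds Qwen2-VL's 3D (temporal/height/width) position ids, whereas this sketch uses plain 1-D positions, assumes every image occupies a fixed number of image_token_id placeholders, and uses hypothetical names (position_ids_sketch, image_len).

    import torch

    def position_ids_sketch(input_ids: torch.Tensor, image_token_id: int, image_len: int) -> torch.Tensor:
        # Simplified 1-D illustration of the relative-vs-absolute offset bug.
        d = input_ids.device
        chunks = []
        current_pos = 0
        total = input_ids.numel()
        while current_pos < total:
            # Positions of image tokens *relative to current_pos*.
            rel = (input_ids[current_pos:] == image_token_id).nonzero()
            if rel.numel() == 0:
                # Only text remains after the last image.
                chunks.append(torch.arange(total - current_pos, device=d))
                break
            next_image_pos = int(rel[0])  # relative offset, not an absolute index
            # Buggy version: text_length = next_image_pos - current_pos
            # (negative as soon as current_pos > 0, i.e. from the second image on).
            text_length = next_image_pos  # fixed: the offset already is the text length
            chunks.append(torch.arange(text_length, device=d))
            chunks.append(torch.arange(image_len, device=d))  # placeholder image positions
            current_pos += next_image_pos + image_len
        return torch.cat(chunks)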
Fix a second runtime error that occurred when the Qwen2-VL model was prompted with
more than one image. The runtime error was:

File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
    inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]

(The shape numbers in the error message can differ depending on the input image
resolutions.)

The error was caused by adding the wrong number of <|image_pad|> tokens to the
tokenized input in the image_text_replacement function.

The root cause is a simple logic mistake: the number of image pad tokens was taken
from the length of the pixel_value_shape tensor's first dimension. However,
pixel_value_shape contains the patches of all images combined, so the code inserted
the total number of image pad tokens required for the whole input at every image
location. This resulted in extra image pad tokens in the tokenized input.

The fix is to compute the number of required tokens from the image_grid_thw tensor,
which contains the grid_t, grid_h, and grid_w values for each image.
grid_t * grid_h * grid_w gives the total number of patches for that image [1], and
the number of required image pad tokens is number_of_patches // 4.

[1] https://github.com/huggingface/transformers/blob/31f9a289a6207be6cae746e009d8e0db523be203/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L311
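
A similarly hedged sketch of the per-image token count described above (not the actual image_text_replacement code; num_image_pad_tokens and image_index are illustrative names). It contrasts the buggy global patch count with the per-image count taken from image_grid_thw.

    import torch

    def num_image_pad_tokens(image_grid_thw: torch.Tensor, image_index: int) -> int:
        # image_grid_thw holds one (grid_t, grid_h, grid_w) row per image.
        grid_t, grid_h, grid_w = (int(v) for v in image_grid_thw[image_index])
        num_patches = grid_t * grid_h * grid_w  # patches of *this* image only
        # Qwen2-VL merges 2x2 patches into one image token, hence the // 4.
        return num_patches // 4

    # Buggy behaviour, for contrast: pixel_values.shape[0] counts the patches of
    # all images combined, so every image location received the pad-token count
    # for the whole input:
    #   num_pads_buggy = pixel_values.shape[0] // 4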
@drbh (Collaborator) left a comment

LGTM! @janne-alatalo thank you for the contribution and fix! I've tested the changes locally and CI is passing on #2849 (the only failing tests are unrelated).

Will merge this branch and close #2849.

@drbh drbh merged commit 7eeefa3 into huggingface:main Dec 17, 2024
10 of 14 checks passed

AHEADer commented Jan 9, 2025

Hi, may I know if you have tried compiling tgi locally for dev? I compiled tgi locally and it took endless time until I got an OOM error. I have 1400G of memory on my local machine. I just ran BUILD_EXTENSIONS=True make install and it seems to be stuck building ext_module forever...

@janne-alatalo
Contributor Author

Hi, may I know if you have tried compiling tgi locally for dev? I compiled tgi locally and it took endless time until I got an OOM error. I have 1400G of memory on my local machine. I just ran BUILD_EXTENSIONS=True make install and it seems to be stuck building ext_module forever...

You probably want to open a new issue about this problem if there isn't one already. I assume your question doesn't have anything to do with this pull request, which was already merged upstream and only contained some small fixes to the Qwen2-VL model.

Anyway, your problem is probably related to the flash-attn pip installation, or to one of the flash-attn submodule installs; I do not remember which it was. It took a very long time for me too, like several hours. If you google "pip install flash-attn takes a long time" you can see that many others have the same problem. I think that setting the export MAX_JOBS=4 env var might have helped a little.

Successfully merging this pull request may close these issues.

RuntimeError when prompting Qwen2-VL with multiple images