Text-only prompts don't work for VLMs #177

Closed · adk9 opened this issue Jan 10, 2025 · 7 comments · Fixed by #179

adk9 commented Jan 10, 2025

I'm trying to run mlx-community/Qwen2-VL-2B-Instruct-4bit with a text-only prompt, and I get the following error:

Model loaded -> id("mlx-community/Qwen2-VL-2B-Instruct-4bit")
Starting generation ...
How are you?
MLX error: [reshape] Cannot reshape array of size 35328 into shape (23,1536,12,128). at /Users/adk9/Library/Developer/Xcode/DerivedData/mlx-swift-examples-cnyznthhkqnptvdsrbolumbygvsd/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:2337

It seems like Qwen2-VL should support text-only prompts (at least, the input preprocessing appears to account for this case), but I can't tell where it goes wrong.

I tried making the input 2D as follows:

// Promote the 1-D token array to (1, seqLen) so it has a batch dimension
let tokens2D = MLXArray(promptTokens).expandedDimensions(axis: 0)
return LMInput(tokens: tokens2D)

which then fails with a different error: MLX error: [rms_norm] weight must have the same size as the last dimension of x but has 1536 elements.

I'd appreciate any help or pointers.

davidkoski (Collaborator) commented:

It is possible it doesn't -- I need to check back with mlx-vlm (Python) and see what it does. It may require an image.

davidkoski (Collaborator) commented:

Based on the corresponding code in mlx-vlm, this is the same logic, but in practice the shape of inputIds isn't what is expected. I think it requires:

        guard let pixelValues, let gridThw else {
            // Text-only path: add a batch dimension and run the language model
            return languageModel(inputIds[.newAxis, 0...]).logits
        }

Then the output shape isn't right:

(lldb) po inputEmbeddings.shape
▿ 3 elements
  - 0 : 1
  - 1 : 22
  - 2 : 151936

with an image it looks like this:

(lldb) po inputEmbeddings.shape
▿ 3 elements
  - 0 : 1
  - 1 : 15576
  - 2 : 1536

It isn't clear to me what would have to happen to produce the expected shape here.

The code path in the Python version isn't callable from the command line (it requires an image), nor is it in the Swift version. Perhaps we should additionally:

        guard let image = input.image else { throw VLMError.imageRequired }

I know that doesn't quite address your original question, but it looks like that is how the model is meant to be used.
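
For illustration, here is a minimal sketch of where such a fail-fast check could sit; the surrounding function name and signature are assumptions made for this sketch, not the actual MLXVLM code:

    // Hypothetical processor entry point (name and signature assumed):
    // rejecting text-only input up front gives a clear error instead of
    // the opaque reshape failure above.
    func prepare(input: UserInput) throws -> LMInput {
        guard let image = input.image else { throw VLMError.imageRequired }
        // ... tokenize the prompt and preprocess `image` as in the existing code ...
        fatalError("remainder omitted in this sketch")
    }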


adk9 commented Jan 13, 2025

Thanks for looking into this!

I did actually check mlx-vlm before creating this issue. I can use the Python API to pass an empty image array and it seems to still work.

~ python generate.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 11 files: 100%|██████████████████████████████████████████████████████| 11/11 [00:00<00:00, 132960.65it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████| 11/11 [00:00<00:00, 125033.45it/s]
I am an artificial intelligence, so I don't have feelings. However, I'm here to help you with any questions you have.

And here's generate.py:

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = []
prompt = "How are you?"

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

davidkoski (Collaborator) commented:

OK, good to know -- I can look into how this differs from the Swift version.

davidkoski (Collaborator) commented:

OK, so one difference is that the Qwen2-VL Swift code is based on the video branch of git@github.com:awni/mlx-vlm.git, and that branch doesn't work with no images:

Traceback (most recent call last):
  File "/Users/dkoski/Developer/mlx-vlm-video/t.py", line 21, in <module>
    output = generate(model, processor, formatted_prompt, image, verbose=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/Developer/mlx-vlm-video/mlx_vlm/utils.py", line 1033, in generate
    prompt_tokens = mx.array(processor.tokenizer.encode(prompt))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2635, in encode
    encoded_inputs = self.encode_plus(
                     ^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3054, in encode_plus
    return self._encode_plus(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 613, in _encode_plus
    batched_output = self._batch_encode_plus(
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 539, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

@awni, in case that might be an issue.

but the one from mlx-vlm does work. The difference is:

# video branch
        if pixel_values is None:
            return self.language_model(input_ids)

# mlx-vlm main
        if pixel_values is None:
            return self.language_model.model.embed_tokens(input_ids)

To get the right shapes on the Swift side we need this:

        guard let pixelValues, let gridThw else {
            // Text-only path: return token embeddings (with a batch dimension
            // from .newAxis) rather than logits, matching mlx-vlm main
            return languageModel.model.embedTokens(inputIds[.newAxis, .ellipsis])
        }
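
For context, here is a minimal sketch of how that guard sits at the top of the embedding step; the helper's name, signature, and types are assumptions based on this thread, not the actual Qwen2VL.swift source. The key difference is that embedTokens returns hidden-size embeddings (last dimension 1536), whereas calling the full language model returned vocabulary logits (last dimension 151936), which is what produced the bad shapes above.

    // Sketch only: assumed shape of the helper discussed in this thread.
    private func inputEmbeddings(
        inputIds: MLXArray, pixelValues: MLXArray?, gridThw: MLXArray?
    ) -> MLXArray {
        // Text-only prompt: no pixels and no grid, so just embed the tokens.
        // .newAxis adds the batch dimension -> (1, seqLen, hiddenSize)
        guard let pixelValues, let gridThw else {
            return languageModel.model.embedTokens(inputIds[.newAxis, .ellipsis])
        }
        // Image path: compute vision features from pixelValues / gridThw and
        // merge them into the token embeddings (unchanged from the existing code).
        fatalError("image path omitted in this sketch")
    }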

davidkoski (Collaborator) commented:

Ah, and that is actually a recent fix on the mlx-vlm side: Blaizzy/mlx-vlm#94

davidkoski self-assigned this Jan 14, 2025

adk9 commented Jan 16, 2025

Thanks again for tracking this down. This fixes the issue for me!
