Text-only prompts don't work for VLMs #177

Closed · adk9 opened this issue Jan 10, 2025 · 7 comments · Fixed by #179

adk9 commented Jan 10, 2025

I'm trying to run mlx-community/Qwen2-VL-2B-Instruct-4bit with a text-only prompt, and I get the following error:

Model loaded -> id("mlx-community/Qwen2-VL-2B-Instruct-4bit")
Starting generation ...
How are you?
MLX error: [reshape] Cannot reshape array of size 35328 into shape (23,1536,12,128). at /Users/adk9/Library/Developer/Xcode/DerivedData/mlx-swift-examples-cnyznthhkqnptvdsrbolumbygvsd/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:2337

It seems like Qwen2-VL should support text-only prompts (at least, the input preprocessing appears to account for this case), but I can't tell where it goes wrong.

I tried making the input 2D as follows:

// Promote the 1-D token array to (1, seqLen) so it has a batch dimension
let tokens2D = MLXArray(promptTokens).expandedDimensions(axis: 0)
return LMInput(tokens: tokens2D)

which then fails with a different error: MLX error: [rms_norm] weight must have the same size as the last dimension of x but has 1536 elements.

I'd appreciate any help or pointers.

davidkoski (Collaborator) commented:

It is possible it doesn't -- I need to check back with mlx-vlm (Python) and see what it does. It may require an image.

davidkoski (Collaborator) commented:

Based on the corresponding code in mlx-vlm, this is the same logic, but in practice the shape of inputIds isn't what is expected. I think it requires:

        guard let pixelValues, let gridThw else {
            // Text-only path: add a batch dimension and run the language model
            return languageModel(inputIds[.newAxis, 0...]).logits
        }

Then the output shape isn't right:

(lldb) po inputEmbeddings.shape
▿ 3 elements
  - 0 : 1
  - 1 : 22
  - 2 : 151936

with an image it looks like this:

(lldb) po inputEmbeddings.shape
▿ 3 elements
  - 0 : 1
  - 1 : 15576
  - 2 : 1536

It isn't clear to me what would have to happen to produce the expected shape here.

The code path in the Python version isn't callable from the command line (it requires an image), nor is it in the Swift version. Perhaps we should additionally:

        guard let image = input.image else { throw VLMError.imageRequired }

I know that doesn't quite address your original question, but it looks like that is how the model is meant to be used.
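
For illustration, here is a minimal sketch of where such a fail-fast check could sit; the surrounding function name and signature are assumptions made for this sketch, not the actual MLXVLM code:

    // Hypothetical processor entry point (name and signature assumed):
    // rejecting text-only input up front gives a clear error instead of
    // the opaque reshape failure above.
    func prepare(input: UserInput) throws -> LMInput {
        guard let image = input.image else { throw VLMError.imageRequired }
        // ... tokenize the prompt and preprocess `image` as in the existing code ...
        fatalError("remainder omitted in this sketch")
    }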


adk9 commented Jan 13, 2025

Thanks for looking into this!

I did actually check mlx-vlm before creating this issue. I can use the Python API to pass an empty image array and it seems to still work.

~ python generate.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 11 files: 100%|██████████████████████████████████████████████████████| 11/11 [00:00<00:00, 132960.65it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████| 11/11 [00:00<00:00, 125033.45it/s]
I am an artificial intelligence, so I don't have feelings. However, I'm here to help you with any questions you have.

And here's generate.py:

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = []
prompt = "How are you?"

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

davidkoski (Collaborator) commented:

OK, good to know -- I can look into how this differs from the Swift version.

davidkoski (Collaborator) commented:

OK, so one difference is that the Qwen2-VL Swift code is based on the video branch of git@github.com:awni/mlx-vlm.git, and that branch doesn't work with no images:

Traceback (most recent call last):
  File "/Users/dkoski/Developer/mlx-vlm-video/t.py", line 21, in <module>
    output = generate(model, processor, formatted_prompt, image, verbose=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/Developer/mlx-vlm-video/mlx_vlm/utils.py", line 1033, in generate
    prompt_tokens = mx.array(processor.tokenizer.encode(prompt))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2635, in encode
    encoded_inputs = self.encode_plus(
                     ^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3054, in encode_plus
    return self._encode_plus(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 613, in _encode_plus
    batched_output = self._batch_encode_plus(
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkoski/miniconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 539, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

@awni, in case that might be an issue.

but the one from mlx-vlm does work. The difference is:

# video branch
        if pixel_values is None:
            return self.language_model(input_ids)

# mlx-vlm main
        if pixel_values is None:
            return self.language_model.model.embed_tokens(input_ids)

To get the right shapes on the Swift side we need this:

        guard let pixelValues, let gridThw else {
            // Text-only path: return token embeddings (with a batch dimension
            // from .newAxis) rather than logits, matching mlx-vlm main
            return languageModel.model.embedTokens(inputIds[.newAxis, .ellipsis])
        }
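
For context, here is a minimal sketch of how that guard sits at the top of the embedding step; the helper's name, signature, and types are assumptions based on this thread, not the actual Qwen2VL.swift source. The key difference is that embedTokens returns hidden-size embeddings (last dimension 1536), whereas calling the full language model returned vocabulary logits (last dimension 151936), which is what produced the bad shapes above.

    // Sketch only: assumed shape of the helper discussed in this thread.
    private func inputEmbeddings(
        inputIds: MLXArray, pixelValues: MLXArray?, gridThw: MLXArray?
    ) -> MLXArray {
        // Text-only prompt: no pixels and no grid, so just embed the tokens.
        // .newAxis adds the batch dimension -> (1, seqLen, hiddenSize)
        guard let pixelValues, let gridThw else {
            return languageModel.model.embedTokens(inputIds[.newAxis, .ellipsis])
        }
        // Image path: compute vision features from pixelValues / gridThw and
        // merge them into the token embeddings (unchanged from the existing code).
        fatalError("image path omitted in this sketch")
    }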

davidkoski (Collaborator) commented:

Ah, and that is actually a recent fix on the mlx-vlm side: Blaizzy/mlx-vlm#94

davidkoski self-assigned this Jan 14, 2025

adk9 commented Jan 16, 2025

Thanks again for tracking this down. This fixes the issue for me!
