Text-only prompts don't work for VLMs #177
Comments
It is possible it doesn't -- I need to check back with mlx-vlm (Python) and see what it does. It may require an image.
Based on this, it is the same logic, but in reality the shape of the inputIds isn't what is expected. I think it requires:

```swift
guard let pixelValues, let gridThw else {
    return languageModel(inputIds[.newAxis, 0...]).logits
}
```

Then the output shape isn't right; with an image it looks like this:
It isn't clear to me what would have to happen to make something like this work. The code path in the Python version isn't callable from the command line (it requires an image), nor is it in the Swift version. Perhaps we should additionally:

```swift
guard let image = input.image else { throw VLMError.imageRequired }
```

I know that doesn't quite address your original question, but it looks like that is how it is meant to be used.
Thanks for looking into this! I did actually check mlx-vlm before creating this issue. I can use the Python API to pass an empty image array and it seems to still work.
And here's the script:

```python
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input: an empty image list for a text-only prompt
image = []
prompt = "How are you?"

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```

(Note that apply_chat_template is called with num_images=len(image), i.e. 0, so presumably no image placeholder tokens end up in the prompt.)
OK, good to know -- I can look into how this differs from the Swift version.
OK, so one difference is that the Qwen2-VL Swift code is based on the [email protected]:awni/mlx-vlm.git video branch, and it doesn't work with no images (@awni in case that might be an issue), but the one from mlx-vlm main does work. The difference is:

```python
# video branch
if pixel_values is None:
    return self.language_model(input_ids)

# mlx-vlm main
if pixel_values is None:
    return self.language_model.model.embed_tokens(input_ids)
```

That is, main embeds the tokens rather than running the full language model on them, which is the shape the caller expects. To get the right shapes in Swift we need this:

```swift
guard let pixelValues, let gridThw else {
    return languageModel.model.embedTokens(inputIds[.newAxis, .ellipsis])
}
```
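For illustration, here is a minimal standalone MLX Swift sketch (the token id values below are made up) of what the `[.newAxis, .ellipsis]` indexing does here: it prepends the batch dimension that `embedTokens` and the downstream layers expect:

```swift
import MLX

// Minimal sketch: [.newAxis, .ellipsis] prepends a batch dimension.
// The token ids are arbitrary placeholder values, not a real prompt.
let inputIds = MLXArray([151644, 872, 198])  // shape: [3]
let batched = inputIds[.newAxis, .ellipsis]  // shape: [1, 3]
print(batched.shape)  // prints [1, 3]
```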
Ah, and that is actually a recent fix on the mlx-vlm side: Blaizzy/mlx-vlm#94
Thanks again for tracking this down. This fixes the issue for me!
I'm trying to run mlx-community/Qwen2-VL-2B-Instruct-4bit with a text-only prompt, and I get the following error:

It seems like Qwen2-VL should support text-only prompts (at least, it seems to account for this case when preprocessing input here), but I can't tell where it goes wrong.

I tried making the input 2D as follows:

which further leads to this error:

```
MLX error: [rms_norm] weight must have the same size as the last dimension of x but has 1536 elements.
```

I'd appreciate any help or pointers.