[Bug]: CUDA Graph Capture Error with Llama-3.2-11B-Vision-Instruct-bnb-4bit on RTX 4090 #11587
Closed
Labels: bug (Something isn't working)
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
Environment:
Error Description:
The model fails during CUDA graph capture with the error: "CUDA error: operation failed due to a previous error during capture". Memory profile shows:
Stack Trace:
The error occurs in mllama.py during the forward pass: "This operation is not permitted during stream capture."
Additional Context:
The error persists even with 4-bit quantization enabled. The model initialization completes successfully, but fails during the CUDA graph capture phase.
Question:
Is there a workaround to run this vision model with vLLM on a single RTX 4090? I've tried adjusting memory utilization and sequence parameters without success.
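For context, a minimal sketch of the kind of load call I'm using (the model ID and parameter values below are illustrative, not my exact settings):

```python
# Rough reconstruction of the offline vLLM invocation; values are assumptions.
from vllm import LLM

llm = LLM(
    model="unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",  # assumed HF repo ID
    quantization="bitsandbytes",       # 4-bit bnb checkpoint
    load_format="bitsandbytes",
    dtype="bfloat16",
    max_model_len=4096,                # reduced to fit 24 GB of VRAM
    gpu_memory_utilization=0.90,       # one of the knobs adjusted without success
    # enforce_eager=True,              # disables CUDA graph capture entirely;
    #                                  # possibly an avenue around the capture error
)
```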