[Bug]: Failed to run python vllm/examples/offline_inference/whisper.py #13272

Closed · 1 task done
mru4913 opened this issue Feb 14, 2025 · 2 comments · Fixed by #13274
Labels: bug (Something isn't working)

Comments

mru4913 commented Feb 14, 2025

Your current environment

The env is fine.

🐛 Describe the bug

 python vllm/examples/offline_inference/whisper.py
INFO 02-14 15:54:03 __init__.py:190] Automatically detected platform cuda.
INFO 02-14 15:54:10 config.py:548] This model supports multiple tasks: {'generate', 'transcription', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'transcription'.
INFO 02-14 15:54:10 config.py:1121] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 02-14 15:54:10 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3.dev134+g40932d7a) with config: model='openai/whisper-large-v3', speculative_config=None, tokenizer='openai/whisper-large-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openai/whisper-large-v3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":400}, use_cached_outputs=False, 
INFO 02-14 15:54:12 cuda.py:190] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 02-14 15:54:12 cuda.py:192] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 02-14 15:54:12 cuda.py:227] Using XFormers backend.
INFO 02-14 15:54:13 model_runner.py:1109] Starting to load model openai/whisper-large-v3...
INFO 02-14 15:54:14 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 02-14 15:54:14 weight_utils.py:306] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:00,  9.35it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:00<00:00,  2.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.82it/s]

INFO 02-14 15:54:16 model_runner.py:1114] Loading model weights took 2.8762 GB
INFO 02-14 15:54:16 enc_dec_model_runner.py:280] Starting profile run for multi-modal models.
INFO 02-14 15:54:23 worker.py:267] Memory profiling takes 7.79 seconds
INFO 02-14 15:54:23 worker.py:267] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 02-14 15:54:23 worker.py:267] model weights take 2.88GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 16.07GiB; the rest of the memory reserved for KV Cache is 52.09GiB.
INFO 02-14 15:54:24 executor_base.py:110] # CUDA blocks: 42673, # CPU blocks: 3276
INFO 02-14 15:54:24 executor_base.py:115] Maximum concurrency for 448 tokens per request: 1524.04x
INFO 02-14 15:54:25 model_runner.py:1433] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:13<00:00,  3.89it/s]
INFO 02-14 15:54:38 model_runner.py:1561] Graph capturing finished in 14 secs, took 0.51 GiB
INFO 02-14 15:54:38 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 22.77 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/disk2/albert/vllm/examples/offline_inference/whisper.py", line 47, in <module>
[rank0]:     outputs = llm.generate(prompts, sampling_params)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xm/anaconda3/envs/myenv/lib/python3.12/site-packages/vllm/utils.py", line 1057, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xm/anaconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 438, in generate
[rank0]:     raise ValueError(" ".join(messages))
[rank0]: ValueError: LLM.generate() is only supported for (conditional) generation models (XForCausalLM, XForConditionalGeneration). Your model supports the 'generate' runner, but is currently initialized for the 'transcription' runner. Please initialize vLLM using `--task generate`.

As shown above, running the Whisper example fails because the model defaults to the 'transcription' runner, while LLM.generate() only accepts models initialized for the 'generate' runner.
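The error message itself hints at a workaround: initialize the engine with the task set explicitly. A minimal sketch, assuming the example constructs the engine roughly as below (only `model` and `task` come from the log and error above; the other arguments and the commented-out generate call are illustrative, not the actual example code):

    # Hypothetical sketch of the workaround hinted at by the ValueError above;
    # not the actual examples/offline_inference/whisper.py code.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="openai/whisper-large-v3",
        task="generate",   # override the default 'transcription' task so LLM.generate() is allowed
        max_model_len=448,
    )

    sampling_params = SamplingParams(temperature=0, max_tokens=200)
    # `prompts` would carry the audio multi-modal data exactly as the example builds it:
    # outputs = llm.generate(prompts, sampling_params)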

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
mru4913 added the bug label on Feb 14, 2025
DarkLight1337 (Member) commented:

@NickLucche I think we need to update this example to use the transcription task explicitly. Can we also update the Supported Models page accordingly?

NickLucche (Contributor) commented:

On it
