[Bug]: Failed to run python vllm/examples/offline_inference/whisper.py #13272

Closed · 1 task done
mru4913 opened this issue Feb 14, 2025 · 2 comments · Fixed by #13274
Labels: bug (Something isn't working)

Comments

mru4913 commented Feb 14, 2025

Your current environment

The env is fine.

🐛 Describe the bug

 python vllm/examples/offline_inference/whisper.py
INFO 02-14 15:54:03 __init__.py:190] Automatically detected platform cuda.
INFO 02-14 15:54:10 config.py:548] This model supports multiple tasks: {'generate', 'transcription', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'transcription'.
INFO 02-14 15:54:10 config.py:1121] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 02-14 15:54:10 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3.dev134+g40932d7a) with config: model='openai/whisper-large-v3', speculative_config=None, tokenizer='openai/whisper-large-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openai/whisper-large-v3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":400}, use_cached_outputs=False, 
INFO 02-14 15:54:12 cuda.py:190] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 02-14 15:54:12 cuda.py:192] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 02-14 15:54:12 cuda.py:227] Using XFormers backend.
INFO 02-14 15:54:13 model_runner.py:1109] Starting to load model openai/whisper-large-v3...
INFO 02-14 15:54:14 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 02-14 15:54:14 weight_utils.py:306] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:00,  9.35it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:00<00:00,  2.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.82it/s]

INFO 02-14 15:54:16 model_runner.py:1114] Loading model weights took 2.8762 GB
INFO 02-14 15:54:16 enc_dec_model_runner.py:280] Starting profile run for multi-modal models.
INFO 02-14 15:54:23 worker.py:267] Memory profiling takes 7.79 seconds
INFO 02-14 15:54:23 worker.py:267] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 02-14 15:54:23 worker.py:267] model weights take 2.88GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 16.07GiB; the rest of the memory reserved for KV Cache is 52.09GiB.
INFO 02-14 15:54:24 executor_base.py:110] # CUDA blocks: 42673, # CPU blocks: 3276
INFO 02-14 15:54:24 executor_base.py:115] Maximum concurrency for 448 tokens per request: 1524.04x
INFO 02-14 15:54:25 model_runner.py:1433] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:13<00:00,  3.89it/s]
INFO 02-14 15:54:38 model_runner.py:1561] Graph capturing finished in 14 secs, took 0.51 GiB
INFO 02-14 15:54:38 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 22.77 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/disk2/albert/vllm/examples/offline_inference/whisper.py", line 47, in <module>
[rank0]:     outputs = llm.generate(prompts, sampling_params)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xm/anaconda3/envs/myenv/lib/python3.12/site-packages/vllm/utils.py", line 1057, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xm/anaconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 438, in generate
[rank0]:     raise ValueError(" ".join(messages))
[rank0]: ValueError: LLM.generate() is only supported for (conditional) generation models (XForCausalLM, XForConditionalGeneration). Your model supports the 'generate' runner, but is currently initialized for the 'transcription' runner. Please initialize vLLM using `--task generate`.

As shown above, running the Whisper example fails because the model defaults to the 'transcription' runner, while LLM.generate() only accepts models initialized for the 'generate' runner.
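The error message itself hints at a workaround: initialize the engine with the task set explicitly. A minimal sketch, assuming the example constructs the engine roughly as below (only `model` and `task` come from the log and error above; the other arguments and the commented-out generate call are illustrative, not the actual example code):

    # Hypothetical sketch of the workaround hinted at by the ValueError above;
    # not the actual examples/offline_inference/whisper.py code.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="openai/whisper-large-v3",
        task="generate",   # override the default 'transcription' task so LLM.generate() is allowed
        max_model_len=448,
    )

    sampling_params = SamplingParams(temperature=0, max_tokens=200)
    # `prompts` would carry the audio multi-modal data exactly as the example builds it:
    # outputs = llm.generate(prompts, sampling_params)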

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
mru4913 added the bug label on Feb 14, 2025
DarkLight1337 (Member) commented:

@NickLucche I think we need to update this example to use the transcription task explicitly. Can we also update the Supported Models page accordingly?

NickLucche (Contributor) commented:

On it
