Your current environment

The env is fine.

🐛 Describe the bug

python vllm/examples/offline_inference/whisper.py
INFO 02-14 15:54:03 __init__.py:190] Automatically detected platform cuda.
INFO 02-14 15:54:10 config.py:548] This model supports multiple tasks: {'generate', 'transcription', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'transcription'.
INFO 02-14 15:54:10 config.py:1121] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 02-14 15:54:10 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3.dev134+g40932d7a) with config: model='openai/whisper-large-v3', speculative_config=None, tokenizer='openai/whisper-large-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openai/whisper-large-v3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":400}, use_cached_outputs=False,
INFO 02-14 15:54:12 cuda.py:190] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 02-14 15:54:12 cuda.py:192] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 02-14 15:54:12 cuda.py:227] Using XFormers backend.
INFO 02-14 15:54:13 model_runner.py:1109] Starting to load model openai/whisper-large-v3...
INFO 02-14 15:54:14 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 02-14 15:54:14 weight_utils.py:306] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:00, 9.35it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:00<00:00, 2.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.82it/s]
INFO 02-14 15:54:16 model_runner.py:1114] Loading model weights took 2.8762 GB
INFO 02-14 15:54:16 enc_dec_model_runner.py:280] Starting profile run for multi-modal models.
INFO 02-14 15:54:23 worker.py:267] Memory profiling takes 7.79 seconds
INFO 02-14 15:54:23 worker.py:267] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 02-14 15:54:23 worker.py:267] model weights take 2.88GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 16.07GiB; the rest of the memory reserved for KV Cache is 52.09GiB.
INFO 02-14 15:54:24 executor_base.py:110] # CUDA blocks: 42673, # CPU blocks: 3276
INFO 02-14 15:54:24 executor_base.py:115] Maximum concurrency for 448 tokens per request: 1524.04x
INFO 02-14 15:54:25 model_runner.py:1433] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53/53 [00:13<00:00, 3.89it/s]
INFO 02-14 15:54:38 model_runner.py:1561] Graph capturing finished in 14 secs, took 0.51 GiB
INFO 02-14 15:54:38 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 22.77 seconds
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/disk2/albert/vllm/examples/offline_inference/whisper.py", line 47, in<module>
[rank0]: outputs = llm.generate(prompts, sampling_params)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xm/anaconda3/envs/myenv/lib/python3.12/site-packages/vllm/utils.py", line 1057, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xm/anaconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 438, in generate
[rank0]: raise ValueError("".join(messages))
[rank0]: ValueError: LLM.generate() is only supported for (conditional) generation models (XForCausalLM, XForConditionalGeneration). Your model supports the 'generate' runner, but is currently initialized for the 'transcription' runner. Please initialize vLLM using `--task generate`.
As listed above, running the Whisper test case fails because the model is initialized for the 'transcription' runner instead of the 'generate' runner.
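For reference, the error message points at re-initializing with the 'generate' runner. Below is a minimal, untested sketch of that workaround for the offline API; it assumes the `LLM` constructor's `task` argument behaves like the `--task generate` CLI flag, and the encoder/decoder prompt layout only loosely mirrors the bundled whisper example, so treat it as a sketch rather than the intended fix.

```python
# Untested sketch: force the 'generate' runner as the error message suggests.
# Assumes LLM(task=...) mirrors the --task CLI flag and that the explicit
# encoder/decoder prompt layout below matches what whisper.py builds.
import os

# Optional: the warning in the log above recommends FlashInfer for the FP8 KV
# cache (only relevant if flashinfer is actually installed).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(
    model="openai/whisper-large-v3",
    task="generate",            # instead of the default 'transcription' runner
    max_model_len=448,
    kv_cache_dtype="fp8",
)

prompt = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

outputs = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=200))
print(outputs[0].outputs[0].text)
```

I have not verified that the 'generate' runner actually produces a correct transcription this way.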
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.