[Bug] OPEA vllm image issue, only one CPU is busy, others are idle for vLLM inference #1519

Open
xiguiw opened this issue Feb 11, 2025 · 2 comments

xiguiw commented Feb 11, 2025

Priority

P2-High

OS type

Ubuntu

Hardware type

Xeon-GNR

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source
  • Other

Deploy method

  • Docker
  • Docker Compose
  • Kubernetes Helm Charts
  • Kubernetes GMC
  • Other

Running nodes

Single Node

What's the version?

The docker image opea/vllm:latest has an issue: only one CPU core is busy/used for vLLM inference.
The latest image reports vLLM API version 0.1.dev1+g51f0b5f, which seems incorrect.

There is no such issue with the opea/vllm:1.2 image, which reports version “0.1.dev1+g84bee4b”.

opea/vllm                                       latest            2ab3b22443b5   18 hours ago    9.18GB

ChatQnA Xeon:
docker compose up -d
vLLM inference is slow. htop shows only one CPU is busy; the other CPUs are idle.

2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f
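
To double-check which vLLM build is actually inside the running container (a quick sketch; it assumes the container is named vllm-service, as in the raw log below):

docker exec -it vllm-service python3 -c "import vllm; print(vllm.__version__)"
docker exec -it vllm-service pip show vllm   # if pip is available in the image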

Description

[Bug]
OPEA vllm image issue: only one CPU is busy, the others are idle during vLLM inference

Reproduce steps

ChatQnA
docker compose up -d
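
Slightly expanded (a sketch based on the directory shown in the raw log; the host port 9009 for vllm-service is an assumption, so check the compose file for the actual port mapping):

cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon
docker compose up -d
# trigger an inference request against the vLLM OpenAI-compatible endpoint
curl http://localhost:9009/v1/completions -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "What is deep learning?", "max_tokens": 32}'
# watch CPU usage while the request runs; only one core goes busy
htop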

Raw log

(base) xiguiwan@k8s-worker1:~/OPEA/GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon$ docker logs -t vllm-service
2025-02-11T08:56:04.868473527Z INFO 02-11 08:56:04 __init__.py:190] Automatically detected platform cpu.
2025-02-11T08:56:05.958476995Z WARNING 02-11 08:56:05 _logger.py:72] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f
2025-02-11T08:56:06.180543307Z INFO 02-11 08:56:06 api_server.py:841] args: Namespace(host='0.0.0.0', port=80, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Meta-Llama-3-8B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-02-11T08:56:06.194637552Z INFO 02-11 08:56:06 api_server.py:206] Started engine process with PID 73
2025-02-11T08:56:12.380949277Z INFO 02-11 08:56:12 __init__.py:190] Automatically detected platform cpu.
2025-02-11T08:56:13.681194805Z WARNING 02-11 08:56:13 _logger.py:72] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
2025-02-11T08:56:18.508111515Z INFO 02-11 08:56:18 config.py:542] This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
2025-02-11T08:56:18.508772192Z WARNING 02-11 08:56:18 config.py:678] Async output processing is not supported on the current platform type cpu.
2025-02-11T08:56:18.510404785Z WARNING 02-11 08:56:18 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
2025-02-11T08:56:18.510464490Z WARNING 02-11 08:56:18 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
2025-02-11T08:56:18.510474415Z WARNING 02-11 08:56:18 _logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
2025-02-11T08:56:24.331244845Z INFO 02-11 08:56:24 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
2025-02-11T08:56:24.331780462Z WARNING 02-11 08:56:24 config.py:678] Async output processing is not supported on the current platform type cpu.
2025-02-11T08:56:24.333213895Z WARNING 02-11 08:56:24 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
2025-02-11T08:56:24.333257077Z WARNING 02-11 08:56:24 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
2025-02-11T08:56:24.333267005Z WARNING 02-11 08:56:24 _logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
2025-02-11T08:56:24.336616167Z INFO 02-11 08:56:24 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
2025-02-11T08:56:24.339747792Z INFO 02-11 08:56:24 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+g51f0b5f) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
2025-02-11T08:56:30.329538857Z WARNING 02-11 08:56:30 _logger.py:72] Reducing Torch parallelism from 172 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.

Attachments

No response

xiguiw added the bug (Something isn't working) label on Feb 11, 2025

xiguiw commented Feb 11, 2025

The following log line appears with opea/vllm:latest, but there is no such log with the opea/vllm:1.2 image.

2025-02-11T08:56:30.329538857Z WARNING 02-11 08:56:30 _logger.py:72] Reducing Torch parallelism from 172 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
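
A possible workaround sketch (an assumption, not a confirmed fix): set the thread-related environment variables on the vllm-service entry in the ChatQnA compose file so the CPU backend can use more cores. OMP_NUM_THREADS and VLLM_CPU_KVCACHE_SPACE come from the warnings above; the values shown and VLLM_CPU_OMP_THREADS_BIND are illustrative and should be tuned to the actual machine.

  vllm-service:
    environment:
      OMP_NUM_THREADS: 64                # allow Torch/OpenMP to use more than 1 thread
      VLLM_CPU_KVCACHE_SPACE: 40         # GB of KV cache for the CPU backend (default 4, per the warning)
      VLLM_CPU_OMP_THREADS_BIND: "0-63"  # pin OpenMP worker threads to cores 0-63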


xiguiw commented Feb 11, 2025

Which vLLM version does 0.1.dev1+g51f0b5f in the opea/vllm:latest container logs correspond to?

2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f

Currently the vLLM version has been updated to v0.7.2; the released tags are:

...
v0.5.2
v0.5.3
v0.5.3.post1
v0.5.4
v0.5.5
v0.6.0
v0.6.1
v0.6.1.post1
v0.6.1.post2
v0.6.2
v0.6.3
v0.6.3.post1
v0.6.4
v0.6.4.post1
v0.6.5
v0.6.6
v0.6.6.post1
v0.7.0
v0.7.1
v0.7.2
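
The +g51f0b5f suffix looks like a git commit hash, so the build can be identified even though the version string does not match any release tag (a sketch, assuming a local clone of the vllm-project/vllm repository):

git clone https://github.com/vllm-project/vllm
cd vllm
git log -1 51f0b5f            # show the commit the image appears to be built from
git describe --tags 51f0b5f   # nearest release tag reachable from that commit, if any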

xiguiw self-assigned this on Feb 12, 2025
yinghu5 changed the title from “[Bug]OPEA vllm image issue, only on CPU is buy, other are idle for vllm inference” to “[Bug]OPEA vllm image issue, only on CPU is busy, other are idle for vllm inference” on Feb 12, 2025
yinghu5 added the aitce label on Feb 12, 2025