docker image opea/vllm:latest issue: only one CPU core is busy/used during vLLM inference.
The latest image reports vLLM API server version 0.1.dev1+g51f0b5f, which seems incorrect.
There is no such issue with the opea/vllm:1.2 image, whose version is "0.1.dev1+g84bee4b".
opea/vllm latest 2ab3b22443b5 18 hours ago 9.18GB
After bringing up ChatQnA on Xeon with docker compose up -d,
vLLM inference is slow. htop shows only one CPU is busy while the other CPUs are idle.
2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f
Description
[Bug]
OPEA vllm image issue: only one CPU is busy, the others are idle during vLLM inference
Reproduce steps
ChatQnA: docker compose up -d
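For reference, a minimal reproduction sketch, assuming the ChatQnA Xeon compose directory shown in the raw log below and that the usual environment variables (HF token, host IP, proxies) are already exported:

cd ~/OPEA/GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon   # path taken from the log below
docker compose up -d                                            # bring up the ChatQnA stack
docker logs -t vllm-service                                     # check the reported vLLM API server version
htop                                                            # during a query, only one CPU core shows load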
Raw log
(base) xiguiwan@k8s-worker1:~/OPEA/GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon$ docker logs -t vllm-service
2025-02-11T08:56:04.868473527Z INFO 02-11 08:56:04 __init__.py:190] Automatically detected platform cpu.
2025-02-11T08:56:05.958476995Z WARNING 02-11 08:56:05 _logger.py:72] Torch Profiler is enabled in the API server. This should ONLY be used forlocal development!
2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f
2025-02-11T08:56:06.180543307Z INFO 02-11 08:56:06 api_server.py:841] args: Namespace(host='0.0.0.0', port=80, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Meta-Llama-3-8B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-02-11T08:56:06.194637552Z INFO 02-11 08:56:06 api_server.py:206] Started engine process with PID 73
2025-02-11T08:56:12.380949277Z INFO 02-11 08:56:12 __init__.py:190] Automatically detected platform cpu.
2025-02-11T08:56:13.681194805Z WARNING 02-11 08:56:13 _logger.py:72] Torch Profiler is enabled in the API server. This should ONLY be used forlocal development!
2025-02-11T08:56:18.508111515Z INFO 02-11 08:56:18 config.py:542] This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
2025-02-11T08:56:18.508772192Z WARNING 02-11 08:56:18 config.py:678] Async output processing is not supported on the current platform type cpu.
2025-02-11T08:56:18.510404785Z WARNING 02-11 08:56:18 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
2025-02-11T08:56:18.510464490Z WARNING 02-11 08:56:18 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
2025-02-11T08:56:18.510474415Z WARNING 02-11 08:56:18 _logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
2025-02-11T08:56:24.331244845Z INFO 02-11 08:56:24 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
2025-02-11T08:56:24.331780462Z WARNING 02-11 08:56:24 config.py:678] Async output processing is not supported on the current platform type cpu.
2025-02-11T08:56:24.333213895Z WARNING 02-11 08:56:24 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
2025-02-11T08:56:24.333257077Z WARNING 02-11 08:56:24 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
2025-02-11T08:56:24.333267005Z WARNING 02-11 08:56:24 _logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
2025-02-11T08:56:24.336616167Z INFO 02-11 08:56:24 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
2025-02-11T08:56:24.339747792Z INFO 02-11 08:56:24 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+g51f0b5f) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
2025-02-11T08:56:30.329538857Z WARNING 02-11 08:56:30 _logger.py:72] Reducing Torch parallelism from 172 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
Attachments
No response
The following log line appears with opea/vllm:latest, but not with the opea/vllm:1.2 image:
2025-02-11T08:56:30.329538857Z WARNING 02-11 08:56:30 _logger.py:72] Reducing Torch parallelism from 172 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
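A possible workaround sketch, following the warning's own suggestion to set OMP_NUM_THREADS externally. The service name vllm-service comes from the compose setup above; the exact compose file layout and the value 64 are assumptions, not a confirmed fix:

cd ~/OPEA/GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon
# In the compose file, under the vllm-service environment section, add for example:
#   OMP_NUM_THREADS: 64             # example value; match the physical cores vLLM should use
#   VLLM_CPU_KVCACHE_SPACE: 40      # optional; the log above shows it defaulting to 4 (GB)
docker compose up -d --force-recreate vllm-service              # recreate only the vLLM container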
yinghu5 changed the title to "[Bug]OPEA vllm image issue, only on CPU is busy, other are idle for vllm inference" on Feb 12, 2025
Priority
P2-High
OS type
Ubuntu
Hardware type
Xeon-GNR
Installation method
Deploy method
Running nodes
Single Node
What's the version?
docker image
opea/vllm:latest