[Bug] OPEA vllm image issue, only one CPU is busy, others are idle for vLLM inference #1519

Open
xiguiw opened this issue Feb 11, 2025 · 2 comments

xiguiw commented Feb 11, 2025

Priority

P2-High

OS type

Ubuntu

Hardware type

Xeon-GNR

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source
  • Other

Deploy method

  • Docker
  • Docker Compose
  • Kubernetes Helm Charts
  • Kubernetes GMC
  • Other

Running nodes

Single Node

What's the version?

The docker image opea/vllm:latest has an issue: only one CPU core is busy/used for vLLM inference.
The latest image reports vLLM API version 0.1.dev1+g51f0b5f, which seems incorrect.

There is no such issue with the opea/vllm:1.2 image, which reports version “0.1.dev1+g84bee4b”.

opea/vllm                                       latest            2ab3b22443b5   18 hours ago    9.18GB

ChatQnA Xeon:
docker compose up -d
vLLM inference is slow. htop shows only one CPU is busy; the other CPUs are idle.

2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f
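
To double-check which vLLM build is actually inside the running container (a quick sketch; it assumes the container is named vllm-service, as in the raw log below):

docker exec -it vllm-service python3 -c "import vllm; print(vllm.__version__)"
docker exec -it vllm-service pip show vllm   # if pip is available in the image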

Description

[Bug]
OPEA vllm image issue: only one CPU is busy, the others are idle during vLLM inference

Reproduce steps

ChatQnA
docker compose up -d
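
Slightly expanded (a sketch based on the directory shown in the raw log; the host port 9009 for vllm-service is an assumption, so check the compose file for the actual port mapping):

cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon
docker compose up -d
# trigger an inference request against the vLLM OpenAI-compatible endpoint
curl http://localhost:9009/v1/completions -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "What is deep learning?", "max_tokens": 32}'
# watch CPU usage while the request runs; only one core goes busy
htop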

Raw log

(base) xiguiwan@k8s-worker1:~/OPEA/GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon$ docker logs -t vllm-service
2025-02-11T08:56:04.868473527Z INFO 02-11 08:56:04 __init__.py:190] Automatically detected platform cpu.
2025-02-11T08:56:05.958476995Z WARNING 02-11 08:56:05 _logger.py:72] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f
2025-02-11T08:56:06.180543307Z INFO 02-11 08:56:06 api_server.py:841] args: Namespace(host='0.0.0.0', port=80, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Meta-Llama-3-8B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-02-11T08:56:06.194637552Z INFO 02-11 08:56:06 api_server.py:206] Started engine process with PID 73
2025-02-11T08:56:12.380949277Z INFO 02-11 08:56:12 __init__.py:190] Automatically detected platform cpu.
2025-02-11T08:56:13.681194805Z WARNING 02-11 08:56:13 _logger.py:72] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
2025-02-11T08:56:18.508111515Z INFO 02-11 08:56:18 config.py:542] This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
2025-02-11T08:56:18.508772192Z WARNING 02-11 08:56:18 config.py:678] Async output processing is not supported on the current platform type cpu.
2025-02-11T08:56:18.510404785Z WARNING 02-11 08:56:18 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
2025-02-11T08:56:18.510464490Z WARNING 02-11 08:56:18 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
2025-02-11T08:56:18.510474415Z WARNING 02-11 08:56:18 _logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
2025-02-11T08:56:24.331244845Z INFO 02-11 08:56:24 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
2025-02-11T08:56:24.331780462Z WARNING 02-11 08:56:24 config.py:678] Async output processing is not supported on the current platform type cpu.
2025-02-11T08:56:24.333213895Z WARNING 02-11 08:56:24 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
2025-02-11T08:56:24.333257077Z WARNING 02-11 08:56:24 _logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
2025-02-11T08:56:24.333267005Z WARNING 02-11 08:56:24 _logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
2025-02-11T08:56:24.336616167Z INFO 02-11 08:56:24 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
2025-02-11T08:56:24.339747792Z INFO 02-11 08:56:24 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+g51f0b5f) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
2025-02-11T08:56:30.329538857Z WARNING 02-11 08:56:30 _logger.py:72] Reducing Torch parallelism from 172 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.

Attachments

No response

xiguiw added the bug (Something isn't working) label on Feb 11, 2025

xiguiw commented Feb 11, 2025

The following log line appears with opea/vllm:latest, but there is no such log with the opea/vllm:1.2 image.

2025-02-11T08:56:30.329538857Z WARNING 02-11 08:56:30 _logger.py:72] Reducing Torch parallelism from 172 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
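
A possible workaround sketch (an assumption, not a confirmed fix): set the thread-related environment variables on the vllm-service entry in the ChatQnA compose file so the CPU backend can use more cores. OMP_NUM_THREADS and VLLM_CPU_KVCACHE_SPACE come from the warnings above; the values shown and VLLM_CPU_OMP_THREADS_BIND are illustrative and should be tuned to the actual machine.

  vllm-service:
    environment:
      OMP_NUM_THREADS: 64                # allow Torch/OpenMP to use more than 1 thread
      VLLM_CPU_KVCACHE_SPACE: 40         # GB of KV cache for the CPU backend (default 4, per the warning)
      VLLM_CPU_OMP_THREADS_BIND: "0-63"  # pin OpenMP worker threads to cores 0-63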


xiguiw commented Feb 11, 2025

Which vLLM version does 0.1.dev1+g51f0b5f in the opea/vllm:latest container logs correspond to?

2025-02-11T08:56:06.179926647Z INFO 02-11 08:56:06 api_server.py:840] vLLM API server version 0.1.dev1+g51f0b5f

Currently the vLLM version has been updated to v0.7.2; the released tags are:

...
v0.5.2
v0.5.3
v0.5.3.post1
v0.5.4
v0.5.5
v0.6.0
v0.6.1
v0.6.1.post1
v0.6.1.post2
v0.6.2
v0.6.3
v0.6.3.post1
v0.6.4
v0.6.4.post1
v0.6.5
v0.6.6
v0.6.6.post1
v0.7.0
v0.7.1
v0.7.2
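
The +g51f0b5f suffix looks like a git commit hash, so the build can be identified even though the version string does not match any release tag (a sketch, assuming a local clone of the vllm-project/vllm repository):

git clone https://github.com/vllm-project/vllm
cd vllm
git log -1 51f0b5f            # show the commit the image appears to be built from
git describe --tags 51f0b5f   # nearest release tag reachable from that commit, if any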

xiguiw self-assigned this on Feb 12, 2025
yinghu5 changed the title from “[Bug]OPEA vllm image issue, only on CPU is buy, other are idle for vllm inference” to “[Bug]OPEA vllm image issue, only on CPU is busy, other are idle for vllm inference” on Feb 12, 2025
yinghu5 added the aitce label on Feb 12, 2025