[BUG🐛] The GPU is not being utilized and is almost idle, while the CPU is fully occupied #48

Open
zwong91 opened this issue Dec 22, 2024 · 2 comments
Labels: enhancement (New feature or request)

zwong91 commented Dec 22, 2024

Bug Description

During speech generation the GPU sits almost idle while the CPU is fully occupied, so generation is much slower than expected (see the metrics below: roughly 3 tokens/s and about 6.9 s of compute per second of audio generated).

Minimal Reproducible Example

import time
from auralis import TTS, TTSRequest

# Initialize the model
tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt', device_map='cuda:0')

start_time = time.time()

# Prepare the request (Chinese input text; roughly: "Life is like a journey; the scenery
# along the way may be beautiful or plain, but every step is worth cherishing. However
# winding the road ahead, hold on to hope and move forward bravely. Learn gratitude,
# notice each small happiness, and find meaning in the ordinary.")
request = TTSRequest(
    text="生活如一场旅行,沿途的风景或许美丽,或许平淡,但每一步都值得珍惜。无论前方的路途多么曲折,我们都要怀揣希望,勇敢前行。学会感恩,感受每一个微小的幸福,才能在平凡中找到意义。",
    speaker_files=['1.wav']
)

# Generate speech and measure execution time
output = tts.generate_speech(request)
end_time = time.time()

# Save output and print execution time
output.save('hello.wav')
print(f"Execution Time: {end_time - start_time:.2f} seconds")

Expected Behavior

The GPU should be doing the bulk of the work during generation, and generation should be correspondingly faster than the ~18 seconds observed for this short text.

Actual Behavior

root@0dc92bcb0092:/workspace/Auralis# python 1.py
08:17:40.755 | XTTSv2.py:75 | ℹ️ INFO | Initializing XTTSv2Engine...
08:17:42.205 | XTTSv2.py:229 | ℹ️ INFO | Initializing VLLM engine with args: AsyncEngineArgs(model='AstraMindAI/xtts2-gpt', served_model_name=None, tokenizer='AstraMindAI/xtts2-gpt', task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=True, allowed_local_media_path='', download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=1047, worker_use_ray=False, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.1571125854166545, max_num_batched_tokens=10470, max_num_seqs=10, max_logprobs=20, disable_log_stats=True, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'audio': 1}, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False)
08:17:43.615 | logger.py:89 | ℹ️ INFO | Downcasting torch.float32 to torch.float16.
08:17:43.616 | logger.py:89 | ⚠️ WARNING | To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
08:17:43.616 | logger.py:89 | ℹ️ INFO | Initializing an LLM engine (v0.6.4.post1) with config: model='AstraMindAI/xtts2-gpt', speculative_config=None, tokenizer='AstraMindAI/xtts2-gpt', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1047, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=AstraMindAI/xtts2-gpt, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
08:17:44.640 | logger.py:89 | ℹ️ INFO | Using Flash Attention backend.
08:17:44.868 | logger.py:89 | ℹ️ INFO | Starting to load model AstraMindAI/xtts2-gpt...
08:17:45.139 | logger.py:89 | ℹ️ INFO | Using model weights format ['*.safetensors']
08:17:45.366 | logger.py:89 | ℹ️ INFO | No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.27s/it]

08:17:47.942 | logger.py:89 | ℹ️ INFO | Loading model weights took 0.7099 GB
08:17:48.407 | logger.py:89 | ℹ️ INFO | Memory profiling results: total_gpu_memory=23.64GiB initial_memory_usage=3.06GiB peak_torch_memory=1.00GiB memory_usage_post_profile=3.09GiB non_torch_memory=2.38GiB kv_cache_size=0.34GiB gpu_memory_utilization=0.16
08:17:48.625 | logger.py:89 | ℹ️ INFO | # GPU blocks: 186, # CPU blocks: 2184
08:17:48.625 | logger.py:89 | ℹ️ INFO | Maximum concurrency for 1047 tokens per request: 2.84x
08:17:56.146 | two_phase_scheduler.py:131 | ℹ️ INFO | Starting request 6c8e757e69c446f9b8589d77404d5056
08:18:00.478 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_0.
08:18:00.478 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_1.
08:18:02.471 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_1.
08:18:02.480 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_1_logits.
08:18:02.525 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_1_logits.
08:18:09.060 | performance.py:142 | ℹ️ INFO | Generation metrics | Throughput: 0.03 req/s | 3.1 tokens/s | Latency: 6875ms per second of audio generated
08:18:12.433 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_0.
08:18:12.434 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_0_logits.
08:18:12.482 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_0_logits.
08:18:12.711 | two_phase_scheduler.py:140 | ℹ️ INFO | Request 6c8e757e69c446f9b8589d77404d5056 completed
Execution Time: 18.08 seconds
[rank0]:[W1222 08:18:13.491462637 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Error Logs

None; the run completes without errors. The full console output is shown under Actual Behavior above (the only warning is the NCCL destroy_process_group notice at exit).

Environment

RunPod Pytorch 2.4.0
ID: 2obsncspfqngvl
1 x RTX 4090
9 vCPU 50 GB RAM
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
On-Demand - Community Cloud

# OS Information
uname -a

# Python version
python --version

# Installed Python packages
pip list

# GPU Information (if applicable)
nvidia-smi

# CUDA version (if applicable)
nvcc --version
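
The command outputs were not captured above; a small Python sketch that prints roughly the same details from inside the container (it assumes torch is installed, as in the RunPod image):

# Sketch only: reports OS, Python, PyTorch, and GPU information.
import platform
import sys

import torch

print("OS:", platform.platform())
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA (torch build):", torch.version.cuda)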

Possible Solutions

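No concrete fix to propose. One diagnostic worth running (a sketch only, reusing the tts and request objects from the example above and no API beyond what is already shown there): warm the model up with one call, then time a second generate_speech() call, to separate any one-time setup cost from steady-state generation speed.

# Sketch only: assumes the tts and request objects from the reproducible example above.
import time

# Warm-up call (the first request may include one-time setup work)
tts.generate_speech(request)

# Time a second, steady-state generation
t0 = time.time()
output = tts.generate_speech(request)
print(f"Steady-state generation time: {time.time() - t0:.2f} seconds")

If the second call is not noticeably faster, the ~18 seconds is dominated by per-request CPU work rather than one-time setup.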

Additional Information

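For reference, one way to confirm that the GPU stays near idle while generate_speech() runs is to sample nvidia-smi from a background thread during the call. A minimal sketch (assuming nvidia-smi is on PATH and reusing the tts and request objects from the example above):

# Sketch only: polls GPU utilization via nvidia-smi while generation runs.
import subprocess
import threading
import time

samples = []
stop = threading.Event()

def poll_gpu_util(interval=0.5):
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(int(out.stdout.strip().splitlines()[0]))
        time.sleep(interval)

t = threading.Thread(target=poll_gpu_util, daemon=True)
t.start()
output = tts.generate_speech(request)  # same call as in the example above
stop.set()
t.join()
print(f"GPU util (%): min={min(samples)} max={max(samples)} avg={sum(samples)/len(samples):.1f}")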

zwong91 added the bug (Something isn't working) label on Dec 22, 2024
mlinmg (Contributor) commented Dec 23, 2024

The current scheduler isn't fully optimized. With the new scheduler, GPU utilization will be much better once #22 is merged.

mlinmg added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Dec 23, 2024
Nils-Lopez commented

Hi, are you still planning on merging it?
