[BUG🐛] The GPU is not being utilized and is almost idle, while the CPU is fully occupied #48

Open
zwong91 opened this issue Dec 22, 2024 · 2 comments
Labels: enhancement (New feature or request)

zwong91 commented Dec 22, 2024

Bug Description

During speech generation the GPU sits almost idle while the CPU is fully occupied, so generation is much slower than expected (see the metrics below: roughly 3 tokens/s and about 6.9 s of compute per second of audio generated).

Minimal Reproducible Example

import time
from auralis import TTS, TTSRequest

# Initialize the model
tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt', device_map='cuda:0')

start_time = time.time()

# Prepare the request (Chinese input text; roughly: "Life is like a journey; the scenery
# along the way may be beautiful or plain, but every step is worth cherishing. However
# winding the road ahead, hold on to hope and move forward bravely. Learn gratitude,
# notice each small happiness, and find meaning in the ordinary.")
request = TTSRequest(
    text="生活如一场旅行,沿途的风景或许美丽,或许平淡,但每一步都值得珍惜。无论前方的路途多么曲折,我们都要怀揣希望,勇敢前行。学会感恩,感受每一个微小的幸福,才能在平凡中找到意义。",
    speaker_files=['1.wav']
)

# Generate speech and measure execution time
output = tts.generate_speech(request)
end_time = time.time()

# Save output and print execution time
output.save('hello.wav')
print(f"Execution Time: {end_time - start_time:.2f} seconds")

Expected Behavior

The GPU should be doing the bulk of the work during generation, and generation should be correspondingly faster than the ~18 seconds observed for this short text.

Actual Behavior

root@0dc92bcb0092:/workspace/Auralis# python 1.py
08:17:40.755 | XTTSv2.py:75 | ℹ️ INFO | Initializing XTTSv2Engine...
08:17:42.205 | XTTSv2.py:229 | ℹ️ INFO | Initializing VLLM engine with args: AsyncEngineArgs(model='AstraMindAI/xtts2-gpt', served_model_name=None, tokenizer='AstraMindAI/xtts2-gpt', task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=True, allowed_local_media_path='', download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=1047, worker_use_ray=False, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.1571125854166545, max_num_batched_tokens=10470, max_num_seqs=10, max_logprobs=20, disable_log_stats=True, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'audio': 1}, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False)
08:17:43.615 | logger.py:89 | ℹ️ INFO | Downcasting torch.float32 to torch.float16.
08:17:43.616 | logger.py:89 | ⚠️ WARNING | To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
08:17:43.616 | logger.py:89 | ℹ️ INFO | Initializing an LLM engine (v0.6.4.post1) with config: model='AstraMindAI/xtts2-gpt', speculative_config=None, tokenizer='AstraMindAI/xtts2-gpt', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1047, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=AstraMindAI/xtts2-gpt, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
08:17:44.640 | logger.py:89 | ℹ️ INFO | Using Flash Attention backend.
08:17:44.868 | logger.py:89 | ℹ️ INFO | Starting to load model AstraMindAI/xtts2-gpt...
08:17:45.139 | logger.py:89 | ℹ️ INFO | Using model weights format ['*.safetensors']
08:17:45.366 | logger.py:89 | ℹ️ INFO | No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.27s/it]

08:17:47.942 | logger.py:89 | ℹ️ INFO | Loading model weights took 0.7099 GB
08:17:48.407 | logger.py:89 | ℹ️ INFO | Memory profiling results: total_gpu_memory=23.64GiB initial_memory_usage=3.06GiB peak_torch_memory=1.00GiB memory_usage_post_profile=3.09GiB non_torch_memory=2.38GiB kv_cache_size=0.34GiB gpu_memory_utilization=0.16
08:17:48.625 | logger.py:89 | ℹ️ INFO | # GPU blocks: 186, # CPU blocks: 2184
08:17:48.625 | logger.py:89 | ℹ️ INFO | Maximum concurrency for 1047 tokens per request: 2.84x
08:17:56.146 | two_phase_scheduler.py:131 | ℹ️ INFO | Starting request 6c8e757e69c446f9b8589d77404d5056
08:18:00.478 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_0.
08:18:00.478 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_1.
08:18:02.471 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_1.
08:18:02.480 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_1_logits.
08:18:02.525 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_1_logits.
08:18:09.060 | performance.py:142 | ℹ️ INFO | Generation metrics | Throughput: 0.03 req/s | 3.1 tokens/s | Latency: 6875ms per second of audio generated
08:18:12.433 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_0.
08:18:12.434 | logger.py:89 | ℹ️ INFO | Added request 6c8e757e69c446f9b8589d77404d5056_0_logits.
08:18:12.482 | logger.py:89 | ℹ️ INFO | Finished request 6c8e757e69c446f9b8589d77404d5056_0_logits.
08:18:12.711 | two_phase_scheduler.py:140 | ℹ️ INFO | Request 6c8e757e69c446f9b8589d77404d5056 completed
Execution Time: 18.08 seconds
[rank0]:[W1222 08:18:13.491462637 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

Error Logs

None; the run completes without errors. The full console output is shown under Actual Behavior above (the only warning is the NCCL destroy_process_group notice at exit).

Environment

RunPod Pytorch 2.4.0
ID: 2obsncspfqngvl
1 x RTX 4090
9 vCPU 50 GB RAM
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
On-Demand - Community Cloud

# OS Information
uname -a

# Python version
python --version

# Installed Python packages
pip list

# GPU Information (if applicable)
nvidia-smi

# CUDA version (if applicable)
nvcc --version
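
The command outputs were not captured above; a small Python sketch that prints roughly the same details from inside the container (it assumes torch is installed, as in the RunPod image):

# Sketch only: reports OS, Python, PyTorch, and GPU information.
import platform
import sys

import torch

print("OS:", platform.platform())
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA (torch build):", torch.version.cuda)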

Possible Solutions

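No concrete fix to propose. One diagnostic worth running (a sketch only, reusing the tts and request objects from the example above and no API beyond what is already shown there): warm the model up with one call, then time a second generate_speech() call, to separate any one-time setup cost from steady-state generation speed.

# Sketch only: assumes the tts and request objects from the reproducible example above.
import time

# Warm-up call (the first request may include one-time setup work)
tts.generate_speech(request)

# Time a second, steady-state generation
t0 = time.time()
output = tts.generate_speech(request)
print(f"Steady-state generation time: {time.time() - t0:.2f} seconds")

If the second call is not noticeably faster, the ~18 seconds is dominated by per-request CPU work rather than one-time setup.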

Additional Information

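For reference, one way to confirm that the GPU stays near idle while generate_speech() runs is to sample nvidia-smi from a background thread during the call. A minimal sketch (assuming nvidia-smi is on PATH and reusing the tts and request objects from the example above):

# Sketch only: polls GPU utilization via nvidia-smi while generation runs.
import subprocess
import threading
import time

samples = []
stop = threading.Event()

def poll_gpu_util(interval=0.5):
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(int(out.stdout.strip().splitlines()[0]))
        time.sleep(interval)

t = threading.Thread(target=poll_gpu_util, daemon=True)
t.start()
output = tts.generate_speech(request)  # same call as in the example above
stop.set()
t.join()
print(f"GPU util (%): min={min(samples)} max={max(samples)} avg={sum(samples)/len(samples):.1f}")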

zwong91 added the bug (Something isn't working) label on Dec 22, 2024
mlinmg (Contributor) commented Dec 23, 2024

The current scheduler isn't fully optimized. With the new scheduler, GPU utilization will be much better once #22 is merged.

mlinmg added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Dec 23, 2024
Nils-Lopez commented

Hi, are you still planning on merging it?
