
enable --gpu-memory-utilization in benchmark_throughput.py #3175

Merged — 1 commit merged into vllm-project:main on Mar 4, 2024

Conversation

AllenDou (Contributor) commented on Mar 4, 2024

My hardware is a single machine with 4 V100 (16 GB) cards. When I run
python3 benchmark_throughput.py --backend vllm --model /root/opt-6.7b/ --dataset /root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --tensor-parallel-size 4

(RayWorkerVllm pid=52049) INFO 03-04 16:58:02 model_runner.py:692] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. [repeated 2x across cluster]
Processed prompts:   0%|                                      | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/vllm/benchmarks/benchmark_throughput.py", line 348, in <module>
    main(args)
  File "/root/vllm/benchmarks/benchmark_throughput.py", line 211, in main
    elapsed_time = run_vllm(
  File "/root/vllm/benchmarks/benchmark_throughput.py", line 113, in run_vllm
    llm._run_engine(use_tqdm=True)
  File "/root/vllm/vllm/entrypoints/llm.py", line 198, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/root/vllm/vllm/engine/llm_engine.py", line 842, in step
    all_outputs = self._run_workers(
  File "/root/vllm/vllm/engine/llm_engine.py", line 1045, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/model_runner.py", line 594, in execute_model
    output = self.model.sample(
  File "/root/vllm/vllm/model_executor/models/opt.py", line 313, in sample
    next_tokens = self.sampler(self.lm_head_weight, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/vllm/vllm/model_executor/layers/sampler.py", line 115, in forward
    sample_results = _sample(probs, logprobs, sampling_metadata)
  File "/root/vllm/vllm/model_executor/layers/sampler.py", line 417, in _sample
    multinomial_samples[sampling_type] = _multinomial(
  File "/root/vllm/vllm/model_executor/layers/sampler.py", line 364, in _multinomial
    q = torch.empty_like(probs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 15.77 GiB of which 47.62 MiB is free. Including non-PyTorch memory, this process has 15.71 GiB memory in use. Of the allocated memory 12.87 GiB is allocated by PyTorch, and 234.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(RayWorkerVllm pid=52049) INFO 03-04 16:58:10 model_runner.py:760] Graph capturing finished in 8 secs. [repeated 2x across cluster]

If I pass a proper gpu-memory-utilization value to the LLM class through benchmark_throughput.py's arguments, the error disappears.
So I'm not sure whether something is wrong with benchmark_throughput.py or whether I simply used the wrong arguments.
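
For reference, a minimal sketch of how the flag could be plumbed through benchmark_throughput.py; the structure below is illustrative and the merged change may differ in detail, but gpu_memory_utilization is the existing LLM constructor argument (default 0.9):

# Illustrative sketch only: wire a --gpu-memory-utilization flag through
# benchmark_throughput.py and pass it to the LLM constructor. The merged
# commit may structure this differently.
import argparse

from vllm import LLM

def run_vllm(args: argparse.Namespace) -> None:
    llm = LLM(
        model=args.model,
        tensor_parallel_size=args.tensor_parallel_size,
        # Fraction of each GPU's memory vLLM may reserve for weights and
        # KV cache; lowering it leaves headroom for sampling buffers and
        # CUDA graph capture.
        gpu_memory_utilization=args.gpu_memory_utilization,
    )
    # ... build the prompt list from the dataset and call llm.generate(...)
    #     exactly as the benchmark already does.

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    parser.add_argument("--gpu-memory-utilization", type=float, default=0.9,
                        help="Fraction of GPU memory vLLM is allowed to use.")
    run_vllm(parser.parse_args())

With such a flag in place, the original run can cap per-GPU usage, for example by appending --gpu-memory-utilization 0.85 to the benchmark_throughput.py invocation above.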

simon-mo merged commit 9cbc7e5 into vllm-project:main on Mar 4, 2024
22 checks passed
AllenDou deleted the benchmark_args branch on March 5, 2024 at 02:00
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request on Mar 26, 2024