
enable --gpu-memory-utilization in benchmark_throughput.py #3175

Merged — 1 commit merged into vllm-project:main on Mar 4, 2024

Conversation

AllenDou (Contributor) commented on Mar 4, 2024

My hardware is a single machine with 4 V100 (16 GB) cards. When I run
python3 benchmark_throughput.py --backend vllm --model /root/opt-6.7b/ --dataset /root/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --tensor-parallel-size 4

(RayWorkerVllm pid=52049) INFO 03-04 16:58:02 model_runner.py:692] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. [repeated 2x across cluster]
Processed prompts:   0%|                                      | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/vllm/benchmarks/benchmark_throughput.py", line 348, in <module>
    main(args)
  File "/root/vllm/benchmarks/benchmark_throughput.py", line 211, in main
    elapsed_time = run_vllm(
  File "/root/vllm/benchmarks/benchmark_throughput.py", line 113, in run_vllm
    llm._run_engine(use_tqdm=True)
  File "/root/vllm/vllm/entrypoints/llm.py", line 198, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/root/vllm/vllm/engine/llm_engine.py", line 842, in step
    all_outputs = self._run_workers(
  File "/root/vllm/vllm/engine/llm_engine.py", line 1045, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/vllm/vllm/worker/model_runner.py", line 594, in execute_model
    output = self.model.sample(
  File "/root/vllm/vllm/model_executor/models/opt.py", line 313, in sample
    next_tokens = self.sampler(self.lm_head_weight, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/vllm/vllm/model_executor/layers/sampler.py", line 115, in forward
    sample_results = _sample(probs, logprobs, sampling_metadata)
  File "/root/vllm/vllm/model_executor/layers/sampler.py", line 417, in _sample
    multinomial_samples[sampling_type] = _multinomial(
  File "/root/vllm/vllm/model_executor/layers/sampler.py", line 364, in _multinomial
    q = torch.empty_like(probs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 15.77 GiB of which 47.62 MiB is free. Including non-PyTorch memory, this process has 15.71 GiB memory in use. Of the allocated memory 12.87 GiB is allocated by PyTorch, and 234.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(RayWorkerVllm pid=52049) INFO 03-04 16:58:10 model_runner.py:760] Graph capturing finished in 8 secs. [repeated 2x across cluster]

If I pass a proper gpu-memory-utilization value to the LLM class through benchmark_throughput.py's arguments, the error disappears.
So I'm not sure whether something is wrong with benchmark_throughput.py or whether I simply used the wrong arguments.
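
For reference, a minimal sketch of how the flag could be plumbed through benchmark_throughput.py; the structure below is illustrative and the merged change may differ in detail, but gpu_memory_utilization is the existing LLM constructor argument (default 0.9):

# Illustrative sketch only: wire a --gpu-memory-utilization flag through
# benchmark_throughput.py and pass it to the LLM constructor. The merged
# commit may structure this differently.
import argparse

from vllm import LLM

def run_vllm(args: argparse.Namespace) -> None:
    llm = LLM(
        model=args.model,
        tensor_parallel_size=args.tensor_parallel_size,
        # Fraction of each GPU's memory vLLM may reserve for weights and
        # KV cache; lowering it leaves headroom for sampling buffers and
        # CUDA graph capture.
        gpu_memory_utilization=args.gpu_memory_utilization,
    )
    # ... build the prompt list from the dataset and call llm.generate(...)
    #     exactly as the benchmark already does.

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    parser.add_argument("--gpu-memory-utilization", type=float, default=0.9,
                        help="Fraction of GPU memory vLLM is allowed to use.")
    run_vllm(parser.parse_args())

With such a flag in place, the original run can cap per-GPU usage, for example by appending --gpu-memory-utilization 0.85 to the benchmark_throughput.py invocation above.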

simon-mo merged commit 9cbc7e5 into vllm-project:main on Mar 4, 2024
22 checks passed
AllenDou deleted the benchmark_args branch on March 5, 2024 at 02:00
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request on Mar 26, 2024