[Bug]: benchmark_serving model_id bug for lmdeploy #4001
Comments
I wouldn't call this a bug, because unlike other inference backends that use the HuggingFace model id also as the default model name for the API server, the model name for LMDeploy's API server is set on the server side and can differ from the HuggingFace id.
The benchmark script already provides the flexibility to allow users to specify the model name.
I can make a PR to make this clearer if that helps.
@ywang96 In order to make this benchmark run as expected, perhaps we can add a parameter similar to
Wouldn't this work for
Perhaps I can make the intention of
Makes sense.
It's ok.
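For reference, the kind of benchmark-side parameter discussed in the comments above could look roughly like the sketch below. The flag name `--served-model-name` is an assumption for illustration (it mirrors the vLLM server option of the same name); the actual parameter proposed in the truncated comment is not shown in this thread.

```python
# Illustrative sketch only: decouple the HuggingFace model id (used for the
# tokenizer) from the name the API server expects in requests. The flag name
# --served-model-name is an assumption, not necessarily what was proposed.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True,
                    help="HuggingFace model id, e.g. meta-llama/Llama-2-13b-chat-hf.")
parser.add_argument("--served-model-name", type=str, default=None,
                    help="Model name the API server serves under; "
                         "defaults to --model if not set.")
args = parser.parse_args()

# Use the server-side name for request payloads, the HF id for tokenization.
api_model_name = args.served_model_name or args.model
print(f"Requests will use model name: {api_model_name}")
```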
Your current environment
🐛 Describe the bug
Hi @ywang96, currently there is a small issue in `benchmarks/backend_request_func.py` when benchmarking LMDeploy with Llama-2-13b-chat-hf.
I need to change `request_func_input.model` to `llama2` here:
`vllm/benchmarks/backend_request_func.py`, line 222 (commit `f3d0bf7`)
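For context, the request body that this code builds for OpenAI-compatible backends such as LMDeploy's api_server looks roughly like the following simplified sketch (details may differ from the exact code at the permalinked line):

```python
# Simplified sketch of the payload construction in backend_request_func.py;
# not the exact code at the permalinked line.
def build_payload(model_name: str, prompt: str, output_len: int) -> dict:
    # "model" must match the name the server registered itself under,
    # e.g. "llama2" for an LMDeploy server, rather than the HuggingFace
    # id Llama-2-13b-chat-hf.
    return {
        "model": model_name,
        "prompt": prompt,
        "temperature": 0.0,
        "max_tokens": output_len,
        "stream": True,
    }
```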
After this manual modification, the benchmark runs and produces the correct results. Otherwise, the test results are incorrect because the model name is not matched correctly.
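One way to find the model name a running server actually expects is to query its OpenAI-compatible `/v1/models` endpoint, as in this sketch (the host and port here are assumptions; adjust them to your deployment):

```python
# Sketch: list the model names a running OpenAI-compatible server exposes.
# Host and port are assumptions; adjust them to your deployment.
import requests

resp = requests.get("http://localhost:23333/v1/models", timeout=10)
resp.raise_for_status()
served = [m["id"] for m in resp.json().get("data", [])]
print("Served model names:", served)  # e.g. ['llama2'] for LMDeploy
```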