Performance issue comparing sglang to vllm. #169

Closed
findalexli opened this issue Feb 8, 2024 · 5 comments

@findalexli

Hi there, amazing work on RadixAttention and JSON-constrained decoding. I am running into an unexpected performance issue when comparing SGLang and vLLM. I am using the latest pip release of vLLM and a git clone of SGLang as of today.

Here is my command to launch SGLang:
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8

Here is my command to launch vLLM:

python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8

Both are running in the same conda environment with CUDA 12.1, on 8x A10G on AWS.
Here is the OpenAI-compatible curl request:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
}'
The SGLang one gives me 10 seconds of latency, while vLLM gives 0.45 seconds. The numbers were taken after the first run to avoid any cold-start issues.

@comaniac
Contributor

comaniac commented Feb 8, 2024

10 seconds looks weird. Is it consistently 10 seconds if you run the same request multiple times? And what's the log on the server side?

@hnyls2002
Collaborator

I also tried your case with 8x A10G on AWS and CUDA 12.2, running NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO on the latest main branch.

The script is

curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ],
    "max_tokens": 100
}'

[screen recording attached: Feb-09-2024 11-23-11]

It takes me less than one second to get the answer. There may be some unnoticed problem; please provide more details or the server-side output so we can help you better.

@findalexli
Author

findalexli commented Feb 9, 2024

Hi there, I just pulled the latest changes, and it is a lot faster now.

Here are the results:

Running

curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
}'

I am getting 0.57 seconds (over 3 runs), which is still a bit slower than sending the same curl command to vLLM, which sits at 0.45 seconds.

I also ran the following Python script:

import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

The result is consistently around 1.3 seconds, which is almost 3 times slower than using curl.

All of the above numbers were measured over at least 5 runs.
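
For reference, a minimal sketch of how this kind of client-side latency can be measured (the warm-up call and 5-run averaging are assumptions for illustration, not taken from the report; the endpoint and request parameters are copied from the script above):

import time
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

def one_call():
    # Same request as in the script above.
    return client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant"},
            {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
        ],
        temperature=0,
        max_tokens=64,
    )

one_call()  # warm-up call to exclude cold-start and connection setup

latencies = []
for _ in range(5):
    start = time.perf_counter()
    one_call()
    latencies.append(time.perf_counter() - start)

print(f"mean latency over {len(latencies)} runs: "
      f"{sum(latencies) / len(latencies):.3f}s")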

@comaniac
Contributor

Since your prompt is pretty short, it's likely that this request cannot benefit much from RadixAttention. In this case, since vLLM enables CUDA graphs, it might be faster in terms of prefill computation. You can try a longer prompt (e.g., >500 tokens) to see if that is still the case.
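
As one possible way to follow this suggestion, here is a minimal sketch (assumed, not from the thread) that builds a prompt well over 500 tokens by repeating a filler sentence and times a single request against the SGLang server above:

import time
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Roughly 2,000 tokens of filler (assuming ~10 tokens per repetition),
# comfortably above the 500-token mark suggested above.
long_prompt = ("Summarize the following text in one sentence. "
               + "The quick brown fox jumps over the lazy dog. " * 200)

start = time.perf_counter()
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": long_prompt}],
    temperature=0,
    max_tokens=64,
)
print(f"latency with a long prompt: {time.perf_counter() - start:.2f}s")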

@merrymercy
Contributor

merrymercy commented Feb 11, 2024

@findalexli SGLang is mainly optimized for high-throughput large-batch serving, especially for requests with many shared prefixes.
However, in your case, you benchmarked the latency of a single short prompt, which is not what SGLang is optimized for. To obtain more realistic results, you may want to run your own dataset with larger batch sizes.
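
A minimal sketch of what such a larger-batch run might look like (the concurrency level of 64, the shared system prompt, and the request contents are assumptions for illustration, not from the thread):

import time
import openai
from concurrent.futures import ThreadPoolExecutor

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "You are a helpful AI assistant"  # shared prefix across all requests

def one_request(i):
    # Every request shares the same system prompt, so the prefix can be reused.
    return client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"List {i % 10 + 1} countries and their capitals."},
        ],
        temperature=0,
        max_tokens=64,
    )

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(one_request, range(64)))
elapsed = time.perf_counter() - start

print(f"{len(results)} requests in {elapsed:.2f}s "
      f"({len(results) / elapsed:.2f} req/s)")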

Another factor is that there is a recent PR in vLLM (vllm-project/vllm#2542) that introduced some fused kernels to improve MoE inference. We can bring it to our code as well. (#179)
