Performance issue comparing sglang to vllm. #169

Closed
findalexli opened this issue Feb 8, 2024 · 5 comments

@findalexli

Hi there, amazing work on RadixAttention and JSON-constrained decoding. I am running into an unexpected performance issue when comparing SGLang and vLLM. I am using the latest pip release of vLLM and a git clone of SGLang as of today.

Here is my command to launch SGLang:
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8

Here is my command to launch vLLM:

python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8

Both are running in the same conda environment with CUDA 12.1, on 8x A10G on AWS.
Here is the OpenAI-compatible curl request:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
}'
The SGLang one gives me 10 seconds of latency, while vLLM gives 0.45 seconds. The numbers were taken after the first run to avoid any cold-start issues.

@comaniac
Contributor

comaniac commented Feb 8, 2024

10 seconds looks weird. Is it consistently 10 seconds if you run the same request multiple times? And what's the log on the server side?

@hnyls2002
Collaborator

I also tried your case with 8x A10G on AWS and CUDA 12.2, running NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO on the latest main branch.

The script is

curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ],
    "max_tokens": 100
}'

[screen recording attached: Feb-09-2024 11-23-11]

It takes me less than one second to get the answer. There may be some unnoticed problem; please provide more details or the server-side output so we can help you better.

@findalexli
Author

findalexli commented Feb 9, 2024

Hi there, I just pulled the latest changes, and it is a lot faster now.

Here are the results:

Running

curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
}'

I am getting 0.57 seconds (over 3 runs), which is still a bit slower than sending the same curl command to vLLM, which sits at 0.45 seconds.

I also ran the following Python script:

import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

The result is consistently around 1.3 seconds, which is almost 3 times slower than using curl.

All of the above numbers were measured over at least 5 runs.
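
For reference, a minimal sketch of how this kind of client-side latency can be measured (the warm-up call and 5-run averaging are assumptions for illustration, not taken from the report; the endpoint and request parameters are copied from the script above):

import time
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

def one_call():
    # Same request as in the script above.
    return client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant"},
            {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
        ],
        temperature=0,
        max_tokens=64,
    )

one_call()  # warm-up call to exclude cold-start and connection setup

latencies = []
for _ in range(5):
    start = time.perf_counter()
    one_call()
    latencies.append(time.perf_counter() - start)

print(f"mean latency over {len(latencies)} runs: "
      f"{sum(latencies) / len(latencies):.3f}s")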

@comaniac
Contributor

Since your prompt is pretty short, it's likely that this request cannot benefit much from RadixAttention. In this case, since vLLM enables CUDA graphs, it might be faster in terms of prefill computation. You can try a longer prompt (e.g., >500 tokens) to see if that is still the case.
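
As one possible way to follow this suggestion, here is a minimal sketch (assumed, not from the thread) that builds a prompt well over 500 tokens by repeating a filler sentence and times a single request against the SGLang server above:

import time
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Roughly 2,000 tokens of filler (assuming ~10 tokens per repetition),
# comfortably above the 500-token mark suggested above.
long_prompt = ("Summarize the following text in one sentence. "
               + "The quick brown fox jumps over the lazy dog. " * 200)

start = time.perf_counter()
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": long_prompt}],
    temperature=0,
    max_tokens=64,
)
print(f"latency with a long prompt: {time.perf_counter() - start:.2f}s")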

@merrymercy
Contributor

merrymercy commented Feb 11, 2024

@findalexli SGLang is mainly optimized for high-throughput large-batch serving, especially for requests with many shared prefixes.
However, in your case, you benchmarked the latency of a single short prompt, which is not what SGLang is optimized for. To obtain more realistic results, you may want to run your own dataset with larger batch sizes.
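
A minimal sketch of what such a larger-batch run might look like (the concurrency level of 64, the shared system prompt, and the request contents are assumptions for illustration, not from the thread):

import time
import openai
from concurrent.futures import ThreadPoolExecutor

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "You are a helpful AI assistant"  # shared prefix across all requests

def one_request(i):
    # Every request shares the same system prompt, so the prefix can be reused.
    return client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"List {i % 10 + 1} countries and their capitals."},
        ],
        temperature=0,
        max_tokens=64,
    )

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(one_request, range(64)))
elapsed = time.perf_counter() - start

print(f"{len(results)} requests in {elapsed:.2f}s "
      f"({len(results) / elapsed:.2f} req/s)")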

Another factor is that there is a recent PR in vLLM (vllm-project/vllm#2542) that introduced some fused kernels to improve MoE inference. We can bring it to our code as well. (#179)
