Performance issue comparing sglang to vllm. #169
Comments
10 seconds looks weird. Is it consistently 10 seconds if you run the same request multiple times? And what is the log on the server side?
I also tried your case with a similar client script (a sketch is shown below).
It takes me less than one second to get the answer. There may be some unnoticed problem; please provide more details or the server-side output so we can help.
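A minimal sketch of such a timing script, assuming the OpenAI-compatible endpoint at http://127.0.0.1:30000/v1 and the same prompt as in the issue (the exact script used may differ):

```python
# Sketch: time one chat-completion request end to end against the sglang server.
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

start = time.perf_counter()
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"latency: {time.perf_counter() - start:.2f}s")
print(response.choices[0].message.content)
```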
Hi there, just pulled the latest changes, a lot faster now. Here are the results.

Running the same curl request, I am getting 0.57 s (run 3 times), which is still a bit slower than the same curl command against vLLM, which sits at 0.45 s. I also ran the following Python script:

```python
import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```

The result is consistently around 1.3 seconds, which is almost 3 times slower than using curl. All of the numbers above were run at least 5 times.
Since your prompt is pretty short, this request likely cannot benefit much from RadixAttention. In this case, since vLLM enables CUDA graph, it might be faster in terms of prefill computation. You can try a longer prompt (e.g., >500 tokens) to see if it is still the case, for example along the lines of the sketch below.
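A quick way to build such a long-prompt test (a sketch; the filler text and repeat count are arbitrary and only approximate a >500-token prompt):

```python
# Sketch: pad the user message so prefill dominates the request,
# then check whether the latency gap between the two servers persists.
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# ~700 words of filler as a rough proxy for >500 tokens.
long_context = "The quick brown fox jumps over the lazy dog. " * 80

start = time.perf_counter()
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": long_context + "\nList 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"latency with long prompt: {time.perf_counter() - start:.2f}s")
```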
@findalexli SGLang is mainly optimized for high-throughput, large-batch serving, especially for requests with many shared prefixes; a sketch of that kind of workload is shown below. Another factor is a recent PR in vLLM (vllm-project/vllm#2542) that introduced some fused kernels to improve MoE inference. We can bring it to our code as well (#179).
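A sketch of such a shared-prefix, large-batch workload (the prefix length, question list, and concurrency level are arbitrary; the endpoint is assumed to be the sglang server from the issue, on port 30000):

```python
# Sketch: many concurrent requests that share one long system prompt,
# the workload pattern where RadixAttention's prefix caching is expected to help most.
import time
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

shared_prefix = "You are a helpful AI assistant. " + "Always answer concisely. " * 100

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": shared_prefix},
            {"role": "user", "content": question},
        ],
        temperature=0,
        max_tokens=64,
    )
    return response.choices[0].message.content

questions = [f"List {n} countries and their capitals." for n in range(1, 33)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    answers = list(pool.map(ask, questions))
print(f"{len(answers)} shared-prefix requests in {time.perf_counter() - start:.2f}s")
```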
Hi there, amazing work on RadixAttention and JSON-constrained decoding. I am running into some unexpected performance issues comparing sglang and vllm. I am using the latest pip release of vllm and sglang git-cloned as of today.
Here is my command to launch sglang:

```bash
python -m sglang.launch_server --model-path NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --port 30000 --tp 8
```
Here is my command to launch vLLM:

```bash
python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --tensor-parallel-size 8
```
Both are running in the same Conda environment with CUDA 12.1, on 8x A10G on AWS.
Here is the OpenAI-compatible curl request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."}
    ]
  }'
```
The sglang one is giving me 10 seconds of latency, while vllm is giving 0.45 seconds. The numbers were reported after the first run to avoid any cold-start issues.
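For reference, the same request can also be timed from Python using only the standard library (a sketch mirroring the curl payload above; port 8000 matches the vLLM launch command, while the sglang server would use port 30000 and the model name "default" as in the script earlier in the thread):

```python
# Sketch: time one raw HTTP chat-completion request, equivalent to the curl command above.
# Port 8000 matches the vLLM launch command; use port 30000 (and model "default") for sglang.
import json
import time
import urllib.request

payload = {
    "model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "You are a helpful AI assistant. List 3 countries and their capitals."},
    ],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(f"latency: {time.perf_counter() - start:.2f}s")
print(body["choices"][0]["message"]["content"])
```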