[Usage]: speculative model #4266
Comments
Thanks for the interest! We are merging the last speculative decoding correctness PR. After that, those two flags can be used and should produce correct output. That PR also contains tests that check speculative decoding correctness, so feel free to check it out. We will update the docs soon as well, stay tuned.
+1, but note that performance isn't good yet; we're still optimizing it.
@cadedaniel First of all, thank you for your work. I'm curious about one thing. I've tested speculative decoding (SD) to compare latency against using only the target model. All variables are the same except one: the number of requests per second. The higher the request rate, the worse the latency with SD compared to the target model alone. The gap also grows when the test data has long inputs or a large max output token count. I guess a long queue makes SD worse; of course a long queue hurts latency in general, but the ratio of deterioration is bigger with SD. Could you guess what the key factor is here?
Thanks for asking! Yes, this is expected. Intuitively, speculative decoding trades extra compute for reduced latency. When the system is under low load (a low request rate), you can spend the spare FLOPs on speculative decoding. But when the system is already compute bound, it's hard to see any benefit from speculative decoding; it can even hurt performance because it wastes some compute on wrong tokens. After vanilla speculative decoding, we will also open source policies/algorithms to automatically adjust and turn off speculative decoding. This is based on recent research from the team. Please stay tuned!
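To make that tradeoff concrete, here is a rough back-of-the-envelope sketch in plain Python (not vLLM code). The formula follows the standard speculative-sampling analysis; the acceptance rates and the roughly (k + 1)x verification-compute factor are illustrative assumptions, not measurements:

```python
# Expected tokens emitted per target-model verification step, given a
# per-token acceptance rate and k drafted tokens (standard spec-decode analysis).
def expected_tokens_per_step(acceptance_rate: float, k: int) -> float:
    a = acceptance_rate
    return (1 - a ** (k + 1)) / (1 - a)

k = 5                      # e.g. --num-speculative-tokens 5
for alpha in (0.6, 0.8):   # hypothetical per-token acceptance rates
    tokens = expected_tokens_per_step(alpha, k)
    # The target model scores k + 1 positions per step instead of 1, so compute
    # per step grows ~(k + 1)x while output only grows by `tokens`x.
    print(f"alpha={alpha}: ~{tokens:.2f} tokens per verification step, "
          f"~{(k + 1) / tokens:.2f}x more target compute per emitted token")
```

At low load that extra compute per token is effectively free, which is why the latency win only shows up when the system has spare FLOPs.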
Thanks for your reply! I have one question based on it. Does the above mean that when the system is compute bound while using speculative decoding, the draft model is more likely to emit wrong tokens, and verification in the target model also takes more time? Why would being compute bound make the draft model emit wrong tokens?
Thanks for your nice work! Looking forward to this.
Sorry for the confusion here. The probability of emitting a wrong token does not change with system load; it depends only on the draft model, the target model, and the dataset. Once those factors are fixed, the token acceptance rate is the same. What changes with load is the cost of emitting wrong tokens. When the request rate is low, you have some free compute, so you can use it for speculative decoding; if you make a mistake, that's fine because the compute was free anyway. However, when the request rate is high, the compute used for speculative decoding is no longer free: every wrong token wastes compute that could otherwise be used to serve other requests. Does that make a bit more sense?
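For a concrete picture of why the acceptance rate is independent of load, here is a minimal sketch of the standard speculative-sampling acceptance rule (not vLLM's implementation); the probabilities in the example are hypothetical:

```python
import random

# A drafted token is accepted with probability min(1, p_target / p_draft).
# The rule only looks at the two models' probabilities for that token, which
# is why acceptance depends on the draft model, target model, and data,
# never on request rate or system load.
def accept_draft_token(p_target: float, p_draft: float) -> bool:
    return random.random() < min(1.0, p_target / p_draft)

# Hypothetical probabilities the two models assign to the same drafted token.
print(accept_draft_token(p_target=0.30, p_draft=0.50))  # accepted ~60% of the time
print(accept_draft_token(p_target=0.45, p_draft=0.20))  # always accepted
```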
@LiuXiaoxuanPKU Thank you for your kind explanation! Now I understand it clearly. I'm looking forward to your contributions on speculative decoding.
I have seen that support was added in v0.4.1 and am testing it. I am starting the server with the following command: python -m vllm.entrypoints.openai.api_server But when I request the endpoint, I get the following error on the server side:
Can anyone please help me with this?
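For anyone hitting the same question: a hedged sketch of how the OpenAI-compatible server can be launched with the two flags from this issue and then queried with the openai client. The draft model, port, and the --use-v2-block-manager requirement are assumptions based on the 0.4.x-era setup, not values taken from this thread:

```python
# Sketch only. The launch command is shown as a comment; the draft model
# (JackFram/llama-68m) and port 8000 are assumptions, not from this thread.
#
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-2-7b-chat-hf \
#       --speculative-model JackFram/llama-68m \
#       --num-speculative-tokens 5 \
#       --use-v2-block-manager   # reportedly required for spec decode in 0.4.x
#
# Speculative decoding is configured server-side, so the client request
# looks exactly like a normal OpenAI completion call.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="San Francisco is a",
    max_tokens=64,
)
print(completion.choices[0].text)
```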
By the way, I am able to run with a speculative model using the LLM class. I was wondering whether the openai.api_server also supports it:

from vllm import LLM, SamplingParams
prompts = [
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
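To fill in the truncated snippet above, a minimal sketch of offline speculative decoding with the LLM class might look like the following. The draft model (JackFram/llama-68m), the prompts, and the sampling settings are placeholder assumptions; the constructor arguments mirror the --speculative-model and --num-speculative-tokens flags:

```python
# Minimal sketch, not the original poster's full script. Draft model,
# prompts, and sampling settings below are placeholder assumptions.
from vllm import LLM, SamplingParams

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

prompts = [
    "The capital of France is",
    "Speculative decoding speeds up generation by",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(
    model=MODEL_NAME,
    speculative_model="JackFram/llama-68m",  # assumed draft model; swap in your own
    num_speculative_tokens=5,
    use_v2_block_manager=True,  # reportedly required for spec decode in 0.4.x
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```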
How would you like to use vllm
I am curious about the speculative model support in vLLM. I could not find much about speculative decoding in the docs, except the following flags:
--speculative-model
The name of the draft model to be used in speculative decoding.
--num-speculative-tokens
The number of speculative tokens to sample from the draft model in speculative decoding.
I am curious whether this is supported now, and if so, how to use it (if possible, prompt-based decoding like in transformers).
Thanks