
[Usage]: speculative model #4266

Closed
arunpatala opened this issue Apr 22, 2024 · 10 comments
Labels
usage How to use vllm

Comments

@arunpatala

arunpatala commented Apr 22, 2024

How would you like to use vllm

I am curious about the speculative model support in vLLM. I could not find much about speculative decoding in the docs, except for the following flags:

--speculative-model
The name of the draft model to be used in speculative decoding.

--num-speculative-tokens
The number of speculative tokens to sample from the draft model in speculative decoding.

I am curious whether this is supported now, and if so, how to use it (ideally prompt-based speculative decoding like in transformers).

thanks

arunpatala added the usage label Apr 22, 2024
@LiuXiaoxuanPKU
Collaborator

Thanks for the interest! We are merging the last speculative decoding correctness PR. After that, those two flags can be used and should produce correct results. That PR also contains tests that check speculative decoding correctness, so feel free to check it out. We will update the docs soon as well, so stay tuned.

@cadedaniel
Collaborator

+1, but note that performance isn't good yet, we're still optimizing it

@tolry418

@cadedaniel First of all, thank you for your work. I'm curious about one thing. I have tested speculative decoding (SD) to compare latency with only the target model versus with SD. All variables are the same except one: the number of requests per second. The higher the request rate, the worse the latency with SD compared to the target model alone. Test data with long inputs or a large max output token count also does worse with SD. I guess a large queue makes SD worse; of course a large queue hurts latency in general, but the ratio of deterioration is bigger with SD. Could you guess what the key factor is here?

@LiuXiaoxuanPKU
Collaborator

@cadedaniel First of all, thank you for your work. I'm curious about one thing. I have tested speculative decoding (SD) to compare latency with only the target model versus with SD. All variables are the same except one: the number of requests per second. The higher the request rate, the worse the latency with SD compared to the target model alone. Test data with long inputs or a large max output token count also does worse with SD. I guess a large queue makes SD worse; of course a large queue hurts latency in general, but the ratio of deterioration is bigger with SD. Could you guess what the key factor is here?

Thanks for asking! Yes, this is expected. Intuitively, speculative decoding trades extra compute for reduced latency. When the system has a low load (low request rate), you can use the spare FLOPs for speculative decoding. But when the system is already compute-bound, it's hard to see any benefit from speculative decoding. Speculative decoding can even hurt performance because it wastes some compute on wrong tokens.

After vanilla speculative decoding, we will also open-source policies/algorithms to automatically adjust and turn off speculative decoding. This is based on recent research from the team. Please stay tuned!
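
As a back-of-the-envelope illustration of the compute-for-latency trade-off described above (not vLLM code; the acceptance rate and draft length are made-up values), the latency win comes from how many output tokens each expensive target-model forward pass yields:

# Rough sketch of the latency side of the trade-off. Under the common
# simplified model where each draft token is accepted independently with
# probability alpha, a step with k draft tokens emits on average
# (1 - alpha**(k + 1)) / (1 - alpha) tokens per target forward pass
# (the accepted draft tokens plus the one token the target always produces).
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: ~{expected_tokens_per_step(alpha, k=5):.2f} tokens per target step")
# A higher acceptance rate stretches each target forward pass over more output
# tokens, which only helps if there are spare FLOPs to run the draft model and
# the extra verification work.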

@tolry418

tolry418 commented Apr 23, 2024

But when the system is already compute-bound, it's hard to see any benefit from speculative decoding. Speculative decoding can even hurt performance because it wastes some compute on wrong tokens.

Thanks for your reply! I have one question based on it. Does the sentence above mean that when the system is compute-bound while using speculative decoding, the draft model is more likely to emit wrong tokens, and verification in the target model also takes more time? Why would being compute-bound make the draft model emit wrong tokens?
Please correct me if I misunderstood. Thank you. @LiuXiaoxuanPKU

@arunpatala
Author

Thanks for your nice work! Looking forward to this.

@LiuXiaoxuanPKU
Collaborator

But when the system is already compute-bound, it's hard to see any benefit from speculative decoding. Speculative decoding can even hurt performance because it wastes some compute on wrong tokens.

Thanks for your reply! I have one question based on it. Does the sentence above mean that when the system is compute-bound while using speculative decoding, the draft model is more likely to emit wrong tokens, and verification in the target model also takes more time? Why would being compute-bound make the draft model emit wrong tokens? Please correct me if I misunderstood. Thank you. @LiuXiaoxuanPKU

Sorry for the confusion here. The probability of emitting a wrong token does not change with system load. It only depends on the draft model, the target model, and the dataset; once those factors are fixed, the token acceptance rate is the same. But the 'cost' of emitting wrong tokens does depend on system load. When the request rate is low, you have some free compute, so you can use it for speculative decoding; if you make a mistake, that's fine because the compute was free anyway. However, when the request rate is high, the compute you use for speculative decoding is no longer free: emitting a wrong token wastes compute that could otherwise be used to serve other requests. Does it make a bit more sense?
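
To put rough numbers on the 'cost of wrong tokens' point above (same simplified acceptance model as the earlier sketch; the values are purely illustrative):

# The acceptance rate is fixed by the models and data, but the rejected draft
# tokens represent compute that is free at low load and costly at high load.
def expected_accepted(alpha: float, k: int) -> float:
    # expected number of the k proposed draft tokens that get accepted
    return alpha * (1 - alpha ** k) / (1 - alpha)

alpha, k = 0.7, 5
accepted = expected_accepted(alpha, k)
print(f"accepted ~= {accepted:.2f}, rejected ~= {k - accepted:.2f} of {k} draft tokens")
# At a low request rate the FLOPs spent on the rejected tokens were idle
# anyway; at a high request rate those same FLOPs could have served other
# requests, so the waste starts to hurt overall latency.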

@tolry418

@LiuXiaoxuanPKU Thank you for your kind explanation! Now I understand it clearly. I look forward to your contributions on speculative decoding.

@arunpatala
Author

I have seen that support has been added in v4.1 and am testing it.

I am starting the server with the following command:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --speculative-model JackFram/llama-68m \
    --num-speculative-tokens 5 \
    --use-v2-block-manager \
    --num-lookahead-slots 5

But when I request the endpoint, I get the following error on the server side:

TypeError: SpecDecodeWorker.execute_model() missing 1 required positional argument: 'num_lookahead_slots'

Can anyone please help me with this?
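
For reference, the kind of client call that triggers the error above would look roughly like this (a sketch only; it assumes the server runs on the default localhost:8000 and uses the OpenAI-compatible /v1/completions endpoint):

import requests

# Minimal client-side sketch: POST to the OpenAI-compatible completions
# endpoint exposed by vllm.entrypoints.openai.api_server (default host/port assumed).
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
)
print(resp.json())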

@arunpatala
Author

arunpatala commented Apr 24, 2024

By the way, I am able to run with a speculative model using the LLM class. I was wondering if the openai.api_server also supports it:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
llm = LLM(model=MODEL_NAME,
          speculative_model="JackFram/llama-68m",
          num_speculative_tokens=5,
          use_v2_block_manager=True)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
