
[Usage]: speculative model #4266

Closed
arunpatala opened this issue Apr 22, 2024 · 10 comments
Labels
usage How to use vllm

Comments

@arunpatala

arunpatala commented Apr 22, 2024

How would you like to use vllm

I am curious about the speculative model support in vLLM. I could not find much about speculative decoding in the docs, except for the following flags:

--speculative-model
The name of the draft model to be used in speculative decoding.

--num-speculative-tokens
The number of speculative tokens to sample from the draft model in speculative decoding.

I am curious whether this is supported now, and if so, how to use it (ideally prompt-based speculative decoding like in transformers).

thanks

arunpatala added the usage label Apr 22, 2024
@LiuXiaoxuanPKU
Collaborator

Thanks for the interest! We are merging the last speculative decoding correctness PR. After that, those two flags can be used and should produce correct results. That PR also contains tests that check speculative decoding correctness, so feel free to check it out. We will update the docs soon as well, so stay tuned.

@cadedaniel
Collaborator

+1, but note that performance isn't good yet, we're still optimizing it

@tolry418

@cadedaniel First of all, thank you for your work. I'm curious about one thing. I have tested speculative decoding (SD) to compare latency with only the target model versus with SD. All variables are the same except one: the number of requests per second. The higher the request rate, the worse the latency with SD compared to the target model alone. Test data with long inputs or a large max output token count also does worse with SD. I guess a large queue makes SD worse; of course a large queue hurts latency in general, but the ratio of deterioration is bigger with SD. Could you guess what the key factor is here?

@LiuXiaoxuanPKU
Collaborator

@cadedaniel First of all, thank you for your work. I'm curious about one thing. I have tested speculative decoding (SD) to compare latency with only the target model versus with SD. All variables are the same except one: the number of requests per second. The higher the request rate, the worse the latency with SD compared to the target model alone. Test data with long inputs or a large max output token count also does worse with SD. I guess a large queue makes SD worse; of course a large queue hurts latency in general, but the ratio of deterioration is bigger with SD. Could you guess what the key factor is here?

Thanks for asking! Yes, this is expected. Intuitively, speculative decoding trades extra compute for reduced latency. When the system has a low load (low request rate), you can use the spare FLOPs for speculative decoding. But when the system is already compute-bound, it's hard to see any benefit from speculative decoding. Speculative decoding can even hurt performance because it wastes some compute on wrong tokens.

After vanilla speculative decoding, we will also open-source policies/algorithms to automatically adjust and turn off speculative decoding. This is based on recent research from the team. Please stay tuned!
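
As a back-of-the-envelope illustration of the compute-for-latency trade-off described above (not vLLM code; the acceptance rate and draft length are made-up values), the latency win comes from how many output tokens each expensive target-model forward pass yields:

# Rough sketch of the latency side of the trade-off. Under the common
# simplified model where each draft token is accepted independently with
# probability alpha, a step with k draft tokens emits on average
# (1 - alpha**(k + 1)) / (1 - alpha) tokens per target forward pass
# (the accepted draft tokens plus the one token the target always produces).
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: ~{expected_tokens_per_step(alpha, k=5):.2f} tokens per target step")
# A higher acceptance rate stretches each target forward pass over more output
# tokens, which only helps if there are spare FLOPs to run the draft model and
# the extra verification work.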

@tolry418

tolry418 commented Apr 23, 2024

But when the system is already compute-bound, it's hard to see any benefit from speculative decoding. Speculative decoding can even hurt performance because it wastes some compute on wrong tokens.

Thanks for your reply! I have one question based on it. Does the sentence above mean that when the system is compute-bound while using speculative decoding, the draft model is more likely to emit wrong tokens, and verification in the target model also takes more time? Why would being compute-bound make the draft model emit wrong tokens?
Please correct me if I misunderstood. Thank you. @LiuXiaoxuanPKU

@arunpatala
Author

Thanks for your nice work! Looking forward to this.

@LiuXiaoxuanPKU
Collaborator

But when the system is already compute-bound, it's hard to see any benefit from speculative decoding. Speculative decoding can even hurt performance because it wastes some compute on wrong tokens.

Thanks for your reply! I have one question based on it. Does the sentence above mean that when the system is compute-bound while using speculative decoding, the draft model is more likely to emit wrong tokens, and verification in the target model also takes more time? Why would being compute-bound make the draft model emit wrong tokens? Please correct me if I misunderstood. Thank you. @LiuXiaoxuanPKU

Sorry for the confusion here. The probability of emitting a wrong token does not change with system load. It only depends on the draft model, the target model, and the dataset; once those factors are fixed, the token acceptance rate is the same. But the 'cost' of emitting wrong tokens does depend on system load. When the request rate is low, you have some free compute, so you can use it for speculative decoding; if you make a mistake, that's fine because the compute was free anyway. However, when the request rate is high, the compute you use for speculative decoding is no longer free: emitting a wrong token wastes compute that could otherwise be used to serve other requests. Does it make a bit more sense?
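
To put rough numbers on the 'cost of wrong tokens' point above (same simplified acceptance model as the earlier sketch; the values are purely illustrative):

# The acceptance rate is fixed by the models and data, but the rejected draft
# tokens represent compute that is free at low load and costly at high load.
def expected_accepted(alpha: float, k: int) -> float:
    # expected number of the k proposed draft tokens that get accepted
    return alpha * (1 - alpha ** k) / (1 - alpha)

alpha, k = 0.7, 5
accepted = expected_accepted(alpha, k)
print(f"accepted ~= {accepted:.2f}, rejected ~= {k - accepted:.2f} of {k} draft tokens")
# At a low request rate the FLOPs spent on the rejected tokens were idle
# anyway; at a high request rate those same FLOPs could have served other
# requests, so the waste starts to hurt overall latency.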

@tolry418

@LiuXiaoxuanPKU Thank you for your kind explanation! Now I understand it clearly. I look forward to your contributions on speculative decoding.

@arunpatala
Author

I have seen that support has been added in v4.1 and am testing it.

I am starting the server with the following command:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --speculative-model JackFram/llama-68m \
    --num-speculative-tokens 5 \
    --use-v2-block-manager \
    --num-lookahead-slots 5

But when I request the endpoint, I get the following error on the server side:

TypeError: SpecDecodeWorker.execute_model() missing 1 required positional argument: 'num_lookahead_slots'

Can anyone please help me with this?
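
For reference, the kind of client call that triggers the error above would look roughly like this (a sketch only; it assumes the server runs on the default localhost:8000 and uses the OpenAI-compatible /v1/completions endpoint):

import requests

# Minimal client-side sketch: POST to the OpenAI-compatible completions
# endpoint exposed by vllm.entrypoints.openai.api_server (default host/port assumed).
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
)
print(resp.json())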

@arunpatala
Author

arunpatala commented Apr 24, 2024

By the way, I am able to run with a speculative model using the LLM class. I was wondering if the openai.api_server also supports it:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
llm = LLM(model=MODEL_NAME,
          speculative_model="JackFram/llama-68m",
          num_speculative_tokens=5,
          use_v2_block_manager=True)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
