[Feature]: Support stream_options with vLLM #5197
Comments
Hi @pennycoders, this is already supported behaviour. We check if the response contains stream_options or the provider-specific equivalent and return that. If you don't see this on latest, try bumping and let me know what you see. Where in the docs would this have been useful to see?
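For reference, a minimal sketch of what reading the streamed usage through LiteLLM could look like from the client side (assumptions: a LiteLLM proxy running at http://localhost:4000, an OpenAI Python SDK recent enough to support stream_options, and a placeholder API key; the model name is taken from the report below):

```python
# Sketch only: assumes a LiteLLM proxy at http://localhost:4000 and an openai
# client version that supports the stream_options parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about flowers"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # Regular chunks carry content deltas; the final chunk (empty choices)
    # is expected to carry the usage block when include_usage is set.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:
        print("\nusage:", chunk.usage)
```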
Hi @krrishdholakia,

Thank you very much for the quick reply. I am using vLLM as a backend and proxying both streaming and non-streaming requests to it via LiteLLM. When I call vLLM's /v1/chat/completions endpoint directly with stream_options, the last chunk before [DONE] looks like this:

{
"id": "chat-a071f1a541c648d9ac615559fb7c3fab",
"object": "chat.completion.chunk",
"created": 1723664201,
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"choices": [],
"usage": {
"prompt_tokens": 4539,
"total_tokens": 5392,
"completion_tokens": 853
}
}

However, when I call this exact instance via LiteLLM, I get the following on the chunk before [DONE]:

{
"id": "chat-ac6b39f408564693b4f20cbe62513b2b",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"delta": {}
}
],
"created": 1723664471,
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"object": "chat.completion.chunk"
}

Regarding the documentation, I don't see vLLM mentioned on this page. In the code, I see an if-else statement here: https://github.com/BerriAI/litellm/blob/main/litellm/llms/vllm.py#L86

Can you please tell me what I am doing wrong? Please find the request I send in both cases below:

{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [
{
"role": "system",
"content": "**System Role:**\nYou are a poet and you write poems. You will write a poem about whatever subject is given to you"
},
{
"role": "user",
"content": "Flowers"
}
],
"temperature": 1.00,
"top_p": 0.9,
"n": 1,
"stream": true,
"stream_options": {
"include_usage": true,
"continuous_usage_stats": true
},
"seed": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"logit_bias": {}
}
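For what it's worth, a minimal sketch to reproduce this side-by-side comparison could look like the following (assumptions: vLLM served directly at http://localhost:8000/v1, the LiteLLM proxy at http://localhost:4000, and a placeholder API key):

```python
from openai import OpenAI

# Same request in both cases; only include_usage is needed for the comparison.
REQUEST = dict(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a poet and you write poems."},
        {"role": "user", "content": "Flowers"},
    ],
    stream=True,
    stream_options={"include_usage": True},
)

def final_chunk(base_url: str):
    """Send the same streaming request and return the last chunk before [DONE]."""
    client = OpenAI(base_url=base_url, api_key="sk-anything")
    last = None
    for chunk in client.chat.completions.create(**REQUEST):
        last = chunk
    return last

# vLLM directly vs. the same request routed through the LiteLLM proxy
print("vLLM usage:   ", final_chunk("http://localhost:8000/v1").usage)
print("LiteLLM usage:", final_chunk("http://localhost:4000").usage)
```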
Hi @pennycoders, which version is this on? If not the latest, can you try bumping? Here's the relevant code block for handling streaming usage:

Line 10552 in 22243c6
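Conceptually, the handling being pointed at boils down to something like the sketch below. This is only an illustration of the idea, not LiteLLM's actual code; the function names are made up for the example.

```python
from typing import Any, Optional

def extract_streaming_usage(raw_chunk: dict[str, Any]) -> Optional[dict[str, Any]]:
    # Hypothetical helper: return the provider's usage block if the final
    # streamed chunk included one (vLLM puts it on the chunk with empty choices).
    return raw_chunk.get("usage")

def build_final_chunk(raw_chunk: dict[str, Any], include_usage: bool) -> dict[str, Any]:
    # Hypothetical helper: rebuild the OpenAI-style chunk returned to the client,
    # keeping the usage block only when the caller asked for it via
    # stream_options={"include_usage": True}.
    out = {
        "id": raw_chunk["id"],
        "object": "chat.completion.chunk",
        "created": raw_chunk["created"],
        "model": raw_chunk["model"],
        "choices": raw_chunk.get("choices", []),
    }
    usage = extract_streaming_usage(raw_chunk)
    if include_usage and usage is not None:
        out["usage"] = usage
    return out
```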
Hey @krrishdholakia, I was using main-v1.40.4. Just tested with the latest and it works. Thank you very much!

Alex
Hi @pennycoders, curious: do you use LiteLLM Proxy in production today? If so, I'd love to hop on a call and learn how we can improve LiteLLM for you.
The Feature
Requests made with stream: true to LiteLLM should pass the usage information through when it is provided by the backend, in this case vLLM.
Motivation, pitch
Hi,
Given that vLLM supports usage information during streaming requests (please see this PR), it would be suitable for LiteLLM to support that as well. At the time of opening this issue, it does not seem to be supported, or if it is, it is not documented. Please keep in mind that I am willing to make this contribution myself.
Thanks,
Alex
Twitter / LinkedIn details
No response