
[Feature]: Support stream_options with vLLM #5197

Closed
pennycoders opened this issue Aug 14, 2024 · 5 comments

Labels
enhancement New feature or request

Comments

@pennycoders

The Feature

Requests made with stream: true to LiteLLM should pass the usage information through when it is provided by the backend, in this case vLLM.

Motivation, pitch

Hi,
Given that vLLM supports usage information during streaming requests (please see this PR), it would be suitable for LiteLLM to support that as well. At the time of opening this issue, it does not appear to be supported, or if it is, it is not documented. Please keep in mind that I am willing to make this contribution myself.

Thanks,
Alex

Twitter / LinkedIn details

No response

@pennycoders pennycoders added the enhancement New feature or request label Aug 14, 2024
@krrishdholakia
Contributor

Hi @pennycoders, this is already supported behaviour. We check whether the response contains stream_options usage (or the provider-specific equivalent) and return it. If you don't see this on the latest version, try bumping and let me know what you see.


Where in the docs would this have been useful to see?
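
(For reference, here is a minimal sketch of requesting streaming usage through the litellm SDK against a vLLM backend. The model name and api_base are placeholders, and it assumes litellm forwards stream_options to the OpenAI-compatible endpoint as described in the reply above.)

import litellm

# Minimal sketch: model/api_base are placeholders; assumes litellm forwards
# stream_options to the OpenAI-compatible vLLM server.
response = litellm.completion(
    model="openai/meta-llama/Meta-Llama-3.1-70B-Instruct",  # OpenAI-compatible route
    api_base="http://localhost:8000/v1",                    # placeholder vLLM endpoint
    messages=[{"role": "user", "content": "Flowers"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in response:
    # When include_usage is honored, the final chunk carries a populated
    # usage object; earlier chunks only carry content deltas.
    usage = getattr(chunk, "usage", None)
    if usage is not None:
        print(usage)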

@pennycoders
Author

Hi @krrishdholakia,

Thank you very much for the quick reply.

Well, I am using vLLM as a backend and proxying both streaming and non-streaming requests to it via LiteLLM. When I call vLLM's /v1/chat/completions endpoint directly with stream_options, the last chunk before [DONE] looks like this:

{
    "id": "chat-a071f1a541c648d9ac615559fb7c3fab",
    "object": "chat.completion.chunk",
    "created": 1723664201,
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "choices": [],
    "usage": {
        "prompt_tokens": 4539,
        "total_tokens": 5392,
        "completion_tokens": 853
    }
}

However, when I call this exact instance via LiteLLM, I get the following on the chunk before [DONE]:

{
    "id": "chat-ac6b39f408564693b4f20cbe62513b2b",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "delta": {}
        }
    ],
    "created": 1723664471,
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "object": "chat.completion.chunk"
}

Regarding the documentation, I don't see vLLM mentioned on this page:
https://docs.litellm.ai/docs/completion/input

In the code, I see an if-else statement here: https://github.com/BerriAI/litellm/blob/main/litellm/llms/vllm.py#L86

Can you please tell me what I am doing wrong? Please find the request I send in both cases below:

{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "**System Role:**\nYou are a poet and you write poems. You will write a poem about whatever subject is given to you"
        },
        {
            "role": "user",
            "content": "Flowers"
        }
    ],
    "temperature": 1.00,
    "top_p": 0.9,
    "n": 1,
    "stream": true,
    "stream_options": {
        "include_usage": true,
        "continuous_usage_stats": true
    },
    "seed": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "logit_bias": {}
}
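
(For completeness, a minimal sketch of sending this request to the LiteLLM proxy and scanning the SSE stream for a usage chunk; the proxy URL and API key below are placeholders.)

import json
import requests

PROXY_URL = "http://localhost:4000/v1/chat/completions"  # placeholder LiteLLM proxy URL
API_KEY = "sk-1234"                                       # placeholder key

payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Flowers"}],
    "stream": True,
    "stream_options": {"include_usage": True, "continuous_usage_stats": True},
}

with requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # SSE lines look like: data: {...}  ...  data: [DONE]
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # The last chunk before [DONE] should include the usage object.
        if chunk.get("usage"):
            print(chunk["usage"])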

@krrishdholakia
Contributor

Hi @pennycoders, which version is this on? If it's not the latest, can you try bumping?

Here's the relevant code block for handling streaming usage:

# Copy the usage from the assembled complete streaming response onto the returned response object
setattr(response, "usage", complete_streaming_response.usage)

@pennycoders
Author

Hey @krrishdholakia,

I was using main-v1.40.4. Just tested with latest and it works.

Thank you very much!

Alex

@ishaan-jaff
Contributor

Hi @pennycoders, curious: do you use LiteLLM Proxy in production today? If so, I'd love to hop on a call and learn how we can improve LiteLLM for you.
