feat: updated inline vllm inference provider #880

Open · wants to merge 24 commits into main
Conversation

frreiss (Contributor) commented Jan 26, 2025

What does this PR do?

This PR updates the inline vLLM inference provider in several significant ways:

  • Models are now attached at run time to instances of the provider via the .../models API instead of hard-coding the model's full name into the provider's YAML configuration (see the client-side sketch after this list).
  • The provider supports models that are not Meta Llama models. Any model that vLLM supports can be loaded by passing its Hugging Face coordinates in the "provider_model_id" field. Custom fine-tuned versions of Meta Llama models can be loaded by specifying a path on local disk in "provider_model_id".
  • To implement full chat completions support, including tool calling and constrained decoding, the provider now routes the chat_completions API to a captive (i.e., called directly in-process, not via HTTPS) instance of vLLM's OpenAI-compatible server.
  • The logprobs parameter and the completions API also work.
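
Below is a minimal sketch of what run-time model attachment looks like from the client side. It assumes a running Llama Stack server at the base URL used in the test logs below, with this provider registered under the id "vllm"; the model aliases, Hugging Face coordinates, and local path are illustrative placeholders, not part of this PR.

# Sketch only: attach models to the inline vLLM provider at run time
# via the models API. All identifiers below are placeholders.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# Any model that vLLM supports, by Hugging Face coordinates:
client.models.register(
    model_id="granite-8b",                                     # alias seen by callers
    provider_id="vllm",                                        # this inference provider
    provider_model_id="ibm-granite/granite-3.0-8b-instruct",   # HF coordinates
)

# A custom fine-tuned Llama checkpoint, by local path:
client.models.register(
    model_id="my-finetune",
    provider_id="vllm",
    provider_model_id="/mnt/models/llama-3.2-3b-finetuned",    # path on local disk
)

# Chat completions against the alias are routed to the captive,
# in-process vLLM OpenAI-compatible server:
response = client.inference.chat_completion(
    model_id="granite-8b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)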

Test Plan

Existing tests in llama_stack/providers/tests/inference/test_text_inference.py have good coverage of the new functionality. These tests can be invoked as follows:

cd llama-stack && pytest \
    -vvv \
    llama_stack/providers/tests/inference/test_text_inference.py \
    --providers inference=vllm \
    --inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items                                                                               

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]

=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================

Sources

Before submitting

  • Ran pre-commit to handle lint / formatting issues.
  • Read the contributor guideline's Pull Request section.
  • Updated relevant documentation.
  • Wrote necessary unit or integration tests.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 26, 2025
ashwinb (Contributor) left a comment
Wonderful PR, thank you!

I have a few comments inline.

@frreiss frreiss requested a review from ehhuang as a code owner February 4, 2025 21:09
@frreiss frreiss changed the title from "Updated inline vllm inference provider" to "feat: updated inline vllm inference provider" on Feb 12, 2025
frreiss (Contributor, Author) commented Feb 12, 2025

All the code review comments should be addressed now. Would one of the project leads mind having another look at the current change set before I merge changes from main into this branch?

leseb (Contributor) left a comment

Great job!

leseb (Contributor) left a comment

Great job, thanks for addressing the comments! 🥳

yanxi0830 (Contributor) commented:
Thanks! Could you also help attach the results from unit tests in 'tests/client-sdk/inference' & 'tests/client-sdk/agents'?

ashwinb (Contributor) commented Feb 20, 2025

@frreiss sorry this PR has gone stale again, partly due to delays on our side of the review, and it now has a few non-trivial conflicts. I tried to resolve some of them, but it is not straightforward. If you could take a pass at it soon, we will be sure to merge it relatively quickly.

frreiss (Contributor, Author) commented Feb 21, 2025

@ashwinb sure, I'll have a look today. The work might spill over into Monday.

frreiss (Contributor, Author) commented Feb 21, 2025

Thanks! Could you also help attach the results from unit tests in 'tests/client-sdk/inference' & 'tests/client-sdk/agents'?

Results from tests/client-sdk/inference:

++ pytest -vvv tests/client-sdk/inference --inference-model meta-llama/Llama-3.2-3B-Instruct --embedding-model meta-llama/Llama-3.2-3B-Instruct
/mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.8.0-1019-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3', 'metadata': '3.1.1'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, asyncio-0.25.3, metadata-3.1.1
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 24 items                                                                                                                                                                                

tests/client-sdk/inference/test_embedding.py::test_embedding_text[meta-llama/Llama-3.2-3B-Instruct-list[string]] FAILED                                                                     [  4%]
tests/client-sdk/inference/test_embedding.py::test_embedding_text[meta-llama/Llama-3.2-3B-Instruct-list[text]] FAILED                                                                       [  8%]
tests/client-sdk/inference/test_embedding.py::test_embedding_image[meta-llama/Llama-3.2-3B-Instruct-list[url,base64]] SKIPPED (Media is not supported)                                      [ 12%]
tests/client-sdk/inference/test_embedding.py::test_embedding_image[meta-llama/Llama-3.2-3B-Instruct-list[url,string,base64,text]] SKIPPED (Media is not supported)                          [ 16%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_non_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                                              [ 20%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                                                  [ 25%]
tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.2-3B-Instruct] XFAIL (inline::vllm doesn't support log probs yet)             [ 29%]
tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.2-3B-Instruct] XFAIL (inline::vllm doesn't support log probs yet)                 [ 33%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output[meta-llama/Llama-3.2-3B-Instruct-completion-01] PASSED                                            [ 37%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.2-3B-Instruct-Which planet do humans live on?-Earth] PASSED                   [ 41%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.2-3B-Instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED [ 45%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.2-3B-Instruct-What's the name of the Sun in latin?-Sol] PASSED                    [ 50%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.2-3B-Instruct-What is the name of the US captial?-Washington] PASSED              [ 54%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                   [ 58%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                       [ 62%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[meta-llama/Llama-3.2-3B-Instruct] PASSED                                             [ 66%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[meta-llama/Llama-3.2-3B-Instruct] PASSED                                                 [ 70%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_structured_output[meta-llama/Llama-3.2-3B-Instruct-chat_completion-01] PASSED                                  [ 75%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[meta-llama/Llama-3.2-3B-Instruct-True] PASSED                                [ 79%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[meta-llama/Llama-3.2-3B-Instruct-False] PASSED                               [ 83%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] FAILED                                              [ 87%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] FAILED                                                  [ 91%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_base64[meta-llama/Llama-3.2-11B-Vision-Instruct-url] FAILED                                                 [ 95%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_base64[meta-llama/Llama-3.2-11B-Vision-Instruct-data] FAILED                                                [100%]

The embedding tests fail because the inline vLLM provider did not implement embeddings before this PR, and this PR does not add them.

The image inference tests fail due to some strange behavior in Llama Stack's handling of image input. Specifically, the current implementation of chat_completion_request_to_prompt(), when fed a request containing an image, converts that request to an array of tokens containing image placeholder tokens with token ID 128256. It then attempts to pass this token array to ChatFormat.encode_dialog_prompt(), which cannot encode token ID 128256. I presume this is a bug that will be fixed eventually.
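
As a hedged illustration of the failure mode (the names and the validation logic here are assumptions for illustration, not the actual Llama Stack code): the Llama 3 tokenizer's vocabulary covers token IDs 0 through 128255, so the image placeholder ID 128256 falls one past the end and cannot be encoded or decoded as ordinary text.

# Hypothetical sketch of the bug described above; function and variable
# names are stand-ins, not the real Llama Stack implementation.
VOCAB_SIZE = 128_256  # Llama 3 tokenizer: valid token IDs are 0..128255

def encode_dialog_prompt(token_ids: list[int]) -> list[int]:
    # Stand-in for the encoding step that trips over the placeholder.
    for tid in token_ids:
        if tid >= VOCAB_SIZE:
            raise ValueError(f"token ID {tid} is outside the vocabulary")
    return token_ids

# A chat request containing an image becomes a token array that includes
# the out-of-vocabulary image placeholder token:
prompt_tokens = [128000, 9125, 128256, 128009]  # header, role, <image>, eot
encode_dialog_prompt(prompt_tokens)  # raises ValueError on token ID 128256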

frreiss (Contributor, Author) commented Feb 21, 2025

Thanks! Could you also help attach the results from unit tests in 'tests/client-sdk/inference' & 'tests/client-sdk/agents'?

Results from tests/client-sdk/agents:

++ cd llama-stack
++ LLAMA_STACK_BASE_URL=http://localhost:5000/
++ pytest -vvv tests/client-sdk/agents --inference-model meta-llama/Llama-3.2-3B-Instruct --embedding-model meta-llama/Llama-3.2-3B-Instruct
/mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================== test session starts ====================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.8.0-1019-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3', 'metadata': '3.1.1'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, asyncio-0.25.3, metadata-3.1.1
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 10 items                                                                                                          

tests/client-sdk/agents/test_agents.py::test_agent_simple[meta-llama/Llama-3.2-3B-Instruct] PASSED                    [ 10%]
tests/client-sdk/agents/test_agents.py::test_tool_config[meta-llama/Llama-3.2-3B-Instruct] PASSED                     [ 20%]
tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search[meta-llama/Llama-3.2-3B-Instruct] FAILED         [ 30%]
tests/client-sdk/agents/test_agents.py::test_builtin_tool_code_execution[meta-llama/Llama-3.2-3B-Instruct] FAILED     [ 40%]
tests/client-sdk/agents/test_agents.py::test_code_interpreter_for_attachments[meta-llama/Llama-3.2-3B-Instruct] FAILED [ 50%]
tests/client-sdk/agents/test_agents.py::test_custom_tool[meta-llama/Llama-3.2-3B-Instruct] PASSED                     [ 60%]
tests/client-sdk/agents/test_agents.py::test_tool_choice[meta-llama/Llama-3.2-3B-Instruct] FAILED                     [ 70%]
tests/client-sdk/agents/test_agents.py::test_rag_agent[meta-llama/Llama-3.2-3B-Instruct] FAILED                       [ 80%]
tests/client-sdk/agents/test_agents.py::test_rag_and_code_agent[meta-llama/Llama-3.2-3B-Instruct] FAILED              [ 90%]
tests/client-sdk/agents/test_agents.py::test_create_turn_response[meta-llama/Llama-3.2-3B-Instruct] PASSED            [100%]

Tests involving web search, retrieval, and code execution fail because my environment doesn't have those tools. Other tests pass.

Labels: CLA Signed (managed by the Meta Open Source bot)
Projects: None yet
6 participants