feat: updated inline vllm inference provider #880

Open · wants to merge 24 commits into main
Conversation

frreiss (Contributor) commented Jan 26, 2025

What does this PR do?

This PR updates the inline vLLM inference provider in several significant ways:

  • Models are now attached at run time to instances of the provider via the .../models API instead of hard-coding the model's full name into the provider's YAML configuration (see the client-side sketch after this list).
  • The provider supports models that are not Meta Llama models. Any model that vLLM supports can be loaded by passing its Hugging Face coordinates in the "provider_model_id" field. Custom fine-tuned versions of Meta Llama models can be loaded by specifying a path on local disk in "provider_model_id".
  • To implement full chat completions support, including tool calling and constrained decoding, the provider now routes the chat_completions API to a captive (i.e., called directly in-process, not via HTTPS) instance of vLLM's OpenAI-compatible server.
  • The logprobs parameter and the completions API also work.
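
Below is a minimal sketch of what run-time model attachment looks like from the client side. It assumes a running Llama Stack server at the base URL used in the test logs below, with this provider registered under the id "vllm"; the model aliases, Hugging Face coordinates, and local path are illustrative placeholders, not part of this PR.

# Sketch only: attach models to the inline vLLM provider at run time
# via the models API. All identifiers below are placeholders.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# Any model that vLLM supports, by Hugging Face coordinates:
client.models.register(
    model_id="granite-8b",                                     # alias seen by callers
    provider_id="vllm",                                        # this inference provider
    provider_model_id="ibm-granite/granite-3.0-8b-instruct",   # HF coordinates
)

# A custom fine-tuned Llama checkpoint, by local path:
client.models.register(
    model_id="my-finetune",
    provider_id="vllm",
    provider_model_id="/mnt/models/llama-3.2-3b-finetuned",    # path on local disk
)

# Chat completions against the alias are routed to the captive,
# in-process vLLM OpenAI-compatible server:
response = client.inference.chat_completion(
    model_id="granite-8b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)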

Test Plan

Existing tests in llama_stack/providers/tests/inference/test_text_inference.py have good coverage of the new functionality. These tests can be invoked as follows:

cd llama-stack && pytest \
    -vvv \
    llama_stack/providers/tests/inference/test_text_inference.py \
    --providers inference=vllm \
    --inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items                                                                               

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]

=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================

Sources

Before submitting

  • Ran pre-commit to handle lint / formatting issues.
  • Read the contributor guideline's Pull Request section.
  • Updated relevant documentation.
  • Wrote necessary unit or integration tests.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 26, 2025
ashwinb (Contributor) left a comment
Wonderful PR, thank you!

I have a few comments inline.

@frreiss frreiss requested a review from ehhuang as a code owner February 4, 2025 21:09
@frreiss frreiss changed the title from "Updated inline vllm inference provider" to "feat: updated inline vllm inference provider" on Feb 12, 2025
frreiss (Contributor, Author) commented Feb 12, 2025

All the code review comments should be addressed now. Would one of the project leads mind having another look at the current change set before I merge changes from main into this branch?

leseb (Contributor) left a comment

Great job!

leseb (Contributor) left a comment

Great job, thanks for addressing the comments! 🥳

yanxi0830 (Contributor) commented:
Thanks! Could you also help attach the results from unit tests in 'tests/client-sdk/inference' & 'tests/client-sdk/agents'?

ashwinb (Contributor) commented Feb 20, 2025

@frreiss sorry this PR has gone stale again, partly due to delays on our side of the review, and it now has a few non-trivial conflicts. I tried to resolve some of them, but it is not straightforward. If you could take a pass at it soon, we will be sure to merge it relatively quickly.

frreiss (Contributor, Author) commented Feb 21, 2025

@ashwinb sure, I'll have a look today. The work might spill over into Monday.

frreiss (Contributor, Author) commented Feb 21, 2025

Thanks! Could you also help attach the results from unit tests in 'tests/client-sdk/inference' & 'tests/client-sdk/agents'?

Results from tests/client-sdk/inference:

++ pytest -vvv tests/client-sdk/inference --inference-model meta-llama/Llama-3.2-3B-Instruct --embedding-model meta-llama/Llama-3.2-3B-Instruct
/mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.8.0-1019-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3', 'metadata': '3.1.1'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, asyncio-0.25.3, metadata-3.1.1
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 24 items                                                                                                                                                                                

tests/client-sdk/inference/test_embedding.py::test_embedding_text[meta-llama/Llama-3.2-3B-Instruct-list[string]] FAILED                                                                     [  4%]
tests/client-sdk/inference/test_embedding.py::test_embedding_text[meta-llama/Llama-3.2-3B-Instruct-list[text]] FAILED                                                                       [  8%]
tests/client-sdk/inference/test_embedding.py::test_embedding_image[meta-llama/Llama-3.2-3B-Instruct-list[url,base64]] SKIPPED (Media is not supported)                                      [ 12%]
tests/client-sdk/inference/test_embedding.py::test_embedding_image[meta-llama/Llama-3.2-3B-Instruct-list[url,string,base64,text]] SKIPPED (Media is not supported)                          [ 16%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_non_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                                              [ 20%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                                                  [ 25%]
tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.2-3B-Instruct] XFAIL (inline::vllm doesn't support log probs yet)             [ 29%]
tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.2-3B-Instruct] XFAIL (inline::vllm doesn't support log probs yet)                 [ 33%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output[meta-llama/Llama-3.2-3B-Instruct-completion-01] PASSED                                            [ 37%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.2-3B-Instruct-Which planet do humans live on?-Earth] PASSED                   [ 41%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.2-3B-Instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED [ 45%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.2-3B-Instruct-What's the name of the Sun in latin?-Sol] PASSED                    [ 50%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.2-3B-Instruct-What is the name of the US captial?-Washington] PASSED              [ 54%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                   [ 58%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[meta-llama/Llama-3.2-3B-Instruct] PASSED                                       [ 62%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[meta-llama/Llama-3.2-3B-Instruct] PASSED                                             [ 66%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[meta-llama/Llama-3.2-3B-Instruct] PASSED                                                 [ 70%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_structured_output[meta-llama/Llama-3.2-3B-Instruct-chat_completion-01] PASSED                                  [ 75%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[meta-llama/Llama-3.2-3B-Instruct-True] PASSED                                [ 79%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[meta-llama/Llama-3.2-3B-Instruct-False] PASSED                               [ 83%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] FAILED                                              [ 87%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] FAILED                                                  [ 91%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_base64[meta-llama/Llama-3.2-11B-Vision-Instruct-url] FAILED                                                 [ 95%]
tests/client-sdk/inference/test_vision_inference.py::test_image_chat_completion_base64[meta-llama/Llama-3.2-11B-Vision-Instruct-data] FAILED                                                [100%]

The embedding tests fail because the inline vLLM provider did not implement embeddings before this PR, and this PR does not add them.

The image inference tests fail due to some strange behavior in Llama Stack's handling of image input. Specifically, the current implementation of chat_completion_request_to_prompt(), when fed a request containing an image, converts that request to an array of tokens containing image placeholder tokens with token ID 128256. It then attempts to pass this token array to ChatFormat.encode_dialog_prompt(), which cannot encode token ID 128256. I presume this is a bug that will be fixed eventually.
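
As a hedged illustration of the failure mode (the names and the validation logic here are assumptions for illustration, not the actual Llama Stack code): the Llama 3 tokenizer's vocabulary covers token IDs 0 through 128255, so the image placeholder ID 128256 falls one past the end and cannot be encoded or decoded as ordinary text.

# Hypothetical sketch of the bug described above; function and variable
# names are stand-ins, not the real Llama Stack implementation.
VOCAB_SIZE = 128_256  # Llama 3 tokenizer: valid token IDs are 0..128255

def encode_dialog_prompt(token_ids: list[int]) -> list[int]:
    # Stand-in for the encoding step that trips over the placeholder.
    for tid in token_ids:
        if tid >= VOCAB_SIZE:
            raise ValueError(f"token ID {tid} is outside the vocabulary")
    return token_ids

# A chat request containing an image becomes a token array that includes
# the out-of-vocabulary image placeholder token:
prompt_tokens = [128000, 9125, 128256, 128009]  # header, role, <image>, eot
encode_dialog_prompt(prompt_tokens)  # raises ValueError on token ID 128256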

frreiss (Contributor, Author) commented Feb 21, 2025

Thanks! Could you also help attach the results from unit tests in 'tests/client-sdk/inference' & 'tests/client-sdk/agents'?

Results from tests/client-sdk/agents:

++ cd llama-stack
++ LLAMA_STACK_BASE_URL=http://localhost:5000/
++ pytest -vvv tests/client-sdk/agents --inference-model meta-llama/Llama-3.2-3B-Instruct --embedding-model meta-llama/Llama-3.2-3B-Instruct
/mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================== test session starts ====================================================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.8.0-1019-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3', 'metadata': '3.1.1'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, asyncio-0.25.3, metadata-3.1.1
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 10 items                                                                                                          

tests/client-sdk/agents/test_agents.py::test_agent_simple[meta-llama/Llama-3.2-3B-Instruct] PASSED                    [ 10%]
tests/client-sdk/agents/test_agents.py::test_tool_config[meta-llama/Llama-3.2-3B-Instruct] PASSED                     [ 20%]
tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search[meta-llama/Llama-3.2-3B-Instruct] FAILED         [ 30%]
tests/client-sdk/agents/test_agents.py::test_builtin_tool_code_execution[meta-llama/Llama-3.2-3B-Instruct] FAILED     [ 40%]
tests/client-sdk/agents/test_agents.py::test_code_interpreter_for_attachments[meta-llama/Llama-3.2-3B-Instruct] FAILED [ 50%]
tests/client-sdk/agents/test_agents.py::test_custom_tool[meta-llama/Llama-3.2-3B-Instruct] PASSED                     [ 60%]
tests/client-sdk/agents/test_agents.py::test_tool_choice[meta-llama/Llama-3.2-3B-Instruct] FAILED                     [ 70%]
tests/client-sdk/agents/test_agents.py::test_rag_agent[meta-llama/Llama-3.2-3B-Instruct] FAILED                       [ 80%]
tests/client-sdk/agents/test_agents.py::test_rag_and_code_agent[meta-llama/Llama-3.2-3B-Instruct] FAILED              [ 90%]
tests/client-sdk/agents/test_agents.py::test_create_turn_response[meta-llama/Llama-3.2-3B-Instruct] PASSED            [100%]

Tests involving web search, retrieval, and code execution fail because my environment doesn't have those tools. Other tests pass.

Labels: CLA Signed (managed by the Meta Open Source bot)
Projects: None yet
6 participants