Add FastAPI v1/completions/ endpoint #12101

Draft · wants to merge 18 commits into main from athitten/in-fw-eval-OAI-API
Conversation

@athitten (Collaborator) commented Feb 8, 2025

Important

The Update branch button should only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds a FastAPI v1/completions/ endpoint (nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py) that accepts OpenAI-style completion requests and forwards them to a model served via PyTriton.

Collection: llm

Changelog

  • Add specific line-by-line info of the high-level changes in this PR.

Usage

  • A usage example for the new endpoint is sketched below.
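A minimal usage sketch, assuming the server runs locally on port 8000 and that the request fields match the snippets discussed below (the exact payload is not confirmed by this PR):

```python
import requests

# Hypothetical call to the new endpoint; host, port, and payload fields
# are assumptions for illustration, not confirmed by this PR.
response = requests.post(
    "http://localhost:8000/v1/completions/",
    json={"prompt": "The capital of France is", "temperature": 1.0, "top_k": 1},
    timeout=30,
)
print(response.json()["choices"][0]["text"])
```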

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@athitten force-pushed the athitten/in-fw-eval-OAI-API branch 2 times, most recently from 63b2b15 to d09bf3c on February 11, 2025 03:23
temperature: float = 1.0
top_p: float = 0.0
top_k: int = 1
logProb: bool = True
Collaborator commented:
should be logprobs: int = 1

@@ -144,7 +144,7 @@ def query_llm(
"choices": [{"text": sentences}],
}
if log_probs_output is not None:
openai_response["log_probs"] = log_probs_output
openai_response["logprobs"] = log_probs_output
Collaborator commented:

These should be zipped with choices, and for each choice we should have:
{"logprobs": {"token_logprobs": [array_of_token_logprobs]}}

@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from c694f2c to 3502821 on February 11, 2025 23:28
# limitations under the License.

import os
from pathlib import Path

Check notice — Code scanning / CodeQL: Unused import

Import of 'Path' is not used.

Copilot Autofix (AI), 28 days ago:

To fix the problem, we need to remove the unused import statement. This will clean up the code and remove the unnecessary dependency, making the code easier to read and maintain.

  • Locate the import statement from pathlib import Path on line 12.
  • Remove this line from the file.
Suggested changeset: nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
--- a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
+++ b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
@@ -11,3 +11,2 @@
 import os
-from pathlib import Path
 
EOF
from nemo.collections.llm.deploy.base import chat_template

# Load the template
template = Template(chat_template)

Check warning — Code scanning / CodeQL: Jinja2 templating with autoescape=False (Medium)

Using jinja2 templates with autoescape=False can potentially allow XSS attacks.

Copilot Autofix (AI), 25 days ago:

To fix the problem, we need to ensure that the jinja2 environment used to render the template has autoescape enabled. This can be done by using the select_autoescape function when creating the Template object. This function will automatically enable escaping for HTML and XML files, which are the most common targets for XSS attacks.

We will modify the code to create a jinja2 environment with autoescape enabled and then use this environment to get the template and render it. This change will be made in the apply_chat_template function.

Suggested changeset: nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
--- a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
+++ b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
@@ -135,5 +135,9 @@
     from nemo.collections.llm.deploy.base import chat_template
+    from jinja2 import Environment, select_autoescape
+
+    # Create a jinja2 environment with autoescape enabled
+    env = Environment(autoescape=select_autoescape(['html', 'xml']))
 
     # Load the template
-    template = Template(chat_template)
+    template = env.from_string(chat_template)
 
EOF
temperature: float = 1.0
top_p: float = 0.0
top_k: int = 1
logprobs: int = 1
@marta-sd commented Feb 21, 2025:

Maybe we can set the default to 0? I'm not sure if this is in the official API spec, but lm-eval-harness assumes that 0 is the default and that 1 needs to be passed (here).

Edit: I checked and it actually should be null by default (see here).
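A minimal sketch of the request model with that default (the class name is hypothetical; the other defaults are copied from the snippet above):

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequest(BaseModel):  # hypothetical name, for illustration only
    temperature: float = 1.0
    top_p: float = 0.0
    top_k: int = 1
    # Per the OpenAI spec, logprobs defaults to null (None), so clients such
    # as lm-eval-harness must pass an explicit value to enable it.
    logprobs: Optional[int] = None
```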

@athitten force-pushed the athitten/in-fw-eval-OAI-API branch 2 times, most recently from 5433d31 to 1324431 on February 27, 2025 00:28
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 7f3a900 to 40a0845 on March 1, 2025 00:07
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 46078aa to fca66a4 on March 5, 2025 06:21
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from ce1a81f to 48d422b on March 5, 2025 19:50
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 78aee07 to 436e676 on March 6, 2025 04:14
Signed-off-by: Abhishree <[email protected]>
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 9b65e79 to 04b80eb on March 6, 2025 19:49
Signed-off-by: Abhishree <[email protected]>
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 9e9609c to c52f979 on March 6, 2025 20:04
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 3c3900f to d00593f on March 6, 2025 21:41
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from ceb0839 to 6e715fe on March 7, 2025 01:09
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 2d5ed5b to 9e93a89 on March 11, 2025 01:50
prompts = request.prompt
if not isinstance(request.prompt, list):
prompts = [request.prompt]
output = nq.query_llm(


@marta-sd I looked at it; it seems to me that the call will go through pytriton.ModelClient.infer_batch instead of pytriton.AsyncioModelClient and will block. See https://github.com/NVIDIA/NeMo/pull/12101/files#diff-f70646f35e4a50b01c01caf162262447c66f8f54e3b1a582e9da8ff080fc5b48R128-R129. ModelClient.infer_batch is a synchronous operation.
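One way to keep the event loop responsive, sketched under the assumption that nq.query_llm wraps the synchronous ModelClient.infer_batch (switching to pytriton's AsyncioModelClient would be the more direct fix):

```python
import asyncio

async def completions_endpoint(request):  # sketch of the handler body
    # `nq` is the query client assumed from the snippet above.
    prompts = request.prompt if isinstance(request.prompt, list) else [request.prompt]
    # Offload the blocking pytriton call to a worker thread so the FastAPI
    # event loop can keep serving other requests while inference runs.
    output = await asyncio.to_thread(
        nq.query_llm,
        prompts=prompts,
        temperature=request.temperature,
        top_p=request.top_p,
        top_k=request.top_k,
    )
    return output
```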

logging.info(f"Attempting to connect to Triton server at: {triton_url}")
print("---triton_url---", triton_url)
try:
response = requests.get(triton_url, timeout=5)
@agronskiy commented Mar 11, 2025:

nit: this might block too; it's recommended to use aiohttp instead of requests inside async functions.
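A non-blocking variant of the health check, assuming the same triton_url and 5-second timeout as the snippet above:

```python
import aiohttp

async def triton_is_healthy(triton_url: str) -> bool:
    # aiohttp yields to the event loop while waiting for the response,
    # unlike requests.get, which would block every other coroutine.
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(triton_url) as response:
            return response.status == 200
```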


This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label on Mar 26, 2025