Add FastAPI v1/completions/ endpoint #12101

Draft · wants to merge 18 commits into main from athitten/in-fw-eval-OAI-API
Conversation

@athitten (Collaborator) commented Feb 8, 2025

Important

The Update branch button should only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds a FastAPI v1/completions/ endpoint (nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py) that accepts OpenAI-style completion requests and forwards them to a model served via PyTriton.

Collection: llm

Changelog

  • Add specific line-by-line info of the high-level changes in this PR.

Usage

  • A usage example for the new endpoint is sketched below.
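A minimal usage sketch, assuming the server runs locally on port 8000 and that the request fields match the snippets discussed below (the exact payload is not confirmed by this PR):

```python
import requests

# Hypothetical call to the new endpoint; host, port, and payload fields
# are assumptions for illustration, not confirmed by this PR.
response = requests.post(
    "http://localhost:8000/v1/completions/",
    json={"prompt": "The capital of France is", "temperature": 1.0, "top_k": 1},
    timeout=30,
)
print(response.json()["choices"][0]["text"])
```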

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@athitten force-pushed the athitten/in-fw-eval-OAI-API branch 2 times, most recently from 63b2b15 to d09bf3c on February 11, 2025 03:23
temperature: float = 1.0
top_p: float = 0.0
top_k: int = 1
logProb: bool = True
Collaborator commented:
should be logprobs: int = 1

@@ -144,7 +144,7 @@ def query_llm(
"choices": [{"text": sentences}],
}
if log_probs_output is not None:
openai_response["log_probs"] = log_probs_output
openai_response["logprobs"] = log_probs_output
Collaborator commented:

These should be zipped with choices, and for each choice we should have:
{"logprobs": {"token_logprobs": [array_of_token_logprobs]}}

@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from c694f2c to 3502821 on February 11, 2025 23:28
# limitations under the License.

import os
from pathlib import Path

Check notice — Code scanning / CodeQL: Unused import

Import of 'Path' is not used.

Copilot Autofix (AI), 28 days ago:

To fix the problem, we need to remove the unused import statement. This will clean up the code and remove the unnecessary dependency, making the code easier to read and maintain.

  • Locate the import statement from pathlib import Path on line 12.
  • Remove this line from the file.
Suggested changeset: nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
--- a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
+++ b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
@@ -11,3 +11,2 @@
 import os
-from pathlib import Path
 
EOF
from nemo.collections.llm.deploy.base import chat_template

# Load the template
template = Template(chat_template)

Check warning — Code scanning / CodeQL: Jinja2 templating with autoescape=False (Medium)

Using jinja2 templates with autoescape=False can potentially allow XSS attacks.

Copilot Autofix (AI), 25 days ago:

To fix the problem, we need to ensure that the jinja2 environment used to render the template has autoescape enabled. This can be done by using the select_autoescape function when creating the Template object. This function will automatically enable escaping for HTML and XML files, which are the most common targets for XSS attacks.

We will modify the code to create a jinja2 environment with autoescape enabled and then use this environment to get the template and render it. This change will be made in the apply_chat_template function.

Suggested changeset: nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
--- a/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
+++ b/nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
@@ -135,5 +135,9 @@
     from nemo.collections.llm.deploy.base import chat_template
+    from jinja2 import Environment, select_autoescape
+
+    # Create a jinja2 environment with autoescape enabled
+    env = Environment(autoescape=select_autoescape(['html', 'xml']))
 
     # Load the template
-    template = Template(chat_template)
+    template = env.from_string(chat_template)
 
EOF
temperature: float = 1.0
top_p: float = 0.0
top_k: int = 1
logprobs: int = 1
@marta-sd commented Feb 21, 2025:

Maybe we can set the default to 0? I'm not sure if this is in the official API spec, but lm-eval-harness assumes that 0 is the default and that 1 needs to be passed (here).

Edit: I checked and it actually should be null by default (see here).
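A minimal sketch of the request model with that default (the class name is hypothetical; the other defaults are copied from the snippet above):

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequest(BaseModel):  # hypothetical name, for illustration only
    temperature: float = 1.0
    top_p: float = 0.0
    top_k: int = 1
    # Per the OpenAI spec, logprobs defaults to null (None), so clients such
    # as lm-eval-harness must pass an explicit value to enable it.
    logprobs: Optional[int] = None
```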

@athitten force-pushed the athitten/in-fw-eval-OAI-API branch 2 times, most recently from 5433d31 to 1324431 on February 27, 2025 00:28
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 7f3a900 to 40a0845 on March 1, 2025 00:07
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 46078aa to fca66a4 on March 5, 2025 06:21
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from ce1a81f to 48d422b on March 5, 2025 19:50
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 78aee07 to 436e676 on March 6, 2025 04:14
Signed-off-by: Abhishree <[email protected]>
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 9b65e79 to 04b80eb on March 6, 2025 19:49
Signed-off-by: Abhishree <[email protected]>
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 9e9609c to c52f979 on March 6, 2025 20:04
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 3c3900f to d00593f on March 6, 2025 21:41
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from ceb0839 to 6e715fe on March 7, 2025 01:09
@athitten force-pushed the athitten/in-fw-eval-OAI-API branch from 2d5ed5b to 9e93a89 on March 11, 2025 01:50
prompts = request.prompt
if not isinstance(request.prompt, list):
prompts = [request.prompt]
output = nq.query_llm(


@marta-sd I looked at it; it seems to me that the call will go through pytriton.ModelClient.infer_batch instead of pytriton.AsyncioModelClient and will block. See https://github.com/NVIDIA/NeMo/pull/12101/files#diff-f70646f35e4a50b01c01caf162262447c66f8f54e3b1a582e9da8ff080fc5b48R128-R129. ModelClient.infer_batch is a synchronous operation.
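One way to keep the event loop responsive, sketched under the assumption that nq.query_llm wraps the synchronous ModelClient.infer_batch (switching to pytriton's AsyncioModelClient would be the more direct fix):

```python
import asyncio

async def completions_endpoint(request):  # sketch of the handler body
    # `nq` is the query client assumed from the snippet above.
    prompts = request.prompt if isinstance(request.prompt, list) else [request.prompt]
    # Offload the blocking pytriton call to a worker thread so the FastAPI
    # event loop can keep serving other requests while inference runs.
    output = await asyncio.to_thread(
        nq.query_llm,
        prompts=prompts,
        temperature=request.temperature,
        top_p=request.top_p,
        top_k=request.top_k,
    )
    return output
```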

logging.info(f"Attempting to connect to Triton server at: {triton_url}")
print("---triton_url---", triton_url)
try:
response = requests.get(triton_url, timeout=5)
@agronskiy commented Mar 11, 2025:

nit: this might block too; it's recommended to use aiohttp instead of requests inside async functions.
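A non-blocking variant of the health check, assuming the same triton_url and 5-second timeout as the snippet above:

```python
import aiohttp

async def triton_is_healthy(triton_url: str) -> bool:
    # aiohttp yields to the event loop while waiting for the response,
    # unlike requests.get, which would block every other coroutine.
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(triton_url) as response:
            return response.status == 200
```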


This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions bot added the stale label on Mar 26, 2025