Add FastAPI v1/completions/ endpoint #12101

Draft · wants to merge 18 commits into base: main
Changes from 1 commit
Add http to triton url and logging fix
Signed-off-by: Abhishree <[email protected]>
athitten committed Feb 28, 2025
commit 2e104825f96768f327d4cb8378c1a67c656b2657
5 changes: 3 additions & 2 deletions nemo/collections/llm/deploy/fastapi_interface_to_pytriton.py
@@ -71,6 +71,7 @@ async def check_triton_health():
f"http://{triton_settings.triton_service_ip}:{str(triton_settings.triton_service_port)}/v2/health/ready"
)
logging.info(f"Attempting to connect to Triton server at: {triton_url}")
print("---triton_url---", triton_url)
try:
response = requests.get(triton_url, timeout=5)

@agronskiy Mar 11, 2025


nit: this might get blocking too; it's recommended to use aiohttp instead of requests inside async functions.
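The suggestion above can be sketched as follows. This is a minimal, hedged example of a non-blocking health check built with aiohttp rather than requests; the URL shape mirrors the one in the diff, but the `host`/`port` defaults are placeholders, not values from the PR:

```python
import asyncio

import aiohttp


async def check_triton_health(host: str = "localhost", port: int = 8000) -> bool:
    # Same readiness URL shape as the diff builds with requests.
    triton_url = f"http://{host}:{port}/v2/health/ready"
    timeout = aiohttp.ClientTimeout(total=5)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            # Awaiting the GET yields control back to the event loop,
            # so other coroutines keep running while we wait.
            async with session.get(triton_url) as response:
                return response.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False
```

With `requests.get`, the same call would hold the event loop for up to the full timeout; here it only occupies a coroutine.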

if response.status_code == 200:
@@ -85,7 +86,7 @@ async def check_triton_health():
def completions_v1(request: CompletionRequest):
try:
print("---hello----")
url = triton_settings.triton_service_ip + ":" + str(triton_settings.triton_service_port)
url = f"http://{triton_settings.triton_service_ip}:{triton_settings.triton_service_port}"
nq = NemoQueryLLMPyTorch(url=url, model_name=request.model)
print("---request----", request)
output = nq.query_llm(


@marta-sd I looked at it; it seems to me that the call stack will go through pytriton.ModelClient.infer_batch instead of pytriton.AsyncioModelClient and will block. See https://github.com/NVIDIA/NeMo/pull/12101/files#diff-f70646f35e4a50b01c01caf162262447c66f8f54e3b1a582e9da8ff080fc5b48R128-R129. ModelClient.infer_batch is a synchronous operation.

@@ -102,5 +103,5 @@ def completions_v1(request: CompletionRequest):
"output": output[0][0],
}
except Exception as error:
logging.error("An exception occurred with the post request to /v1/completions/ endpoint:", error)
logging.error(f"An exception occurred with the post request to /v1/completions/ endpoint: {error}")
return {"error": "An exception occurred"}
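The logging change in the last hunk fixes a real pitfall: passing the exception as a bare second argument hands `logging` a %-format argument with no placeholder in the message, so the record is mangled. A small sketch of the broken call and two working alternatives (the message text here is illustrative):

```python
import logging

logging.basicConfig(level=logging.ERROR)

try:
    raise ValueError("triton unreachable")
except Exception as error:
    # Broken: `error` becomes a %-format argument, but the message has
    # no %s placeholder, so formatting fails at emit time.
    # logging.error("An exception occurred:", error)

    # Fixed as in the PR: eager f-string interpolation.
    logging.error(f"An exception occurred: {error}")

    # Equivalent lazy form: formatting is deferred until the record
    # is actually emitted.
    logging.error("An exception occurred: %s", error)
```

The `%s` form (or `logging.exception(...)` inside an `except` block, which also records the traceback) is the conventional stdlib style.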