Add docs on decoupled final response feature #5936

Merged · 9 commits · Jun 20, 2023
6 changes: 4 additions & 2 deletions docs/protocol/extension_parameters.md
@@ -46,8 +46,10 @@ used as custom parameters:
- timeout
- sequence_start
- sequence_end
- All the keys that start with "triton_" prefix.
- headers
- All the keys that start with `"triton_"` prefix. Some examples used today:
- `"triton_enable_empty_final_response"` request parameter
- `"triton_final_response"` response parameter
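
For illustration, a minimal sketch of sending a custom (non-reserved) parameter from the Python GRPC client. This assumes a tritonclient version whose `infer()` accepts a `parameters` argument; the model name and `inputs` are placeholders:

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
# "my_key" is a legal custom parameter; reserved keys such as "timeout",
# "headers", or anything prefixed with "triton_" must not be used here.
result = client.infer(
    model_name="my_model",
    inputs=inputs,  # list of grpcclient.InferInput, prepared elsewhere
    parameters={"my_key": "my_value"},
)
```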

When using both GRPC and HTTP endpoints, you need to make sure not to use
any of the reserved parameters, to avoid unexpected behavior. The reserved
@@ -90,7 +92,7 @@ ModelInferRequest message can be used to send custom parameters.
Triton can forward HTTP/GRPC headers as inference request parameters. By
specifying a regular expression in `--http-header-forward-pattern` and
`--grpc-header-forward-pattern`,
Triton will add the headers that match with the regular experession as request
Triton will add the headers that match with the regular expression as request
parameters. All the forwarded headers will be added as a parameter with string
value. For example to forward all the headers that start with 'PREFIX_' from
both HTTP and GRPC, you should add `--http-header-forward-pattern PREFIX_.*
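
To make the forwarding concrete, a sketch of a client request whose header would match that pattern; it assumes the server was started with the two flags above, and the model name and `inputs` are placeholders:

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
# A header matching the forward pattern is attached to the inference
# request as a string-valued parameter that the model/backend can read.
result = client.infer(
    model_name="my_model",
    inputs=inputs,  # prepared elsewhere
    headers={"PREFIX_user_id": "42"},
)
```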
35 changes: 33 additions & 2 deletions docs/user_guide/decoupled_models.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -87,10 +87,41 @@ exactly one response per request. Even standard ModelInfer RPC in the GRPC endpoint
does not support decoupled responses. In order to run inference on a decoupled
model, the client must use the bi-directional streaming RPC. See
[here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto)
for more details. The [decoupled_test.py](https://github.com/triton-inference-server/server/blob/main/qa/L0_decoupled/decoupled_test.py) demonstrates
for more details. The [decoupled_test.py](../../qa/L0_decoupled/decoupled_test.py) demonstrates
how gRPC streaming can be used to run inference on decoupled models.

If using [Triton's in-process C API](../customization_guide/inference_protocols.md#in-process-triton-server-api),
your application should be cognizant that the callback function you registered with
`TRITONSERVER_InferenceRequestSetResponseCallback` can be invoked any number of times,
each time with a new response. You can take a look at [grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc_server.cc)
for a concrete example of handling such callbacks.
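
A rough C sketch of that contract (illustrative only, not the actual grpc_server.cc code; `irequest`, `allocator`, and `state` are assumed to have been created through the corresponding `TRITONSERVER_*` calls):

```c
#include <stdio.h>
#include "tritonserver.h"  // Triton in-process C API

// For a decoupled model this callback may fire any number of times: once
// per response, and a final time with TRITONSERVER_RESPONSE_COMPLETE_FINAL
// set (possibly with a NULL response if nothing else is left to deliver).
static void
InferResponseComplete(
    TRITONSERVER_InferenceResponse* response, const uint32_t flags,
    void* userp)
{
  if (response != NULL) {
    // ... consume the response outputs here, then release the response ...
    TRITONSERVER_InferenceResponseDelete(response);
  }
  if (flags & TRITONSERVER_RESPONSE_COMPLETE_FINAL) {
    printf("no more responses will arrive for this request\n");
  }
}

// Registration, done before TRITONSERVER_ServerInferAsync():
//   TRITONSERVER_InferenceRequestSetResponseCallback(
//       irequest, allocator, state /* allocator userp */,
//       InferResponseComplete, state /* response userp */);
```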

### Knowing When a Decoupled Inference Request is Complete

An inference request is considered complete when a response containing the
`TRITONSERVER_RESPONSE_COMPLETE_FINAL` flag is received from a model/backend.

1. Client applications using streaming GRPC can access this information by
checking the response parameters for the `"triton_final_response"` parameter.
Depending on how the model/backend is designed, a decoupled model may not
send a response for every request. In those cases where the backend sends no
response, the streaming GRPC client can opt in to receive an empty final
response for each request, as shown in the snippet below. By default, empty
final responses are not sent, to save network traffic.

```python
# Example of a streaming GRPC client opting in to empty final responses.
# Assumes `client` is a tritonclient.grpc.InferenceServerClient with an
# active stream (client.start_stream(callback=...)) and that `inputs` is
# the usual list of InferInput objects.
client.async_stream_infer(
    model_name="my_decoupled_model",  # placeholder name
    inputs=inputs,
    enable_empty_final_response=True,
)
```

2. Client applications using the C API can check the
`TRITONSERVER_RESPONSE_COMPLETE_FINAL` flag directly in their response
handling/callback logic, as in the C sketch earlier in this section.

The [decoupled_test.py](../../qa/L0_decoupled/decoupled_test.py) test
demonstrates how to opt in through the streaming GRPC Python client API
and how to programmatically identify when a final response is received
through the `"triton_final_response"` response parameter.
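
For reference, a minimal sketch of that parameter check in a stream callback. It assumes the tritonclient.grpc callback signature of `(result, error)` and is not the test's actual code:

```python
def stream_callback(result, error):
    # Callback registered via client.start_stream(callback=stream_callback).
    if error is not None:
        print(error)
        return
    response = result.get_response()
    # With enable_empty_final_response=True, the final (possibly empty)
    # response for each request carries "triton_final_response" = true.
    params = response.parameters
    if "triton_final_response" in params:
        if params["triton_final_response"].bool_param:
            print(f"request {response.id} received its final response")
```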