Add docs on decoupled final response feature #5936

Merged · 9 commits · Jun 20, 2023
94 changes: 92 additions & 2 deletions in docs/user_guide/decoupled_models.md
<!--
# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
exactly one response per request. Even the standard ModelInfer RPC in the GRPC endpoint
does not support decoupled responses. In order to run inference on a decoupled
model, the client must use the bi-directional streaming RPC. See
[here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto)
for more details. The [decoupled_test.py](../../qa/L0_decoupled/decoupled_test.py) demonstrates
how gRPC streaming can be used to infer decoupled models.
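
As a rough sketch (not taken from decoupled_test.py), the snippet below shows the
general shape of streaming inference with the Python gRPC client library; the model
name `repeat_int32` and input tensor name `IN` are placeholders for whatever
decoupled model you deploy.

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

# Every response (or error) on the stream is delivered through this callback.
def callback(result_queue, result, error):
    result_queue.put((result, error))

results = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=partial(callback, results))

inputs = [grpcclient.InferInput("IN", [1], "INT32")]
inputs[0].set_data_from_numpy(np.array([4], dtype=np.int32))

# A decoupled model may answer this single request with zero, one, or many responses.
client.async_stream_infer(model_name="repeat_int32", inputs=inputs)

result, error = results.get()  # block until the first response (or error) arrives
client.stop_stream()
client.close()
```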

If using [Triton's in-process C API](../customization_guide/inference_protocols.md#in-process-triton-server-api),
your application should be aware that the callback function registered with
`TRITONSERVER_InferenceRequestSetResponseCallback` can be invoked any number of times,
each time with a new response. You can take a look at [grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc_server.cc)
for an example of how the gRPC frontend handles this.

### Knowing When a Decoupled Inference Request is Complete

A request is considered complete when a response containing the
`TRITONSERVER_RESPONSE_COMPLETE_FINAL` flag is received. For decoupled models,
there are two ways this can happen. The model/backend calls one of the
following [TRITONBACKEND APIs](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h):
1. `TRITONBACKEND_ResponseSend(response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, ...)`
2. `TRITONBACKEND_ResponseFactorySendFlags(factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL)`

As described in the
[backend repo](https://github.com/triton-inference-server/backend/blob/main/README.md#special-cases)
for decoupled models:

> If the backend should not send any more responses for the request,
> `TRITONBACKEND_ResponseFactorySendFlags` can be used to send
> `TRITONSERVER_RESPONSE_COMPLETE_FINAL` using the ResponseFactory.

In the `TRITONBACKEND_ResponseFactorySendFlags` case, only the `flags` are
communicated back to the frontend to update some internal state, and there
is no actual Inference Response sent back along with the flags. The default
behavior in this case is not to send anything back to the client, as there
is no response to send.

In some cases, this default behavior has proved to be significantly more performant.
For example, take a decoupled model with an `N` request -> `1` response structure.
For each of the first `N-1` requests, the model sends zero responses back
by using `TRITONBACKEND_ResponseFactorySendFlags` as described above, and likely
updates some internal state. Finally, on the `N`th request, the
model is ready to send a response.
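
For illustration only, here is a minimal sketch of this `N` request -> `1` response
pattern written against the Python backend's decoupled API, which wraps the two
TRITONBACKEND calls above. The tensor names, the value of `N`, and the accumulation
logic are assumptions, and such a model would also need `decoupled: true` set in its
`model_transaction_policy` config.

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self._pending = []  # hypothetical state accumulated across requests
        self._n = 4         # hypothetical number of requests per real response

    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            value = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()
            self._pending.append(value)

            if len(self._pending) < self._n:
                # "Zero responses" for this request: send only the FINAL flag,
                # the Python-backend analogue of TRITONBACKEND_ResponseFactorySendFlags.
                sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
            else:
                # On the Nth request, send a real response together with the FINAL
                # flag, the analogue of TRITONBACKEND_ResponseSend(..., FINAL, ...).
                out = pb_utils.Tensor("OUT", np.concatenate(self._pending))
                response = pb_utils.InferenceResponse(output_tensors=[out])
                sender.send(response,
                            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
                self._pending = []
        # Decoupled models return None; all responses go through the senders above.
        return None
```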

If the client is written to be aware of the model's expected behavior, this
saves resources and avoids network contention, because the `N-1` "empty"
completions never have to be communicated back to the client; the client simply
waits for the single non-empty response expected at the end, on request `N`.

However, there are cases where a user may want to write a client that can
generically handle any model, without knowing implementation details about it.
Similarly, there are cases where the number of responses a model will send
is unknown beforehand, so the client needs a programmatic way to know when
the final response for a given request has been received. A common case for
this is a language model with a `1` request -> `N` response structure.

To handle this case, Triton exposes a boolean `"triton_final_response"` response
parameter that indicates to the client whether a given response is the final response
for the associated request/response ID when communicating with decoupled models.

> **NOTE**
> This response parameter is only provided for `decoupled` models at this time.
> Since every response will be the final response for non-decoupled models,
> this would be redundant to communicate.
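
For example, a streaming-client callback might inspect this parameter on each
response as sketched below; the callback itself is hypothetical, but the parameter
name comes from this documentation and the fields used are those of the gRPC
`ModelInferResponse` protobuf returned by `InferResult.get_response()`.

```python
def stream_callback(result, error):
    if error is not None:
        print(f"stream error: {error}")
        return

    response = result.get_response()  # underlying ModelInferResponse protobuf
    is_final = ("triton_final_response" in response.parameters
                and response.parameters["triton_final_response"].bool_param)
    if is_final:
        # No more responses will arrive for this request ID.
        print(f"request {response.id}: final response received")
```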


#### TRITONBACKEND_ResponseSend

When a final response is sent via
`TRITONBACKEND_ResponseSend(response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, ...)`,
an actual response is sent to the frontend by the backend/model, so this response
parameter will be included in the response back to the client by default.

#### TRITONBACKEND_ResponseFactorySendFlags

When a final response is sent via
`TRITONBACKEND_ResponseFactorySendFlags(factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL)`,
no response is sent to the frontend by the backend/model, so nothing will be sent
back to the client by default. For the client to receive a response containing
this "final" signal via the `"triton_final_response"` parameter, the client
will have to opt in through the client library.

To opt in through the Python client library, the `enable_empty_final_response` arg
should be set when calling `async_stream_infer(..., enable_empty_final_response=True)`.

> **NOTE**
> The `enable_empty_final_response` argument is only exposed in
> the `async_stream_infer` method at this time, since this feature is only
> needed for `decoupled` models.
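
Reusing the client and callback from the sketches above, a minimal opt-in might look
like the following; the model name, tensor name, and request ID are placeholders.
With the opt-in, a flags-only completion from the model arrives at the client as an
empty response (no output tensors) whose purpose is to carry the
`"triton_final_response"` parameter.

```python
client.start_stream(callback=stream_callback)

inputs = [grpcclient.InferInput("IN", [1], "INT32")]
inputs[0].set_data_from_numpy(np.array([1], dtype=np.int32))

client.async_stream_infer(
    model_name="square_int32",          # placeholder decoupled model
    inputs=inputs,
    request_id="req-0",
    enable_empty_final_response=True,   # opt in to empty "final" responses
)
```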

The [decoupled_test.py](../../qa/L0_decoupled/decoupled_test.py)
demonstrates an example of using this opt-in arg and programmatically identifying
when a final response is received through the `"triton_final_response"`
response parameter.

If using
[Triton's in-process C API](../customization_guide/inference_protocols.md#in-process-triton-server-api)
instead of the GRPC frontend,
then your application is responsible for identifying when the final
response associated with a request has been received, such as by checking for
the `TRITONSERVER_RESPONSE_COMPLETE_FINAL` response flag mentioned above.