Add documentation
krishung5 committed Nov 7, 2023
1 parent 8ab6186 commit 5c007f8
Showing 1 changed file (README.md) with 95 additions and 0 deletions.
@@ -50,6 +50,7 @@ any C++ code.
- [Decoupled mode](#decoupled-mode)
- [Use Cases](#use-cases)
- [Known Issues](#known-issues)
- [Request Rescheduling](#request-rescheduling)
- [`finalize`](#finalize)
- [Model Config File](#model-config-file)
- [Inference Request Parameters](#inference-request-parameters)
@@ -623,6 +624,100 @@ for more details on how to host a decoupled model.

* Currently, decoupled Python models cannot make async infer requests.

#### Request Rescheduling

Starting from 23.11, the Python backend supports request rescheduling. By
calling the `set_release_flags` function on the request object with the flag
`pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE`, you can reschedule the
request for further execution in a future batch. This feature is useful for
handling generative sequences.

To use the request rescheduling API, the model config must enable generative
sequence batching:

```
sequence_batching {
  generative_sequence : true
}
```
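
For context, below is a sketch of how that fragment might sit in a complete
`config.pbtxt`. The model name, `max_batch_size`, and tensor definitions are
illustrative assumptions, not requirements of the rescheduling API:

```
# Hypothetical config.pbtxt sketch; everything other than the
# sequence_batching block is a placeholder.
name: "rescheduling_model"
backend: "python"
max_batch_size: 8
sequence_batching {
  generative_sequence : true
}
input [ { name: "IN", data_type: TYPE_INT32, dims: [ 1 ] } ]
output [ { name: "OUT", data_type: TYPE_INT32, dims: [ 1 ] } ]
```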

For non-decoupled models, a `None` object must be appended to the response list
for the rescheduled request. For example:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...

    def execute(self, requests):
        responses = []

        for request in requests:
            # Explicitly reschedule the first request.
            if self.idx == 0:
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                )
                # A rescheduled request must get a `None` entry in the
                # response list instead of an InferenceResponse.
                responses.append(None)
                self.idx += 1
            else:
                # `inference_response` stands for a pb_utils.InferenceResponse
                # built from the model's outputs (construction elided here).
                responses.append(inference_response)

        return responses
```
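
The example relies on `self.idx` being set up in the elided part of the class
(the `...` above). Purely as an illustration, one possible `initialize` sketch:

```python
    def initialize(self, args):
        # Hypothetical counter used above to decide which request gets
        # rescheduled; not part of the rescheduling API itself.
        self.idx = 0
```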

For decoupled models, a request must be rescheduled *before* returning from
the `execute` function.
Below is an example of a decoupled model using request rescheduling. The model
takes one input tensor, an INT32 [ 1 ] input named "IN", and produces an output
tensor "OUT" with the same shape as the input tensor. The input value indicates
the total number of responses to be generated and the output value indicates
the number of remaining responses. For example, if the request input has value
2, the model will:
- Send a response with value 1.
- Release the request with the RESCHEDULE flag.
- When `execute` is called again on the rescheduled request, send the last
  response with value 0.
- Release the request with the ALL flag.

```python
import numpy as np

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...

    def execute(self, requests):
        for request in requests:
            in_input = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()

            # On the first pass for this sequence, read the total number of
            # responses to generate from the input tensor.
            if self.reset_flag:
                self.remaining_response = in_input[0]
                self.reset_flag = False

            response_sender = request.get_response_sender()

            self.remaining_response -= 1

            out_output = pb_utils.Tensor(
                "OUT", np.array([self.remaining_response], np.int32)
            )
            response = pb_utils.InferenceResponse(output_tensors=[out_output])

            if self.remaining_response <= 0:
                # Last response for this sequence: mark it final and let the
                # request be released with the default (ALL) flag.
                response_sender.send(
                    response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL
                )
                self.reset_flag = True
            else:
                # More responses remain: reschedule the request so it comes
                # back in a future batch, then send this intermediate response.
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                )
                response_sender.send(response)

        return None
```
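
Similarly, this example assumes `self.reset_flag` and `self.remaining_response`
are initialized in the elided part of the class; one possible `initialize`
sketch, again only as an illustration:

```python
    def initialize(self, args):
        # Hypothetical starting state: the first execute() call reads the
        # total response count from the "IN" input before counting down.
        self.reset_flag = True
        self.remaining_response = 0
```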

### `finalize`

Implementing `finalize` is optional. This function allows you to do any clean
