This repository contains the Stateful Backend for Triton Inference Server. You can learn more about backends in the backend repo. Ask questions or report problems in the main Triton issues page.
The backend code automatically manages the input and output states of a model. The states are associated with a sequence id and are tracked across all inference requests belonging to that sequence id. The backend currently handles ONNX models; however, it can be extended to other model types.
The example stateful model above has matching input and output tensors representing the model states. An output state tensor for a sequence id is passed as the input state during the next inference execution of the same sequence id. Therefore, the state tensors do not need to be communicated between server and client and can be kept in GPU (or CPU) memory for GPU (or CPU) execution. The backend stores the state tensors for all active sequences in GPU (or CPU) memory and passes the stored state tensors as model inputs whenever the sequence id associated with them has an inference request.
The state tensors are declared in the model configuration file through the `state_pairs` entry of the `parameters` section. For the example model in `models/accumulate_fp32`, the state tensor input and output pairs are specified as below:

```
{
  key: "state_pairs"
  value: { string_value: "<<<Accumulate_In, Accumulate_Out>>>" }
}
```

In general, each state pair must be surrounded by 3 pairs of angle brackets and the state pairs must be separated by a space `' '`, e.g. `"<<<State_In_1, State_Out_1>>> <<<State_In_2, State_Out_2>>> ..."`.
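As an illustration only (this is not the backend's actual parser, just a sketch of how the format decomposes), a string with two state pairs can be split like this:

```python
import re

# Hypothetical example string following the format above.
state_pairs = "<<<State_In_1, State_Out_1>>> <<<State_In_2, State_Out_2>>>"

# Each pair is wrapped in three pairs of angle brackets; pairs are space-separated.
pairs = re.findall(r"<<<\s*([^,>]+?)\s*,\s*([^,>]+?)\s*>>>", state_pairs)
print(pairs)  # [('State_In_1', 'State_Out_1'), ('State_In_2', 'State_Out_2')]
```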
During the model instance initialization, the stateful backend reserves GPU (or CPU) memory as large as `max_candidate_sequences * sum_of_all_state_tensor_sizes` to store the model state tensors.
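For a rough sense of scale, here is a purely illustrative calculation; the tensor size and sequence count below are made-up numbers, not values taken from this repository:

```python
# Hypothetical model: one FP32 state tensor with 128 elements per sequence.
max_candidate_sequences = 1280
state_elements_per_sequence = 128
bytes_per_element = 4  # FP32

reserved_bytes = max_candidate_sequences * state_elements_per_sequence * bytes_per_element
print(f"Reserved state buffer: {reserved_bytes / 1024:.1f} KiB")  # 640.0 KiB
```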
Run:

```bash
python3 ./build.py
```

The backend binary will be produced in the `build/install/backends` directory.
Alternatively, you can do the following steps to build manually:

- Build the custom docker image which we will use to build the backend:

  ```bash
  NGC_VERSION=$(head -1 ./NGC_VERSION) # read the container version to use
  docker build --tag triton-${NGC_VERSION}-backend -f docker/Dockerfile.backend --build-arg BASE_IMAGE_VERSION=${NGC_VERSION} .
  ```

- Create a container from the previously built image:

  ```bash
  docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v${PWD}:/workspace/stateful triton-${NGC_VERSION}-backend
  ```

- Inside the container, run the following:

  ```bash
  mkdir -p /workspace/stateful/build && cd /workspace/stateful/build
  cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
  make -j
  make install
  ```
- After building the backend, run the Triton server docker image and copy the backend files to the Triton backends folder. Delete the existing onnxruntime backend and set the `LD_LIBRARY_PATH` variable:

  ```bash
  NGC_VERSION=$(head -1 ./NGC_VERSION) # read the container version to use
  docker run --gpus all -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8005:8005 -p8006:8006 -p8007:8007 -v${PWD}:/workspace/stateful_backend nvcr.io/nvidia/tritonserver:${NGC_VERSION}-py3
  rm -rf /opt/tritonserver/backends/onnxruntime  # Remove existing ORT backend to avoid having two copies
  cp -R /workspace/stateful_backend/build/install/backends/stateful ./backends/  # Copy the stateful backend
  export LD_LIBRARY_PATH=/workspace/stateful_backend/build/custom-ort/lib/  # Add ORT to the LD_LIBRARY_PATH
  tritonserver --grpc-port 8005 --model-repository /workspace/stateful_backend/models/  # Run the Triton inference server
  ```
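Once the server is up, you can sanity-check it from the host with the Triton Python client. This is a minimal sketch; it assumes the gRPC port 8005 used above and that `tritonclient[grpc]` is installed:

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8005")
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("accumulate_fp32"))
```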
- Create an ONNX model that exposes input and output state tensors. The model should also have a mechanism to reset the initial values of the state tensors at the beginning of a sequence. See the example model for a reference, and the hypothetical export sketch after this list.
- Create a model config file that matches the ONNX model. The model config file only needs to list the standard inputs and outputs, excluding the state tensors. The state pairs are listed in the `parameters` section. For the example ONNX model:
  ```
  {
    key: "state_pairs"
    value: { string_value: "<<<Accumulate_In, Accumulate_Out>>>" }
  }
  ```
- We also need a mapping from the `CONTROL_SEQUENCE_START` control to the `ResetSequence` boolean input tensor that resets the values of the state tensors. If this boolean input tensor is set to `true` for an inference request, the input state values will be ignored and the model will use the initial values of the states stored in the ONNX model file as constants. This mapping allows the stateful backend to reset the state tensor values at the start of a sequence.

  ```
  {
    name: "ResetSequence"
    control [
      {
        kind: CONTROL_SEQUENCE_START
        int32_false_true: [ 0, 1 ]
      }
    ]
  }
  ```
- Incorporate the model files into Triton's model repository:

  ```
  model_repository
  └── accumulate_fp32
      ├── 1
      │   └── accumulate_fp32.onnx
      └── config.pbtxt
  ```
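If you want to experiment without the shipped model, the following is a rough sketch of how a comparable accumulate model could be built with the `onnx` Python package. The tensor names mirror the example above, but the exact graph in `models/accumulate_fp32` may differ; treat this as an illustration, not the repository's model.

```python
import onnx
from onnx import helper, TensorProto

# I/O declarations (the batch dimension is dynamic).
inp       = helper.make_tensor_value_info("Input", TensorProto.FLOAT, ["batch", 4])
state_in  = helper.make_tensor_value_info("Accumulate_In", TensorProto.FLOAT, ["batch", 1])
reset     = helper.make_tensor_value_info("ResetSequence", TensorProto.BOOL, ["batch", 1])
out       = helper.make_tensor_value_info("Output", TensorProto.FLOAT, ["batch", 1])
state_out = helper.make_tensor_value_info("Accumulate_Out", TensorProto.FLOAT, ["batch", 1])

# Initial state value stored in the model as a constant, plus the reduction axis.
state_init = helper.make_tensor("state_init", TensorProto.FLOAT, [1, 1], [0.0])
axes       = helper.make_tensor("axes", TensorProto.INT64, [1], [1])

nodes = [
    # When ResetSequence is true, replace the incoming state with the stored constant.
    helper.make_node("Where", ["ResetSequence", "state_init", "Accumulate_In"], ["state"]),
    # Sum the input over its last dimension.
    helper.make_node("ReduceSum", ["Input", "axes"], ["row_sum"], keepdims=1),
    # The running sum becomes both the new state and the model output.
    helper.make_node("Add", ["state", "row_sum"], ["Accumulate_Out"]),
    helper.make_node("Identity", ["Accumulate_Out"], ["Output"]),
]

graph = helper.make_graph(
    nodes, "accumulate_fp32",
    inputs=[inp, state_in, reset],
    outputs=[out, state_out],
    initializer=[state_init, axes],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "accumulate_fp32.onnx")
```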
Run:

```bash
python3 scripts/test.py
```

It will build the backend, start the tritonserver with the backend, and run a simple client with the `accumulate_fp32` model.
The `models/accumulate_fp32` folder contains a simple Triton model with state tensors and a reset-state boolean input. The ONNX file contains a simple accumulation graph where the input tensor is summed over the last dimension and added to a running sum. The stateful backend keeps track of the running sum for all sequences and provides the output state (the running sum) as input to the model whenever the corresponding sequence has an inference request.

The model configuration file maps the `CONTROL_SEQUENCE_START` signal to the `ResetSequence` model input to initialize the state values with the 0 constants that are stored in the ONNX model. The files and folder structure can be used to serve similar stateful ONNX models.
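To see the accumulation in action, a minimal gRPC client could look like the sketch below. It assumes the server was started with `--grpc-port 8005` as in the earlier commands and that the example model's data tensors are named `Input` and `Output` with a `[1, 4]` request shape; check `models/accumulate_fp32/config.pbtxt` for the actual names and shapes.

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8005")

sequence_id = 42
data = np.ones((1, 4), dtype=np.float32)  # each request contributes sum(data) = 4.0

for step in range(3):
    infer_input = grpcclient.InferInput("Input", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    result = client.infer(
        model_name="accumulate_fp32",
        inputs=[infer_input],
        outputs=[grpcclient.InferRequestedOutput("Output")],
        sequence_id=sequence_id,
        sequence_start=(step == 0),  # resets the state for this sequence
        sequence_end=(step == 2),    # releases the state slot on the server
    )
    # If the assumed names/shapes match, the running sum should grow by 4.0 each step.
    print("running sum:", result.as_numpy("Output"))
```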
- CUDA Graph based execution support was added for the ORT-TRT path only.

  ```
  {
    key: "enable_cuda_graph"  # Enable CUDA Graph based execution
                              # WARNING: ALWAYS CHECK CORRECTNESS WHEN ENABLING THIS
                              # Default: 0, disabled
    value: { string_value: "1" }
  },
  ```

  NOTE: There are several restrictions on properly using CUDA Graph based execution:
  - CUDA Graph capture in ORT is not thread-safe. If you have multiple model instances, each one needs to be loaded sequentially (Triton's default behavior) to capture the graphs sequentially.
  - If you have multiple models to load in the same Triton server and more than one of them is trying to utilize CUDA Graph, you need to load them sequentially as well (use `--model-load-thread-count 1` when launching the server).
  - Even if both of the above conditions are met, you should check the ORT logs to make sure that the graph is captured correctly during model loading so that it is only replayed during inference requests.
  - If ORT needs to rebuild the engine, for example due to an input shape change, it will trigger another graph capture, which might crash the server if another model instance is trying to replay its own CUDA graph.
- The stateful backend can do dynamic batching along any tensor dimension. The batch dimension should be marked with -1 in the model configuration file for the input and output tensors.
- The state tensors can only have one dynamic dimension, which is assumed to be the batch dimension.
- The ONNX model should contain the initial values for the state tensors. The `CONTROL_SEQUENCE_START` control input can be mapped to a boolean model input tensor that signals when to reset the initial values of the states.
- `ort_ep` model config parameter to choose the ORT execution provider between `trt` and `cuda`.
- `compute_precision` model config parameter to specify the precision (`fp32` or `fp16`). `fp16` is only supported for the ORT `trt` EP.
- `always_pad_to_max_batch` model config parameter to specify whether the batch dimension should be padded to the max batch size for model execution (set value to `1`).
- `store_states_as_fp16` model config parameter to specify whether the internal states are stored as FP16 to reduce memory consumption (set value to `1`). However, it may impact the result accuracy.
- `metric_logging_frequency_seconds` controls how frequently the backend logs the inference statistics. Default is `0`, which disables such logs.
- `enable_trt_caching` and `trt_cache_dir` control the engine caching from TRT. The default values are `0` (disabled) and `/tmp`, respectively.
- `logging_level` controls the level of detail in the backend's model runner logs. Possible values are `NONE`, `INFO` and `VERBOSE`. Default is `INFO`. Two additional parameters further control some logs:
  - `batch_info_logging_level` controls whether the batch information will be logged during each execution of the model. Possible values are `INFO` and `VERBOSE`. Default is `INFO`, which puts these logs into the same stream as `logging_level`'s `INFO`.
  - `detailed_metrics_logging_level` controls whether some detailed metrics from the backend will be logged. Possible values are `INFO` and `VERBOSE`. Default is `VERBOSE`, which puts these logs into the same stream as `logging_level`'s `VERBOSE`. Note that metric logging must be enabled for this; see `metric_logging_frequency_seconds`.
- `max_candidate_sequence_use_ratio` controls whether the maximum number of simultaneous sequences handled by the backend should be less than what Triton uses, i.e. `max_candidate_sequences`. The ratio can be used to enable prompt error handling in the backend for overloaded servers. The default value is `0.9`.
- `infer_end_requests` controls whether to run inference on requests with the `end` signal. If specified as `0`, the backend will not run inference for such requests; it will release the state buffers for those sequences and send a successful response with empty output. The default value is `1`.
- Lazy allocation of the states buffer can be enabled and configured using 4 additional parameters: `initial_buffer_size`, `subsequent_buffer_size`, `buffer_alloc_threshold`, and `buffer_dealloc_threshold` (see the toy sketch after this list).
  - By default, the lazy allocation feature is disabled and the backend allocates the whole buffer during model loading for the `max_candidate_sequences` simultaneous sequences.
  - If `initial_buffer_size` is present in the model config, the other 3 configs must be present as well, and the lazy allocation behavior is enabled. If enabled, the backend only allocates the buffer for `initial_buffer_size` sequences during loading.
  - Once the number of free slots drops below `buffer_alloc_threshold`, the backend creates a new buffer of size `subsequent_buffer_size`.
  - On the other hand, it deallocates only the last buffer, and the deallocation occurs when the following conditions are met:
    - the number of allocated buffers (including the initial buffer) is more than 1
    - the total number of free slots rises above `buffer_dealloc_threshold`
    - all the slots of the last buffer are free, i.e. it isn't storing states for any active sequences
  - The initial buffer is never deallocated until the model gets unloaded.
  - For example, the following could be a valid combination of the config parameters when `max_batch_size` is 16:

    ```
    parameters [
      ...
      { key: "initial_buffer_size" value: { string_value: "512" } },
      { key: "subsequent_buffer_size" value: { string_value: "100" } },
      { key: "buffer_alloc_threshold" value: { string_value: "32" } },
      { key: "buffer_dealloc_threshold" value: { string_value: "200" } },
      ...
    ]
    ```

  - Refer to the code for any more restrictions on these parameters.
- The ORT environment is created during the backend initialization so that multiple models can share the same environment. To control the ORT logging level, you can pass a command line argument when starting Triton, e.g. `tritonserver --model-repository <model dir> --backend-config=stateful,ort-logging-level=3` will set the ORT logging level to `ORT_LOGGING_LEVEL_ERROR`. Refer to this for all possible values. The default logging level is `ORT_LOGGING_LEVEL_WARNING`.
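As a way to visualize the lazy-allocation policy described above, here is a toy, purely illustrative simulation. It is not the backend's code; in particular, how sequences are assigned to buffer slots is a guess made only so the example runs.

```python
# Illustrative parameters (same spirit as the example config above).
initial_buffer_size = 512
subsequent_buffer_size = 100
buffer_alloc_threshold = 32
buffer_dealloc_threshold = 200

# Each buffer tracks its capacity and how many slots are currently occupied.
buffers = [{"capacity": initial_buffer_size, "used": 0}]  # initial buffer, never freed


def total_free_slots():
    return sum(b["capacity"] - b["used"] for b in buffers)


def start_sequence():
    # Grow: once the number of free slots drops below the alloc threshold,
    # allocate another buffer of subsequent_buffer_size slots.
    if total_free_slots() < buffer_alloc_threshold:
        buffers.append({"capacity": subsequent_buffer_size, "used": 0})
    # Slot placement is assumed here: take the first buffer with a free slot.
    for i, b in enumerate(buffers):
        if b["used"] < b["capacity"]:
            b["used"] += 1
            return i


def end_sequence(buffer_index):
    buffers[buffer_index]["used"] -= 1
    last = buffers[-1]
    # Shrink: only the last buffer is deallocated, and only when all three
    # conditions described above hold at the same time.
    if (
        len(buffers) > 1
        and total_free_slots() > buffer_dealloc_threshold
        and last["used"] == 0
    ):
        buffers.pop()
```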
- The stateful backend only works with ONNX models.
- All tensor dimensions except for the batch dimension are fixed for a model instance.
- The model state reset tensor should be a boolean tensor.
- Only the `oldest` sequence batching strategy is supported.
- NOTE: The backend cannot be loaded/unloaded repeatedly (e.g. Triton's explicit loading mode) since ORT doesn't allow destroying and then re-creating `Ort::Env` objects. This should be mitigated once Triton is updated so that model unloading doesn't trigger the backend unloading. Also, for the same reason, beware of using this backend along with any other ORT-based backend in the same Triton instance, as it may cause stability issues in ORT.