diff --git a/docs/Dockerfile.docs b/docs/Dockerfile.docs index ba30a144ac..65a91acd6b 100644 --- a/docs/Dockerfile.docs +++ b/docs/Dockerfile.docs @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -59,6 +59,7 @@ RUN pip3 install \ breathe \ docutils \ exhale \ + httplib2 \ ipython \ myst-nb \ nbclient \ @@ -73,6 +74,12 @@ RUN pip3 install \ sphinx-tabs \ sphinxcontrib-bibtex + +# install nvidia-sphinx-theme +RUN pip3 install \ + --index-url https://urm.nvidia.com/artifactory/api/pypi/ct-omniverse-pypi/simple/ \ + nvidia-sphinx-theme + # Set visitor script to be included on every HTML page ENV VISITS_COUNTING_SCRIPT="//assets.adobedtm.com/b92787824f2e0e9b68dc2e993f9bd995339fe417/satelliteLib-7ba51e58dc61bcb0e9311aadd02a0108ab24cc6c.js" diff --git a/docs/README.md b/docs/README.md index 0f9faba3fe..a9604c0eae 100644 --- a/docs/README.md +++ b/docs/README.md @@ -124,9 +124,9 @@ Triton supports batching individual inference requests to improve compute resour - [Queuing Policies](user_guide/model_configuration.md#queue-policy) - [Ragged Batching](user_guide/ragged_batching.md) - [Sequence Batcher](user_guide/model_configuration.md#sequence-batcher) - - [Stateful Models](user_guide/architecture.md#stateful-models) - - [Control Inputs](user_guide/architecture.md#control-inputs) - - [Implicit State - Stateful Inference Using a Stateless Model](user_guide/architecture.md#implicit-state-management) + - [Stateful Models](user_guide/model_execution.md#stateful-models) + - [Control Inputs](user_guide/model_execution.md#control-inputs) + - [Implicit State - Stateful Inference Using a Stateless Model](user_guide/implicit_state_management.md#implicit-state-management) - [Sequence Scheduling Strategies](user_guide/architecture.md#scheduling-strategies) - [Direct](user_guide/architecture.md#direct) - [Oldest](user_guide/architecture.md#oldest) diff --git a/docs/backend_guide/vllm.rst b/docs/backend_guide/vllm.rst new file mode 100644 index 0000000000..06be17128f --- /dev/null +++ b/docs/backend_guide/vllm.rst @@ -0,0 +1,11 @@ +######## +vLLM +######## + +.. toctree:: + :hidden: + :caption: vLLM + :maxdepth: 2 + + ../vllm_backend/README + Multi-LoRA <../vllm_backend/docs/llama_multi_lora_tutorial> \ No newline at end of file diff --git a/docs/client_guide/api_reference.rst b/docs/client_guide/api_reference.rst new file mode 100644 index 0000000000..0493510e71 --- /dev/null +++ b/docs/client_guide/api_reference.rst @@ -0,0 +1,10 @@ +#### +API Reference +#### + +.. toctree:: + :maxdepth: 1 + :hidden: + + OpenAI API + kserve \ No newline at end of file diff --git a/docs/client_guide/in_process.rst b/docs/client_guide/in_process.rst new file mode 100644 index 0000000000..b1ee46a925 --- /dev/null +++ b/docs/client_guide/in_process.rst @@ -0,0 +1,39 @@ +#### +In-Process Triton Server API +#### + + +The Triton Inference Server provides a backwards-compatible C API/ python-bindings/java-bindings that +allows Triton to be linked directly into a C/C++/java/python application. This API +is called the "Triton Server API" or just "Server API" for short. The +API is implemented in the Triton shared library which is built from +source contained in the `core +repository `__. On Linux +this library is libtritonserver.so and on Windows it is +tritonserver.dll. 
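+
+For orientation, the basic pattern for an application that links this
+library directly is sketched below. This is a minimal, illustrative
+sketch only: error checking is omitted and the model repository path
+``/models`` is a placeholder, not a path from this repository. See
+`simple.cc <https://github.com/triton-inference-server/server/blob/main/src/simple.cc>`__
+for a complete, working example.
+
+.. code:: cpp
+
+   #include "tritonserver.h"
+
+   // Configure and create an in-process server instance
+   // (error checking omitted for clarity).
+   TRITONSERVER_ServerOptions* options = nullptr;
+   TRITONSERVER_ServerOptionsNew(&options);
+   TRITONSERVER_ServerOptionsSetModelRepositoryPath(options, "/models");
+
+   TRITONSERVER_Server* server = nullptr;
+   TRITONSERVER_ServerNew(&server, options);
+   TRITONSERVER_ServerOptionsDelete(options);
+
+   // ... issue inference requests through the Server API ...
+
+   TRITONSERVER_ServerDelete(server);
+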
In the Triton Docker image the shared library is +found in /opt/tritonserver/lib. The header file that defines and +documents the Server API is +`tritonserver.h `__. +`Java bindings for In-Process Triton Server API <../customization_guide/inprocess_java_api.html#java-bindings-for-in-process-triton-server-api>`__ +are built on top of `tritonserver.h` and can be used for Java applications that +need to use Tritonserver in-process. + +All capabilities of Triton server are encapsulated in the shared +library and are exposed via the Server API. The `tritonserver` +executable implements HTTP/REST and GRPC endpoints and uses the Server +API to communicate with core Triton logic. The primary source files +for the endpoints are `grpc_server.cc `__ and +`http_server.cc `__. In these source files you can +see the Server API being used. + +You can use the Server API in your own application as well. A simple +example using the Server API can be found in +`simple.cc `__. + +.. toctree:: + :maxdepth: 1 + :hidden: + + C/C++ <../customization_guide/inprocess_c_api.md> + python + Java <../customization_guide/inprocess_java_api.md> \ No newline at end of file diff --git a/docs/client_guide/kserve.rst b/docs/client_guide/kserve.rst new file mode 100644 index 0000000000..e2ac33c45f --- /dev/null +++ b/docs/client_guide/kserve.rst @@ -0,0 +1,15 @@ +#### +KServe API +#### + + +Triton uses the +`KServe community standard inference protocols `__ +to define HTTP/REST and GRPC APIs plus several extensions. + +.. toctree:: + :maxdepth: 1 + :hidden: + + HTTP/REST and GRPC Protocol <../customization_guide/inference_protocols.md> + kserve_extension \ No newline at end of file diff --git a/docs/client_guide/kserve_extension.rst b/docs/client_guide/kserve_extension.rst new file mode 100644 index 0000000000..7a78484499 --- /dev/null +++ b/docs/client_guide/kserve_extension.rst @@ -0,0 +1,24 @@ +#### +Extensions +#### + +To fully enable all capabilities +Triton also implements `HTTP/REST and GRPC +extensions `__ +to the KServe inference protocol. + +.. toctree:: + :maxdepth: 1 + :hidden: + + Binary tensor data extension <../protocol/extension_binary_data.md> + Classification extension <../protocol/extension_classification.md> + Schedule policy extension <../protocol/extension_schedule_policy.md> + Sequence extension <../protocol/extension_sequence.md> + Shared-memory extension <../protocol/extension_shared_memory.md> + Model configuration extension <../protocol/extension_model_configuration.md> + Model repository extension <../protocol/extension_model_repository.md> + Statistics extension <../protocol/extension_statistics.md> + Trace extension <../protocol/extension_trace.md> + Logging extension <../protocol/extension_logging.md> + Parameters extension <../protocol/extension_parameters.md> \ No newline at end of file diff --git a/docs/client_guide/openai_readme.md b/docs/client_guide/openai_readme.md new file mode 120000 index 0000000000..05ca8a99c5 --- /dev/null +++ b/docs/client_guide/openai_readme.md @@ -0,0 +1 @@ +../../python/openai/README.md \ No newline at end of file diff --git a/docs/client_guide/python.rst b/docs/client_guide/python.rst new file mode 100644 index 0000000000..2610ce2d87 --- /dev/null +++ b/docs/client_guide/python.rst @@ -0,0 +1,12 @@ +#### +Python +#### + +.. include:: python_readme.rst + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + Kafka I/O <../tutorials/Triton_Inference_Server_Python_API/examples/kafka-io/README.md> + Rayserve <../tutorials/Triton_Inference_Server_Python_API/examples/rayserve/README.md> \ No newline at end of file diff --git a/docs/client_guide/python_readme.rst b/docs/client_guide/python_readme.rst new file mode 100644 index 0000000000..91e3f1b26d --- /dev/null +++ b/docs/client_guide/python_readme.rst @@ -0,0 +1,268 @@ +.. raw:: html + + + +Triton Inference Server In-Process Python API [BETA] +==================================================== + +Starting with release 24.01 Triton Inference Server will include a +Python package enabling developers to embed Triton Inference Server +instances in their Python applications. The in-process Python API is +designed to match the functionality of the in-process C API while +providing a higher level abstraction. At its core the API relies on a +1:1 python binding of the C API and provides all the flexibility and +power of the C API with a simpler to use interface. + + [!Note] As the API is in BETA please expect some changes as we test + out different features and get feedback. All feedback is weclome and + we look forward to hearing from you! + +| `Requirements <#requirements>`__ \| `Installation <#installation>`__ + \| `Hello World <#hello-world>`__ \| `Stable + Diffusion <#stable-diffusion>`__ \| `Ray Serve + Deployment <../tutorials/Triton_Inference_Server_Python_API/examples/rayserve>`__ \| +Requirements +------------ + +The following instructions require a linux system with Docker installed. +For CUDA support, make sure your CUDA driver meets the requirements in +“NVIDIA Driver” section of Deep Learning Framework support matrix: +https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html + +Installation +------------ + +The tutorial and Python API package are designed to be installed and run +within the ``nvcr.io/nvidia/tritonserver:24.01-py3`` docker image. + +A set of convenience scripts are provided to create a docker image based +on the ``nvcr.io/nvidia/tritonserver:24.01-py3`` image with the Python +API installed plus additional dependencies required for the examples. + +Triton Inference Server 24.01 + Python API +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Clone Repository +^^^^^^^^^^^^^^^^ + +.. code:: bash + git clone https://github.com/triton-inference-server/tutorials.git + cd tutorials/Triton_Inference_Server_Python_API +Build ``triton-python-api:r24.01`` Image +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code:: bash + ./build.sh +Supported Backends +^^^^^^^^^^^^^^^^^^ + +The built image includes all the backends shipped by default in the +tritonserver ``nvcr.io/nvidia/tritonserver:24.01-py3`` container. + +:: + + dali fil identity onnxruntime openvino python pytorch repeat square tensorflow tensorrt + +Included Models +^^^^^^^^^^^^^^^ + +The ``default`` build includes an ``identity`` model that can be used +for exercising basic operations including sending input tensors of +different data types. The ``identity`` model copies provided inputs of +``shape [-1, -1]`` to outputs of shape ``[-1, -1]``. Inputs are named +``data_type_input`` and outputs are named ``data_type_output`` +(e.g. ``string_input``, ``string_output``, ``fp16_input``, +``fp16_output``). + +Hello World +----------- + +Start ``triton-python-api:r24.01`` Container +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following command starts a container and volume mounts the current +directory as ``workspace``. 
+ +.. code:: bash + ./run.sh +Enter Python Shell +~~~~~~~~~~~~~~~~~~ + +.. code:: bash + python3 +Create and Start a Server Instance +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: python + import tritonserver + server = tritonserver.Server(model_repository="/workspace/identity-models") + server.start() +List Models +~~~~~~~~~~~ + +:: + + server.models() + +Example Output +^^^^^^^^^^^^^^ + +``server.models()`` returns a dictionary of the available models with +their current state. + +.. code:: python + {('identity', 1): {'name': 'identity', 'version': 1, 'state': 'READY'}} +Send an Inference Request +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: python + model = server.model("identity") + responses = model.infer(inputs={"string_input":[["hello world!"]]}) +Iterate through Responses +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``model.infer()`` returns an iterator that can be used to process the +results of an inference request. + +.. code:: python + for response in responses: + print(response.outputs["string_output"].to_string_array()) +.. _example-output-1: + +Example Output +^^^^^^^^^^^^^^ + +.. code:: python + [['hello world!']] +Stable Diffusion +---------------- + +This example is based on the +`Popular_Models_Guide/StableDiffusion <../tutorials/Popular_Models_Guide/StableDiffusion/README.html>`__ +tutorial. + +Build ``triton-python-api:r24.01-diffusion`` Image and Stable Diffusion Models +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Please note the following command will take many minutes depending on +your hardware configuration and network connection. + +.. code:: bash + ./build.sh --framework diffusion --build-models +.. _supported-backends-1: + +Supported Backends +^^^^^^^^^^^^^^^^^^ + +The built image includes all the backends shipped by default in the +tritonserver ``nvcr.io/nvidia/tritonserver:24.01-py3`` container. + +:: + + dali fil identity onnxruntime openvino python pytorch repeat square tensorflow tensorrt + +.. _included-models-1: + +Included Models +^^^^^^^^^^^^^^^ + +The ``diffusion`` build includes a ``stable_diffustion`` pipeline that +takes a text prompt and returns a generated image. For more details on +the models and pipeline please see the +`Popular_Models_Guide/StableDiffusion <../tutorials/Popular_Models_Guide/StableDiffusion/README.html>`__ +tutorial. + +Start Container +~~~~~~~~~~~~~~~ + +The following command starts a container and volume mounts the current +directory as ``workspace``. + +.. code:: bash + ./run.sh --framework diffusion +.. _enter-python-shell-1: + +Enter Python Shell +~~~~~~~~~~~~~~~~~~ + +.. code:: bash + python3 +.. _create-and-start-a-server-instance-1: + +Create and Start a Server Instance +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: python + import tritonserver + import numpy + from PIL import Image + server = tritonserver.Server(model_repository="/workspace/diffusion-models") + server.start() +.. _list-models-1: + +List Models +~~~~~~~~~~~ + +:: + + server.models() + +.. _example-output-2: + +Example Output +^^^^^^^^^^^^^^ + +.. code:: python + {('stable_diffusion', 1): {'name': 'stable_diffusion', 'version': 1, 'state': 'READY'}, ('text_encoder', 1): {'name': 'text_encoder', 'version': 1, 'state': 'READY'}, ('vae', 1): {'name': 'vae', 'version': 1, 'state': 'READY'}} +.. _send-an-inference-request-1: + +Send an Inference Request +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code:: python + model = server.model("stable_diffusion") + responses = model.infer(inputs={"prompt":[["butterfly in new york, realistic, 4k, photograph"]]}) +Iterate through Responses and save image +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: python + for response in responses: + generated_image = numpy.from_dlpack(response.outputs["generated_image"]) + generated_image = generated_image.squeeze().astype(numpy.uint8) + image_ = Image.fromarray(generated_image) + image_.save("sample_generated_image.jpg") +.. _example-output-3: + +Example Output +^^^^^^^^^^^^^^ + +.. figure:: ../tutorials/Triton_Inference_Server_Python_API/docs/sample_generated_image.jpg + :alt: sample_generated_image + + sample_generated_image \ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py index 505af4351d..6c59e45c72 100755 --- a/docs/conf.py +++ b/docs/conf.py @@ -34,27 +34,50 @@ # -- Path setup -------------------------------------------------------------- +import json +import os +import re +from datetime import date + # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. # -import os - +import httplib2 +import nvidia_sphinx_theme from docutils import nodes +from packaging.version import Version from sphinx import search # import sys # sys.path.insert(0, os.path.abspath('.')) +# -- conf.py setup ----------------------------------------------------------- + +# conf.py needs to be run in the top level 'docs' +# directory but the calling build script needs to +# be called from the current working directory. We +# change to the 'docs' dir here and then revert back +# at the end of the file. +# current_dir = os.getcwd() +# os.chdir("docs") + # -- Project information ----------------------------------------------------- project = "NVIDIA Triton Inference Server" -copyright = "2018-2024, NVIDIA Corporation" +copyright = "2018-{}, NVIDIA Corporation".format(date.today().year) author = "NVIDIA" -# The full version, including alpha/beta/rc tags -# Env only set during riva-release process, otherwise keep as dev for all internal builds -release = os.getenv("TRITON_VERSION", "dev") +# Get the version of Triton this is building. +version_long = "0.0.0" +with open("../TRITON_VERSION") as f: + version_long = f.readline() + version_long = version_long.strip() + +version_short = re.match(r"^[\d]+\.[\d]+\.[\d]+", version_long).group(0) +version_short_split = version_short.split(".") +one_before = f"{version_short_split[0]}.{int(version_short_split[1]) - 1}.{version_short_split[2]}" + # maintain left-side bar toctrees in `contents` file # so it doesn't show up needlessly in the index page @@ -123,66 +146,54 @@ myst_heading_anchors = 5 # Add any paths that contain templates here, relative to this directory. -templates_path = ["_templates"] +# templates_path = ["_templates"] # disable it for nvidia-sphinx-theme to show footer # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. 
-exclude_patterns = ["README.md", "examples/README.md", "user_guide/perf_analyzer.md"] +exclusions = None +with open("exclusions.txt", "r") as f: + exclusions = f.read() + f.close() +exclude_patterns = exclusions.strip().split("\n") +print(f"exclude_patterns: {exclude_patterns}") # -- Options for HTML output ------------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # -html_theme = "sphinx_book_theme" -html_logo = "_static/nvidia-logo-horiz-rgb-blk-for-screen.png" -html_title = "NVIDIA Triton Inference Server" -html_short_title = "Triton" -html_copy_source = True -html_sourcelink_suffix = "" -html_favicon = "_static/nvidia-logo-vert-rgb-blk-for-screen.png" -html_last_updated_fmt = "" -html_additional_files = ["index.html"] +html_theme = "nvidia_sphinx_theme" # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ["_static"] -html_css_files = ["custom.css"] +# html_css_files = ["custom.css"] # Not needed with new theme html_theme_options = { - "path_to_docs": "docs", - # "launch_buttons": { - # "binderhub_url": "https://mybinder.org", - # "colab_url": "https://colab.research.google.com/", - # "deepnote_url": "https://deepnote.com/", - # "notebook_interface": "jupyterlab", - # "thebe": True, - # # "jupyterhub_url": "https://datahub.berkeley.edu", # For testing - # }, - "use_edit_page_button": False, - "use_issues_button": True, - "use_repository_button": True, - "use_download_button": False, - "logo_only": False, - "show_toc_level": 2, - "extra_navbar": "", - "extra_footer": """ - Privacy Policy | - Manage My Privacy | - Do Not Sell or Share My - Data | - Terms of Service | - Accessibility | - Corporate Policies | - Product Security | - Contact""", - "repository_url": "https://github.com/triton-inference-server/server", - "use_repository_button": True, + "collapse_navigation": False, + "github_url": "https://github.com/triton-inference-server/server", + "switcher": { + # use for local testing + # "json_url": "http://localhost:8000/_static/switcher.json", + "json_url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/_static/switcher.json", + "version_match": one_before if "dev" in version_long else version_short, + }, + "navbar_start": ["navbar-logo", "version-switcher"], + "primary_sidebar_end": [], } -version_short = release +# Theme options are theme-specific and customize the look and feel of a theme +# further. For a list of options available for each theme, see the +# documentation. 
+# +html_theme_options.update( + { + "collapse_navigation": False, + } +) + deploy_ngc_org = "nvidia" deploy_ngc_team = "triton" myst_substitutions = { @@ -218,6 +229,82 @@ def ultimateReplace(app, docname, source): nb_execution_mode = "off" # Global execution disable # execution_excludepatterns = ['tutorials/tts-python-basics.ipynb'] # Individual notebook disable +############################### +# SETUP SWITCHER +############################### +switcher_path = os.path.join(html_static_path[0], "switcher.json") +versions = [] +# Triton 2 releases +correction = -1 if "dev" in version_long else 0 +upper_bound = version_short.split(".")[1] +for i in range(2, int(version_short.split(".")[1]) + correction): + versions.append((f"2.{i}.0", f"triton-inference-server-2{i}0")) + +# Triton 1 releases +for i in range(0, 15): + versions.append((f"1.{i}.0", f"tensorrt_inference_server_1{i}0")) + +# Triton Beta Releases +for i in range(1, 11): + versions.append((f"0.{i}.0_beta", f"inference_server_0{i}0_beta")) + +# Patch releases +# Add here. + +versions = sorted(versions, key=lambda v: Version(v[0]), reverse=True) + +# Build switcher data +json_data = [] +for v in versions: + json_data.append( + { + "name": v[0], + "version": v[0], + "url": f"https://docs.nvidia.com/deeplearning/triton-inference-server/archives/{v[1]}/user-guide/docs", + } + ) +if "dev" in version_long: + json_data.insert( + 0, + { + "name": f"{one_before} (current_release)", + "version": f"{one_before}", + "url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html", + }, + ) +else: + json_data.insert( + 0, + { + "name": f"{version_short} (current release)", + "version": f"{version_short}", + "url": "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html", + }, + ) + +# Trim to last N releases. +json_data = json_data[0:12] + +json_data.append( + { + "name": "older releases", + "version": "archives", + "url": "https://docs.nvidia.com/deeplearning/triton-inference-server/archives/", + } +) + +# validate the links +for i, d in enumerate(json_data): + h = httplib2.Http() + resp = h.request(d["url"], "HEAD") + if int(resp[0]["status"]) >= 400: + print(d["url"], "NOK", resp[0]["status"]) + exit(1) + +# Write switcher data to file +with open(switcher_path, "w") as f: + json.dump(json_data, f, ensure_ascii=False, indent=4) + def setup(app): app.add_config_value("ultimate_replacements", {}, True) @@ -246,43 +333,5 @@ def setup(app): # ) -# Patch for sphinx.search stemming short terms (i.e. 
tts -> tt) -# https://github.com/sphinx-doc/sphinx/blob/4.5.x/sphinx/search/__init__.py#L380 -def sphinxSearchIndexFeed( - self, docname: str, filename: str, title: str, doctree: nodes.document -): - """Feed a doctree to the index.""" - self._titles[docname] = title - self._filenames[docname] = filename - - visitor = search.WordCollector(doctree, self.lang) - doctree.walk(visitor) - - # memoize self.lang.stem - def stem(word: str) -> str: - try: - return self._stem_cache[word] - except KeyError: - self._stem_cache[word] = self.lang.stem(word).lower() - return self._stem_cache[word] - - _filter = self.lang.word_filter - - for word in visitor.found_title_words: - stemmed_word = stem(word) - if len(stemmed_word) > 3 and _filter(stemmed_word): - self._title_mapping.setdefault(stemmed_word, set()).add(docname) - elif _filter(word): # stemmer must not remove words from search index - self._title_mapping.setdefault(word.lower(), set()).add(docname) - - for word in visitor.found_words: - stemmed_word = stem(word) - # again, stemmer must not remove words from search index - if len(stemmed_word) <= 3 or not _filter(stemmed_word) and _filter(word): - stemmed_word = word.lower() - already_indexed = docname in self._title_mapping.get(stemmed_word, set()) - if _filter(stemmed_word) and not already_indexed: - self._mapping.setdefault(stemmed_word, set()).add(docname) - - -search.IndexBuilder.feed = sphinxSearchIndexFeed +# cleanup +# os.chdir(current_dir) diff --git a/docs/contents.md b/docs/contents.md deleted file mode 100644 index 5aaafa7afa..0000000000 --- a/docs/contents.md +++ /dev/null @@ -1,156 +0,0 @@ - - -```{toctree} -:maxdepth: 1 -:caption: Getting Started - -getting_started/quickstart -``` - -```{toctree} -:maxdepth: 1 -:caption: User Guide - -user_guide/performance_tuning -user_guide/architecture -user_guide/model_repository -customization_guide/repository_agents -user_guide/model_configuration -user_guide/request_cancellation -user_guide/optimization -user_guide/ragged_batching -user_guide/rate_limiter -user_guide/model_analyzer -user_guide/model_management -user_guide/custom_operations -user_guide/decoupled_models -user_guide/response_cache -user_guide/metrics -user_guide/trace -user_guide/jetson -user_guide/v1_to_v2 -customization_guide/deploy -``` - -```{toctree} -:maxdepth: 1 -:caption: Debugging - -user_guide/debugging_guide -user_guide/faq -``` - -```{toctree} -:maxdepth: 1 -:caption: Protocol Guides - -protocol/README -customization_guide/inference_protocols -protocol/extension_binary_data -protocol/extension_classification -protocol/extension_generate -protocol/extension_logging -protocol/extension_model_configuration -protocol/extension_model_repository -protocol/extension_schedule_policy -protocol/extension_sequence -protocol/extension_shared_memory -protocol/extension_statistics -protocol/extension_trace -protocol/extension_parameters -``` - -```{toctree} -:maxdepth: 1 -:caption: Customization Guide - -customization_guide/build -customization_guide/compose -customization_guide/test -``` - -```{toctree} -:maxdepth: 1 -:caption: Examples - -examples/jetson/README -examples/jetson/concurrency_and_dynamic_batching/README -``` - -```{toctree} -:maxdepth: 1 -:caption: Client - -client/README -_reference/tritonclient_api.rst -client/src/java/README -client/src/grpc_generated/go/README -client/src/grpc_generated/javascript/README -client/src/grpc_generated/java/README -``` - -```{toctree} -:maxdepth: 1 -:caption: Performance Analyzer - -perf_analyzer/README -perf_analyzer/docs/README 
-perf_analyzer/docs/install -perf_analyzer/docs/quick_start -perf_analyzer/docs/cli -perf_analyzer/docs/inference_load_modes -perf_analyzer/docs/input_data -perf_analyzer/docs/measurements_metrics -perf_analyzer/docs/benchmarking -perf_analyzer/genai-perf/README -perf_analyzer/genai-perf/docs/compare -perf_analyzer/genai-perf/docs/embeddings -perf_analyzer/genai-perf/docs/files -perf_analyzer/genai-perf/docs/lora -perf_analyzer/genai-perf/docs/multi_modal -perf_analyzer/genai-perf/docs/rankings -perf_analyzer/genai-perf/docs/tutorial -perf_analyzer/genai-perf/examples/tutorial -``` - -```{toctree} -:maxdepth: 1 -:caption: Python Backend - -python_backend/README -python_backend/inferentia/README -python_backend/examples/auto_complete/README -python_backend/examples/bls/README -python_backend/examples/bls_decoupled/README -python_backend/examples/custom_metrics/README -python_backend/examples/decoupled/README -python_backend/examples/instance_kind/README -python_backend/examples/jax/README -python_backend/examples/preprocessing/README -``` diff --git a/docs/contents.rst b/docs/contents.rst new file mode 100644 index 0000000000..ff132c729d --- /dev/null +++ b/docs/contents.rst @@ -0,0 +1,120 @@ +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +.. toctree:: + :hidden: + + Home + Release notes + Compatibility matrix + +.. toctree:: + :hidden: + :caption: Getting Started + + getting_started/quick_deployment_by_backend + LLM With TRT-LLM + Multimodal model <../tutorials/Popular_Models_Guide/Llava1.5/llava_trtllm_guide.md> + Stable diffusion <../tutorials/Popular_Models_Guide/StableDiffusion/README.md> + +.. toctree:: + :hidden: + :caption: Scaling guide + + Multi-Node (AWS) <../tutorials/Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/README.md> + Multi-Instance <../tutorials/Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md> + +.. 
toctree:: + :hidden: + :caption: AI Agents + + Constrained Decoding <../tutorials/AI_Agents_Guide/Constrained_Decoding/README.md> + Function Calling <../tutorials/AI_Agents_Guide/Function_Calling/README.md> + +.. toctree:: + :hidden: + :caption: Client + + client_guide/api_reference + client_guide/in_process + Client Libraries + _reference/tritonclient_api.rst + +.. toctree:: + :hidden: + :caption: Server + + Model_execution + Scheduler + Batcher + server_guide/model_pipelines + server_guide/state_management + Request Cancellation + Rate Limiter + Caching + Metrics + Tracing + +.. toctree:: + :hidden: + :caption: Model Management + + + Repository + Configuration + Optimization + Controls + Decoupled models + Custom operators + +.. toctree:: + :hidden: + :caption: Backends + + TRT-LLM + vLLM + Python + Pytorch + ONNX Runtime + TensorFlow + TensorRT + FIL + DALI + Custom + +.. toctree:: + :hidden: + :caption: Perf benchmarking and tuning + + GenAI Perf Analyzer + Performance Analyzer + Model Analyzer + Model Navigator + +.. toctree:: + :hidden: + :caption: Debugging + + Guide diff --git a/docs/customization_guide/inference_protocols.md b/docs/customization_guide/inference_protocols.md index a241f097da..dea3e0459e 100644 --- a/docs/customization_guide/inference_protocols.md +++ b/docs/customization_guide/inference_protocols.md @@ -1,5 +1,5 @@ + +# C API Description + +Triton server functionality is encapsulated in a shared library which +is built from source contained in the [core +repository](https://github.com/triton-inference-server/core). You can +include the full capabilities of Triton by linking the shared library +into your application and by using the C API defined in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + +When you link the Triton shared library into your application you are +*not* spawning a separate Triton process, instead, you are including +the Triton core logic directly in your application. The Triton +HTTP/REST or GRPC protocols are not used to communicate with this +Triton core logic, instead all communication between your application +and the Triton core logic must take place via the [Server +API](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + +The top-level abstraction used by Server API is `TRITONSERVER_Server`, +which represents the Triton core logic that is capable of implementing +all of the features and capabilities of Triton. A +`TRITONSERVER_Server` object is created by calling +`TRITONSERVER_ServerNew` with a set of options that indicate how the +object should be initialized. Use of `TRITONSERVER_ServerNew` is +demonstrated in [simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc). Once you have created a +`TRITONSERVER_Server` object, you can begin using the rest of the +Server API as described below. + +## Error Handling + +Most Server API functions return an error object indicating success or +failure. Success is indicated by return `nullptr` (`NULL`). Failure is +indicated by returning a `TRITONSERVER_Error` object. The error code +and message can be retrieved from a `TRITONSERVER_Error` object with +`TRITONSERVER_ErrorCode` and `TRITONSERVER_ErrorMessage`. + +The lifecycle and ownership of all Server API objects is documented in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). 
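+
+As a concrete illustration, the typical check-and-release pattern looks
+roughly like the following sketch (the `TRITONSERVER_ServerIsLive` call is
+simply an arbitrary example of a Server API function that returns an
+error object):
+
+```
+#include <stdio.h>
+#include "tritonserver.h"
+
+// 'server' is assumed to be a previously created TRITONSERVER_Server*.
+bool live = false;
+TRITONSERVER_Error* err = TRITONSERVER_ServerIsLive(server, &live);
+if (err != nullptr) {
+  fprintf(
+      stderr, "error: %s - %s\n", TRITONSERVER_ErrorCodeString(err),
+      TRITONSERVER_ErrorMessage(err));
+  // The caller must release the error object when done with it.
+  TRITONSERVER_ErrorDelete(err);
+}
+```
+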
For +`TRITONSERVER_Error`, ownership of the object passes to the caller of +the Server API function. As a result, your application is responsible +for managing the lifecycle of the returned `TRITONSERVER_Error` +object. You must delete the error object using +`TRITONSERVER_ErrorDelete` when you are done using it. Macros such as +`FAIL_IF_ERR` shown in [common.h](https://github.com/triton-inference-server/server/blob/main/src/common.h) are useful for +managing error object lifetimes. + +## Versioning and Backwards Compatibility + +A typical pattern, demonstrated in [simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc) and +shown below, shows how you can compare the Server API version provided +by the shared library against the Server API version that you compiled +your application against. The Server API is backwards compatible, so +as long as the major version provided by the shared library matches +the major version that you compiled against, and the minor version +provided by the shared library is greater-than-or-equal to the minor +version that you compiled against, then your application can use the +Server API. + +``` +#include "tritonserver.h" +// Error checking removed for clarity... +uint32_t api_version_major, api_version_minor; +TRITONSERVER_ApiVersion(&api_version_major, &api_version_minor); +if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major) || + (TRITONSERVER_API_VERSION_MINOR > api_version_minor)) { + // Error, the shared library implementing the Server API is older than + // the version of the Server API that you compiled against. +} +``` + +### Non-Inference APIs + +The Server API contains functions for checking health and readiness, +getting model information, getting model statistics and metrics, +loading and unloading models, etc. The use of these functions is +straightforward and some of these functions are demonstrated in +[simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc) and all are documented in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + +### Inference APIs + +Performing an inference request requires the use of many Server API +functions and objects, as demonstrated in +[simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc). The general usage requires the +following steps. + +* Create a `TRITONSERVER_ResponseAllocator` using + `TRITONSERVER_ResponseAllocatorNew`. You can use the same response + allocator for all of your inference requests, or you can create + multiple response allocators. When Triton produces an output + tensor, it needs a memory buffer into which it can store the + contents of that tensor. Triton defers the allocation of these + output buffers by invoking callback functions in your + application. You communicate these callback functions to Triton with + the `TRITONSERVER_ResponseAllocator` object. You must implement two + callback functions, one for buffer allocation and one for buffer + free. The signatures for these functions are + `TRITONSERVER_ResponseAllocatorAllocFn_t` and + `TRITONSERVER_ResponseAllocatorReleaseFn_t` as defined in + [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). In + [simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc), these callback functions are + implemented as `ResponseAlloc` and `ResponseRelease`. 
+ +* Create an inference request as a `TRITONSERVER_InferenceRequest` + object. The inference request is where you specify what model you + want to use, the input tensors and their values, the output tensors + that you want returned, and other request parameters. You create an + inference request using `TRITONSERVER_InferenceRequestNew`. You + create each input tensor in the request using + `TRITONSERVER_InferenceRequestAddInput` and set the data for the + input tensor using `TRITONSERVER_InferenceRequestAppendInputData` + (or one of the `TRITONSERVER_InferenceRequestAppendInputData*` + variants defined in + [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h)). By + default, Triton will return all output tensors, but you can limit + Triton to only return some outputs by using + `TRITONSERVER_InferenceRequestAddRequestedOutput`. + + To correctly manage the lifecycle of the inference request, you must + use `TRITONSERVER_InferenceRequestSetReleaseCallback` to set a + callback into a function in your application. This callback will be + invoke by Triton to return ownership of the + `TRITONSERVER_InferenceRequest` object. Typically, in this callback + you will just delete the `TRITONSERVER_InferenceRequest` object by + using `TRITONSERVER_InferenceRequestDelete`. But you may also + implement a different lifecycle management; for example, if you are + reusing inference request objects you would want to make the object + available for reuse. + + You can optionally use `TRITONSERVER_InferenceRequestSetId` to set a + user-defined ID on the request. This ID is not used by Triton but + will be returned in the response. + + You can reuse an existing `TRITONSERVER_InferenceRequest` object for + a new inference request. A couple of examples of how this is done + and why it is useful are shown in [simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc). + +* Ask Triton to execute the inference request using + `TRITONSERVER_ServerInferAsync`. `TRITONSERVER_ServerInferAsync` is + a asynchronous call that returns immediately. The inference response + is returned via a callback into your application. You register this + callback using `TRITONSERVER_InferenceRequestSetResponseCallback` + before you invoke `TRITONSERVER_ServerInferAsync`. In + [simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc) this callback is + `InferResponseComplete`. + + When you invoke `TRITONSERVER_ServerInferAsync` and it returns + without error, you are passing ownership of the + `TRITONSERVER_InferenceRequest` object to Triton, and so you must + not access that object in any way until Triton returns ownership to + you via the callback you registered with + `TRITONSERVER_InferenceRequestSetReleaseCallback`. + +* Process the inference response. The inference response is returned + to the callback function you registered with + `TRITONSERVER_InferenceRequestSetResponseCallback`. Your callback + receives the response as a `TRITONSERVER_InferenceResponse` + object. Your callback takes ownership of the + `TRITONSERVER_InferenceResponse` object and so must free it with + `TRITONSERVER_InferenceResponseDelete` when it is no longer needed. + + The first step in processing a response is to use + `TRITONSERVER_InferenceResponseError` to check if the response is + returning an error or if it is returning valid results. 
If the + response is valid you can use + `TRITONSERVER_InferenceResponseOutputCount` to iterate over the + output tensors, and `TRITONSERVER_InferenceResponseOutput` to get + information about each output tensor. + + Note that the [simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc) example uses a + std::promise to simply wait for the response, but synchronizing + response handling in this way is not required. You can have multiple + inference requests in flight at the same time and can issue + inference requests from the same thread or from multiple different + threads. +allows Triton to be linked directly to a C/C++ application. The API +is documented in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + +A simple example using the C API can be found in +[simple.cc](https://github.com/triton-inference-server/server/blob/main/src/simple.cc). A more complicated example can be +found in the source that implements the HTTP/REST and GRPC endpoints +for Triton. These endpoints use the C API to communicate with the core +of Triton. The primary source files for the endpoints are +[grpc_server.cc](https://github.com/triton-inference-server/server/blob/main/src/grpc/grpc_server.cc) and +[http_server.cc](https://github.com/triton-inference-server/server/blob/main/src/http_server.cc). \ No newline at end of file diff --git a/docs/customization_guide/inprocess_java_api.md b/docs/customization_guide/inprocess_java_api.md new file mode 100644 index 0000000000..0e736df585 --- /dev/null +++ b/docs/customization_guide/inprocess_java_api.md @@ -0,0 +1,144 @@ + + +# Java bindings for In-Process Triton Server API + +The Triton Inference Server uses [Java CPP](https://github.com/bytedeco/javacpp) +to create bindings around Tritonserver to create Java API. + +The API is documented in +[tritonserver.java](https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/src/gen/java/org/bytedeco/tritonserver/global/tritonserver.java). +Alternatively, the user can refer to the web version [API docs](http://bytedeco.org/javacpp-presets/tritonserver/apidocs/) +generated from `tritonserver.java`. +**Note:** Currently, `tritonserver.java` contains bindings for both the `In-process C-API` +and the bindings for `C-API Wrapper`. More information about the [developer_tools/server C-API wrapper](https://github.com/triton-inference-server/developer_tools/blob/main/server/README.md) can be found in the [developer_tools repository](https://github.com/triton-inference-server/developer_tools/). + +A simple example using the Java API can be found in +[Samples folder](https://github.com/bytedeco/javacpp-presets/tree/master/tritonserver/samples) +which includes `Simple.java` which is similar to +[`simple.cc`](https://github.com/triton-inference-server/server/blob/main/src/simple.cc). +Please refer to +[sample usage documentation](https://github.com/bytedeco/javacpp-presets/tree/master/tritonserver#sample-usage) +to learn about how to build and run `Simple.java`. + +In the [QA folder](https://github.com/triton-inference-server/server/blob/main/qa), folders starting with L0_java include Java API tests. +These can be useful references for getting started, such as the +[ResNet50 test](https://github.com/triton-inference-server/server/blob/main/qa/L0_java_resnet). + +## Java API setup instructions + +To use the Tritonserver Java API, you will need to have the Tritonserver library +and dependencies installed in your environment. 
There are two ways to do this: + +1. Use a Tritonserver docker container with + 1. `.jar` Java bindings to C API (recommended) + 2. maven and build bindings yourself +2. Build Triton from your environment without Docker (not recommended) + +### Run Tritonserver container and install dependencies + +To set up your environment with Triton Java API, please follow the following steps: +1. First run Docker container: +``` + $ docker run -it --gpus=all -v ${pwd}:/workspace nvcr.io/nvidia/tritonserver:-py3 bash +``` +2. Install `jdk`: +```bash + $ apt update && apt install -y openjdk-11-jdk +``` +3. Install `maven` (only if you want to build the bindings yourself): +```bash +$ cd /opt/tritonserver + $ wget https://archive.apache.org/dist/maven/maven-3/3.8.4/binaries/apache-maven-3.8.4-bin.tar.gz + $ tar zxvf apache-maven-3.8.4-bin.tar.gz + $ export PATH=/opt/tritonserver/apache-maven-3.8.4/bin:$PATH +``` + +### Run Java program with Java bindings Jar + +After ensuring that Tritonserver and dependencies are installed, you can run your +Java program with the Java bindings with the following steps: + +1. Place Java bindings into your environment. You can do this by either: + + a. Building Java API bindings with provided build script: + ```bash + # Clone Triton client repo. Recommended client repo tag is: main + $ git clone --single-branch --depth=1 -b + https://github.com/triton-inference-server/client.git clientrepo + # Run build script + ## For In-Process C-API Java Bindings + $ source clientrepo/src/java-api-bindings/scripts/install_dependencies_and_build.sh + ## For C-API Wrapper (Triton with C++ bindings) Java Bindings + $ source clientrepo/src/java-api-bindings/scripts/install_dependencies_and_build.sh --enable-developer-tools-server + ``` + This will install the Java bindings to `/workspace/install/java-api-bindings/tritonserver-java-bindings.jar` + + *or* + + b. Copying "Uber Jar" from Triton SDK container to your environment + ```bash + $ id=$(docker run -dit nvcr.io/nvidia/tritonserver:-py3-sdk bash) + $ docker cp ${id}:/workspace/install/java-api-bindings/tritonserver-java-bindings.jar /tritonserver-java-bindings.jar + $ docker stop ${id} + ``` + **Note:** `tritonserver-java-bindings.jar` only includes the `In-Process Java Bindings`. To use the `C-API Wrapper Java Bindings`, please use the build script. +2. Use the built "Uber Jar" that contains the Java bindings + ```bash + $ java -cp /tritonserver-java-bindings.jar + ``` + +#### Build Java bindings and run Java program with Maven + +If you want to make changes to the Java bindings, then you can use Maven to +build yourself. You can refer to part 1.a of [Run Java program with Java +bindings Jar](#run-java-program-with-java-bindings-jar) to also build the jar +yourself without any modifications to the Tritonserver bindings in +JavaCPP-presets. +You can do this using the following steps: + +1. Create the JNI binaries in your local repository (`/root/.m2/repository`) + with [`javacpp-presets/tritonserver`](https://github.com/bytedeco/javacpp-presets/tree/master/tritonserver). + For C-API Wrapper Java bindings (Triton with C++ bindings), you need to + install some build specific dependencies including cmake and rapidjson. + Refer to [java installation script](https://github.com/triton-inference-server/client/blob/main/src/java-api-bindings/scripts/install_dependencies_and_build.sh) + for dependencies you need to install and modifications you need to make for your container. 
+After installing dependencies, you can build the tritonserver project on javacpp-presets: +```bash + $ git clone https://github.com/bytedeco/javacpp-presets.git + $ cd javacpp-presets + $ mvn clean install --projects .,tritonserver + $ mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform=linux-x86_64 +``` +2. Create your custom `*.pom` file for Maven. Please refer to + [samples/simple/pom.xml](https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/samples/simple/pom.xml) as + reference for how to create your pom file. +3. After creating your `pom.xml` file you can build your application with: +```bash + $ mvn compile exec:java -Djavacpp.platform=linux-x86_64 -Dexec.args="" +``` \ No newline at end of file diff --git a/docs/exclusions.txt b/docs/exclusions.txt new file mode 100644 index 0000000000..3bc5006471 --- /dev/null +++ b/docs/exclusions.txt @@ -0,0 +1,3 @@ +README.md +examples/README.md +user_guide/perf_analyzer.md diff --git a/docs/generate_docs.py b/docs/generate_docs.py index 9bc3fd0878..065c14de1e 100755 --- a/docs/generate_docs.py +++ b/docs/generate_docs.py @@ -34,8 +34,6 @@ from collections import defaultdict from functools import partial -from conf import exclude_patterns - # Global constants server_abspath = os.environ.get("SERVER_ABSPATH", os.getcwd()) server_docs_abspath = os.path.join(server_abspath, "docs") @@ -65,13 +63,20 @@ # Hyperlink in a .md file, excluding embedded images. hyperlink_reg = re.compile(r"((? + +1. ## Retrieve and launch the Docker container (optional) + + + + # Pre-install the environment using the NVIDIA Container Toolkit to avoid manual environment configuration + docker run --rm --ipc=host --runtime=nvidia --gpus '"device=0"' --entrypoint /bin/bash -it nvidia/cuda:12.4.1-devel-ubuntu22.04 + +2. ## Install TensorRT-LLM + + + + # Install dependencies, TensorRT-LLM requires Python 3.10 + apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs + + # Install TensorRT-LLM (v0.11.0) + pip3 install tensorrt_llm==0.11.0 --extra-index-url https://pypi.nvidia.com + + # Check installation + python3 -c "import tensorrt_llm" + +3. ## Clone the TRT-LLM repo with the Phi-3 conversion script + + + + git clone -b v0.11.0 https://github.com/NVIDIA/TensorRT-LLM.git + cd TensorRT-LLM/examples/phi/ + + # only need to install requirements.txt if you want to test the summarize.py example + # if so, modify requirements.txt such that tensorrt_llm==0.11.0 + # pip install -r requirements.txt + + +## Build the TRT-LLM Engine + +Reference: + +4. ## Download Phi-3-mini-4k-instruct + + + + git lfs install + git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct + +5. ## Convert weights from HF Transformers to TensorRT-LLM format + + + + python3 ./convert_checkpoint.py \ + --model_dir ./Phi-3-mini-4k-instruct \ + --output_dir ./phi-checkpoint \ + --dtype float16 + +6. ## Build TensorRT engine(s) + + + + # Build a float16 engine using a single GPU and HF weights. + # Enable several TensorRT-LLM plugins to increase runtime performance. It also helps with build time. + # --tp_size and --pp_size are the model shard size + trtllm-build \ + --checkpoint_dir ./phi-checkpoint \ + --output_dir ./phi-engine \ + --gemm_plugin float16 \ + --max_batch_size 8 \ + --max_input_len 1024 \ + --max_seq_len 2048 \ + --tp_size 1 \ + --pp_size 1 + +7. 
## Run the model + + + + python3 ../run.py --engine_dir ./phi-engine \ + --max_output_len 500 \ + --tokenizer_dir ./Phi-3-mini-4k-instruct \ + --input_text "How do I count to nine in French?" + +8. ## Summarization test using the Phi model + +The TensorRT-LLM Phi model can be tested to summarize the articles from the [cnn\_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset. For each summary, the script can compute the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_\(metric\)) scores and use the ROUGE-1 score to validate the implementation. The script can also perform the same summarization using the HF Phi model. + + # Run the summarization task using a TensorRT-LLM model and a single GPU. + python3 ../summarize.py --engine_dir ./phi-engine \ + --hf_model_dir ./Phi-3-mini-4k-instruct \ + --batch_size 1 \ + --test_trt_llm \ + --test_hf \ + --data_type fp16 \ + --check_accuracy \ + --tensorrt_llm_rouge1_threshold=20 + + +## Deploy with Triton Inference Server + +9. ## Copy engine files from the Docker container to the host + + + + # In another terminal instance, before exiting the current container + docker cp : + + # For example + docker cp 452ee1c1d8a1:/TensorRT-LLM/examples/phi/phi-engine /home/user/phi-engine + +10. ## Copy the compiled model to the skeleton repository with TRT-LLM backend + + + + # After exiting the TensorRT-LLM Docker container + git clone https://github.com/triton-inference-server/tensorrtllm_backend.git + cd tensorrtllm_backend + cp ../phi-engine/* all_models/inflight_batcher_llm/tensorrt_llm/1/ + +11. ## Modify the configuration files from the model repository + +The following configuration files need to be updated: + +- ensemble/config.pbtxt + +- postprocessing/config.pbtxt + +- preprocessing/config.pbtxt + +- tensorrt\_llm/config.pbxt + +- tensorrt\_llm/1/config.json + + +### Update ensemble/config.pbtxt + + python3 tools/fill_template.py --in_place \ + all_models/inflight_batcher_llm/ensemble/config.pbtxt \ + triton_max_batch_size:128 + + +### Update preprocessing/config.pbtxt + + python3 tools/fill_template.py --in_place \ + all_models/inflight_batcher_llm/postprocessing/config.pbtxt \ + tokenizer_type:auto,\ + tokenizer_dir:../Phi-3-mini-4k-instruct,\ + triton_max_batch_size:128,\ + postprocessing_instance_count:2 + + +### Update postprocessing/config.pbtxt + + python3 tools/fill_template.py --in_place \ + all_models/inflight_batcher_llm/preprocessing/config.pbtxt \ + tokenizer_type:auto,\ + tokenizer_dir:../Phi-3-mini-4k-instruct,\ + triton_max_batch_size:128,\ + preprocessing_instance_count:2 + + +### Update tensorrt\_llm/config.pbxt + + python3 tools/fill_template.py --in_place \ + all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \ + decoupled_mode:true,\ + engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\ + max_tokens_in_paged_kv_cache:,\ + batch_scheduler_policy:guaranteed_completion,\ + kv_cache_free_gpu_mem_fraction:0.2,\ + max_num_sequences:4,\ + triton_backend:tensorrtllm,\ + triton_max_batch_size:128,\ + max_queue_delay_microseconds:10,\ + max_beam_width:1,\ + batching_strategy:inflight_fused_batching,\ + engine_dir:/opt/all_models/inflight_batcher_llm/tensorrt_llm/1,\ + max_tokens_in_paged_kv_cache:1,\ + batch_scheduler_policy:guaranteed_completion,\ + kv_cache_free_gpu_mem_fraction:0.2 + + + # manually access tensort_llm/config.pbtxt and change the CPU instances to > 1 + # unfortunately this was hard-coded and cannot be update with the above script + + # instance_group [ + # { + # count: 2 + # kind : KIND_CPU + # } + 
# ] + + +#### Max Tokens in Paged KV Cache + +This is only required for Phi-3-mini-128k-instruct, and it is not necessary to modify this parameter for Phi-3-mini-4k-instruct. + +To accommodate for the 128k context, remove the following from tensorrt\_llm/config.pbxt - which will allow the max tokens to be determined by the KV cache manager. If you don’t want to remove it, you can also set maxTokensInPagedKvCache such that it is large enough (e.g. 4096) to process at least 1 sequence to completion (i.e. must be larger than beam\_width \* tokensPerBlock \* maxBlocksPerSeq) + + parameters: { + key: "max_tokens_in_paged_kv_cache" + value: { + string_value: "4096" + } + } + + +### Update tensorrt\_llm/1/config.json + +In the engine config (tensorrtllm\_backend/all\_models/inflight\_batcher\_llm/tensorrt\_llm/1/config.json), add the following under plugin\_config + + "Use_context_fmha_for_generation": false + + # for example: + "plugin_config": { + "dtype": "float16", + "bert_attention_plugin": "auto", + "streamingllm": false, + "Use_context_fmha_for_generation": false + +The above needs to be done manually with your favorite editor. Once finished, please be sure your working directory is \~/tensorrtllm\_backend + +12. ## Delete tensorrt\_llm\_bls + + + + # Recommended to remove the BLS directory if not needed + rm -rf all_models/inflight_batcher_llm/tensorrt_llm_bls/ + +13. ## Download model repository + + + + # for tokenizer + git lfs install + git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct + +14. ## Launch Triton Inference Server (trtllm-python3-py3) + + + + docker run -it --rm --gpus all --network host --shm-size=1g \ + -v $(pwd)/all_models:/opt/all_models \ + -v $(pwd)/scripts:/opt/scripts \ + -v $(pwd)/Phi-3-mini-4k-instruct:/opt/Phi-3-mini-4k-instruct \ + nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 + + # Launch Server + python3 ../scripts/launch_triton_server.py --model_repo ../all_models/inflight_batcher_llm --world_size 1 + +15. ## Send Requests + + + + curl -X POST localhost:8000/v2/models/ensemble/generate -d \ + '{ + "text_input": "A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. How can they cross the river without anything being eaten?", + "parameters": { + "max_tokens": 256, + "bad_words":[""], + "stop_words":[""] + } + }' | jq + + +## Benchmark with GenAI-Perf + +16. ## Launch Triton Inference Server (py3-sdk) + + + + export RELEASE="24.07" + docker run -it --net=host --gpus '"device=0"' nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +17. ## Download the Phi-3 tokenizer + +Login to Hugging Face (with User Access Tokens) to get the Phi-3 tokenizer. This step is not necessary but helps with interpreting token metrics from prompts and responses. If you skip this step, be sure to remove the --tokenizer flag from the GenAI-Perf script in Step 18. + + git lfs install + git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct + + pip install huggingface_hub + huggingface-cli login --token hf_*** + +18. 
18. ## Run GenAI-Perf

    export INPUT_SEQUENCE_LENGTH=128
    export OUTPUT_SEQUENCE_LENGTH=128
    export CONCURRENCY=25

    genai-perf \
      -m ensemble \
      --service-kind triton \
      --backend tensorrtllm \
      --random-seed 123 \
      --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
      --synthetic-input-tokens-stddev 0 \
      --streaming \
      --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
      --output-tokens-stddev 0 \
      --output-tokens-mean-deterministic \
      --concurrency $CONCURRENCY \
      --tokenizer microsoft/Phi-3-mini-4k-instruct \
      --measurement-interval 4000 \
      --url localhost:8001

More details on performance benchmarking with GenAI-Perf can be found [here](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/README.md).

## Reference Configurations

All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm are shown below.
+ ensemble/config.pbtxt + + # Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + name: "ensemble" + platform: "ensemble" + max_batch_size: 128 + input [ + { + name: "text_input" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "decoder_text_input" + data_type: TYPE_STRING + dims: [ 1 ] + optional: true + }, + { + name: "image_input" + data_type: TYPE_FP16 + dims: [ 3, 224, 224 ] + optional: true + }, + { + name: "max_tokens" + data_type: TYPE_INT32 + dims: [ 1 ] + }, + { + name: "bad_words" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "stop_words" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "end_id" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "pad_id" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "top_k" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "top_p" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "temperature" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "length_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "repetition_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "min_length" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "presence_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "frequency_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "random_seed" + data_type: TYPE_UINT64 + dims: [ 1 ] + optional: true + }, + { + name: "return_log_probs" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + }, + { + name: "return_context_logits" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + }, + { + name: "return_generation_logits" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: true + }, + { + name: "beam_width" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "stream" + data_type: TYPE_BOOL + dims: [ 1 ] + optional: 
true + }, + { + name: "prompt_embedding_table" + data_type: TYPE_FP16 + dims: [ -1, -1 ] + optional: true + }, + { + name: "prompt_vocab_size" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "embedding_bias_words" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "embedding_bias_weights" + data_type: TYPE_FP32 + dims: [ -1 ] + optional: true + } + ] + output [ + { + name: "text_output" + data_type: TYPE_STRING + dims: [ -1 ] + }, + { + name: "cum_log_probs" + data_type: TYPE_FP32 + dims: [ -1 ] + }, + { + name: "output_log_probs" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + }, + { + name: "context_logits" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + }, + { + name: "generation_logits" + data_type: TYPE_FP32 + dims: [ -1, -1, -1 ] + }, + { + name: "batch_index" + data_type: TYPE_INT32 + dims: [ 1 ] + } + ] + ensemble_scheduling { + step [ + { + model_name: "preprocessing" + model_version: -1 + input_map { + key: "QUERY" + value: "text_input" + } + input_map { + key: "DECODER_QUERY" + value: "decoder_text_input" + } + input_map { + key: "IMAGE" + value: "image_input" + } + input_map { + key: "REQUEST_OUTPUT_LEN" + value: "max_tokens" + } + input_map { + key: "BAD_WORDS_DICT" + value: "bad_words" + } + input_map { + key: "STOP_WORDS_DICT" + value: "stop_words" + } + input_map { + key: "EMBEDDING_BIAS_WORDS" + value: "embedding_bias_words" + } + input_map { + key: "EMBEDDING_BIAS_WEIGHTS" + value: "embedding_bias_weights" + } + input_map { + key: "END_ID" + value: "end_id" + } + input_map { + key: "PAD_ID" + value: "pad_id" + } + input_map { + key: "PROMPT_EMBEDDING_TABLE" + value: "prompt_embedding_table" + } + output_map { + key: "REQUEST_INPUT_LEN" + value: "_REQUEST_INPUT_LEN" + } + output_map { + key: "INPUT_ID" + value: "_INPUT_ID" + } + output_map { + key: "REQUEST_DECODER_INPUT_LEN" + value: "_REQUEST_DECODER_INPUT_LEN" + } + output_map { + key: "DECODER_INPUT_ID" + value: "_DECODER_INPUT_ID" + } + output_map { + key: "REQUEST_OUTPUT_LEN" + value: "_REQUEST_OUTPUT_LEN" + } + output_map { + key: "STOP_WORDS_IDS" + value: "_STOP_WORDS_IDS" + } + output_map { + key: "BAD_WORDS_IDS" + value: "_BAD_WORDS_IDS" + } + output_map { + key: "EMBEDDING_BIAS" + value: "_EMBEDDING_BIAS" + } + output_map { + key: "OUT_END_ID" + value: "_PREPROCESSOR_END_ID" + } + output_map { + key: "OUT_PAD_ID" + value: "_PREPROCESSOR_PAD_ID" + } + output_map { + key: "OUT_PROMPT_EMBEDDING_TABLE" + value: "out_prompt_embedding_table" + } + }, + { + model_name: "tensorrt_llm" + model_version: -1 + input_map { + key: "input_ids" + value: "_INPUT_ID" + } + input_map { + key: "decoder_input_ids" + value: "_DECODER_INPUT_ID" + } + input_map { + key: "input_lengths" + value: "_REQUEST_INPUT_LEN" + } + input_map { + key: "decoder_input_lengths" + value: "_REQUEST_DECODER_INPUT_LEN" + } + input_map { + key: "request_output_len" + value: "_REQUEST_OUTPUT_LEN" + } + input_map { + key: "end_id" + value: "_PREPROCESSOR_END_ID" + } + input_map { + key: "pad_id" + value: "_PREPROCESSOR_PAD_ID" + } + input_map { + key: "embedding_bias" + value: "_EMBEDDING_BIAS" + } + input_map { + key: "runtime_top_k" + value: "top_k" + } + input_map { + key: "runtime_top_p" + value: "top_p" + } + input_map { + key: "temperature" + value: "temperature" + } + input_map { + key: "len_penalty" + value: "length_penalty" + } + input_map { + key: "repetition_penalty" + value: "repetition_penalty" + } + input_map { + key: "min_length" + value: "min_length" + } + input_map { + key: "presence_penalty" + value: 
"presence_penalty" + } + input_map { + key: "frequency_penalty" + value: "frequency_penalty" + } + input_map { + key: "random_seed" + value: "random_seed" + } + input_map { + key: "return_log_probs" + value: "return_log_probs" + } + input_map { + key: "return_context_logits" + value: "return_context_logits" + } + input_map { + key: "return_generation_logits" + value: "return_generation_logits" + } + input_map { + key: "beam_width" + value: "beam_width" + } + input_map { + key: "streaming" + value: "stream" + } + input_map { + key: "prompt_embedding_table" + value: "out_prompt_embedding_table" + } + input_map { + key: "prompt_vocab_size" + value: "prompt_vocab_size" + } + input_map { + key: "stop_words_list" + value: "_STOP_WORDS_IDS" + } + input_map { + key: "bad_words_list" + value: "_BAD_WORDS_IDS" + } + output_map { + key: "output_ids" + value: "_TOKENS_BATCH" + } + output_map { + key: "sequence_length" + value: "_SEQUENCE_LENGTH" + }, + output_map { + key: "cum_log_probs" + value: "_CUM_LOG_PROBS" + } + output_map { + key: "output_log_probs" + value: "_OUTPUT_LOG_PROBS" + }, + output_map { + key: "context_logits" + value: "_CONTEXT_LOGITS" + }, + output_map { + key: "generation_logits" + value: "_GENERATION_LOGITS" + }, + output_map { + key: "batch_index" + value: "_BATCH_INDEX" + } + }, + { + model_name: "postprocessing" + model_version: -1 + input_map { + key: "TOKENS_BATCH" + value: "_TOKENS_BATCH" + } + input_map { + key: "CUM_LOG_PROBS" + value: "_CUM_LOG_PROBS" + } + input_map { + key: "OUTPUT_LOG_PROBS" + value: "_OUTPUT_LOG_PROBS" + } + input_map { + key: "CONTEXT_LOGITS" + value: "_CONTEXT_LOGITS" + } + input_map { + key: "GENERATION_LOGITS" + value: "_GENERATION_LOGITS" + } + input_map { + key: "SEQUENCE_LENGTH" + value: "_SEQUENCE_LENGTH" + } + input_map { + key: "BATCH_INDEX" + value: "_BATCH_INDEX" + } + output_map { + key: "OUTPUT" + value: "text_output" + } + output_map { + key: "OUT_OUTPUT_LOG_PROBS" + value: "output_log_probs" + } + output_map { + key: "OUT_CUM_LOG_PROBS" + value: "cum_log_probs" + } + output_map { + key: "OUT_CONTEXT_LOGITS" + value: "context_logits" + } + output_map { + key: "OUT_GENERATION_LOGITS" + value: "generation_logits" + } + output_map { + key: "OUT_BATCH_INDEX" + value: "batch_index" + } + } + ] + } +
+ +
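The optional sampling inputs declared above (for example `temperature`, `top_k`, `top_p`, and `beam_width`) are the knobs available beyond the request shown in step 15. As a quick sanity check that the deployed ensemble really exposes them, the standard KServe model metadata endpoint can be queried once the server from step 14 is running; this is a supplementary check, not part of the original tutorial flow.

    # List the input and output names Triton reports for the ensemble model
    curl -s localhost:8000/v2/models/ensemble | jq '.inputs[].name, .outputs[].name'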
+postprocessing/config.pbtxt + + # Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + name: "postprocessing" + backend: "python" + max_batch_size: 128 + input [ + { + name: "TOKENS_BATCH" + data_type: TYPE_INT32 + dims: [ -1, -1 ] + }, + { + name: "SEQUENCE_LENGTH" + data_type: TYPE_INT32 + dims: [ -1 ] + }, + { + name: "CUM_LOG_PROBS" + data_type: TYPE_FP32 + dims: [ -1 ] + optional: true + }, + { + name: "OUTPUT_LOG_PROBS" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + optional: true + }, + { + name: "CONTEXT_LOGITS" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + optional: true + }, + { + name: "GENERATION_LOGITS" + data_type: TYPE_FP32 + dims: [ -1, -1, -1 ] + optional: true + }, + { + name: "BATCH_INDEX" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + } + ] + output [ + { + name: "OUTPUT" + data_type: TYPE_STRING + dims: [ -1 ] + }, + { + name: "OUT_CUM_LOG_PROBS" + data_type: TYPE_FP32 + dims: [ -1 ] + }, + { + name: "OUT_OUTPUT_LOG_PROBS" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + }, + { + name: "OUT_CONTEXT_LOGITS" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + }, + { + name: "OUT_GENERATION_LOGITS" + data_type: TYPE_FP32 + dims: [ -1, -1, -1 ] + }, + { + name: "OUT_BATCH_INDEX" + data_type: TYPE_INT32 + dims: [ 1 ] + } + ] + + parameters { + key: "tokenizer_dir" + value: { + string_value: "../Phi-3-mini-4k-instruct" + } + } + + parameters { + key: "skip_special_tokens" + value: { + string_value: "${skip_special_tokens}" + } + } + + instance_group [ + { + count: 4 + kind: KIND_CPU + } + ] +
+ +
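In the postprocessing config above, `skip_special_tokens` is still the unresolved placeholder `${skip_special_tokens}`, because the fill_template.py commands in step 11 do not set it. If you prefer to pin it explicitly rather than rely on the backend's default handling, the same tool can be reused as sketched below; `true` is an illustrative value, not something the tutorial prescribes.

    # Optional: resolve the remaining placeholder in postprocessing/config.pbtxt (value shown is only an example)
    python3 tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    skip_special_tokens:true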
+ preprocessing/config.pbtxt + + # Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + name: "preprocessing" + backend: "python" + max_batch_size: 128 + input [ + { + name: "QUERY" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "DECODER_QUERY" + data_type: TYPE_STRING + dims: [ 1 ] + optional: true + }, + { + name: "IMAGE" + data_type: TYPE_FP16 + dims: [ 3, 224, 224 ] + optional: true + }, + { + name: "REQUEST_OUTPUT_LEN" + data_type: TYPE_INT32 + dims: [ 1 ] + }, + { + name: "BAD_WORDS_DICT" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "STOP_WORDS_DICT" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "EMBEDDING_BIAS_WORDS" + data_type: TYPE_STRING + dims: [ -1 ] + optional: true + }, + { + name: "EMBEDDING_BIAS_WEIGHTS" + data_type: TYPE_FP32 + dims: [ -1 ] + optional: true + }, + { + name: "END_ID" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "PAD_ID" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + }, + { + name: "PROMPT_EMBEDDING_TABLE" + data_type: TYPE_FP16 + dims: [ -1, -1 ] + optional: true + allow_ragged_batch: true + } + ] + output [ + { + name: "INPUT_ID" + data_type: TYPE_INT32 + dims: [ -1 ] + }, + { + name: "REQUEST_INPUT_LEN" + data_type: TYPE_INT32 + dims: [ 1 ] + }, + { + name: "DECODER_INPUT_ID" + data_type: TYPE_INT32 + dims: [ -1 ] + }, + { + name: "REQUEST_DECODER_INPUT_LEN" + data_type: TYPE_INT32 + dims: [ 1 ] + }, + { + name: "BAD_WORDS_IDS" + data_type: TYPE_INT32 + dims: [ 2, -1 ] + }, + { + name: "STOP_WORDS_IDS" + data_type: TYPE_INT32 + dims: [ 2, -1 ] + }, + { + name: "EMBEDDING_BIAS" + data_type: TYPE_FP32 + dims: [ -1 ] + }, + { + name: "REQUEST_OUTPUT_LEN" + data_type: TYPE_INT32 + dims: [ -1 ] + }, + { + name: "OUT_END_ID" + data_type: TYPE_INT32 + dims: [ 1 ] + }, + { + name: "OUT_PAD_ID" + data_type: TYPE_INT32 + dims: [ 1 ] + }, + { + name: "OUT_PROMPT_EMBEDDING_TABLE" + data_type: TYPE_FP16 + dims: [ -1, -1 ] + } + ] + + parameters { + key: "tokenizer_dir" + 
value: { + string_value: "../Phi-3-mini-4k-instruct" + } + } + + parameters { + key: "add_special_tokens" + value: { + string_value: "${add_special_tokens}" + } + } + + parameters { + key: "visual_model_path" + value: { + string_value: "${visual_model_path}" + } + } + + parameters: { + key: "gpt_model_path" + value: { + string_value: "${engine_dir}" + } + } + + instance_group [ + { + count: 4 + kind: KIND_CPU + } + ] + +
+ +
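The preprocessing config above likewise keeps template placeholders such as `${add_special_tokens}` and `${visual_model_path}` that step 11 leaves unset. One way to see how Triton interprets these files in practice is to fetch the configuration it actually loaded; the endpoints below are standard KServe/Triton HTTP endpoints, but treat this as an optional cross-check rather than a required step.

    # Expect HTTP 200 once the server from step 14 is up
    curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

    # Inspect the configuration Triton loaded for the preprocessing model
    curl -s localhost:8000/v2/models/preprocessing/config | jq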
+ tensorrt_llm/config.pbtxt + + + # Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + name: "tensorrt_llm" + backend: "tensorrtllm" + max_batch_size: 128 + + model_transaction_policy { + decoupled: true + } + + dynamic_batching { + preferred_batch_size: [ 128 ] + max_queue_delay_microseconds: 10 + } + + input [ + { + name: "input_ids" + data_type: TYPE_INT32 + dims: [ -1 ] + allow_ragged_batch: true + }, + { + name: "input_lengths" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + }, + { + name: "request_output_len" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + }, + { + name: "draft_input_ids" + data_type: TYPE_INT32 + dims: [ -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "decoder_input_ids" + data_type: TYPE_INT32 + dims: [ -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "decoder_input_lengths" + data_type: TYPE_INT32 + dims: [ 1 ] + optional: true + reshape: { shape: [ ] } + }, + { + name: "draft_logits" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "draft_acceptance_threshold" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "end_id" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "pad_id" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "stop_words_list" + data_type: TYPE_INT32 + dims: [ 2, -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "bad_words_list" + data_type: TYPE_INT32 + dims: [ 2, -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "embedding_bias" + data_type: TYPE_FP32 + dims: [ -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "beam_width" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "temperature" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "runtime_top_k" 
+ data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "runtime_top_p" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "runtime_top_p_min" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "runtime_top_p_decay" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "runtime_top_p_reset_ids" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "len_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "early_stopping" + data_type: TYPE_BOOL + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "repetition_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "min_length" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "beam_search_diversity_rate" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "presence_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "frequency_penalty" + data_type: TYPE_FP32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "random_seed" + data_type: TYPE_UINT64 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "return_log_probs" + data_type: TYPE_BOOL + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "return_context_logits" + data_type: TYPE_BOOL + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "return_generation_logits" + data_type: TYPE_BOOL + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "stop" + data_type: TYPE_BOOL + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "streaming" + data_type: TYPE_BOOL + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + { + name: "prompt_embedding_table" + data_type: TYPE_FP16 + dims: [ -1, -1 ] + optional: true + allow_ragged_batch: true + }, + { + name: "prompt_vocab_size" + data_type: TYPE_INT32 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + # the unique task ID for the given LoRA. + # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given. + # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`. + # If the cache is full the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached. + { + name: "lora_task_id" + data_type: TYPE_UINT64 + dims: [ 1 ] + reshape: { shape: [ ] } + optional: true + }, + # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ] + # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer + # each of the in / out tensors are first flattened and then concatenated together in the format above. + # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out. 
+ { + name: "lora_weights" + data_type: TYPE_FP16 + dims: [ -1, -1 ] + optional: true + allow_ragged_batch: true + }, + # module identifier (same size a first dimension of lora_weights) + # See LoraModule::ModuleType for model id mapping + # + # "attn_qkv": 0 # compbined qkv adapter + # "attn_q": 1 # q adapter + # "attn_k": 2 # k adapter + # "attn_v": 3 # v adapter + # "attn_dense": 4 # adapter for the dense layer in attention + # "mlp_h_to_4h": 5 # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection + # "mlp_4h_to_h": 6 # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection + # "mlp_gate": 7 # for llama2 adapter for gated mlp later after attention / RMSNorm: gate + # + # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ] + { + name: "lora_config" + data_type: TYPE_INT32 + dims: [ -1, 3 ] + optional: true + allow_ragged_batch: true + } + ] + output [ + { + name: "output_ids" + data_type: TYPE_INT32 + dims: [ -1, -1 ] + }, + { + name: "sequence_length" + data_type: TYPE_INT32 + dims: [ -1 ] + }, + { + name: "cum_log_probs" + data_type: TYPE_FP32 + dims: [ -1 ] + }, + { + name: "output_log_probs" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + }, + { + name: "context_logits" + data_type: TYPE_FP32 + dims: [ -1, -1 ] + }, + { + name: "generation_logits" + data_type: TYPE_FP32 + dims: [ -1, -1, -1 ] + }, + { + name: "batch_index" + data_type: TYPE_INT32 + dims: [ 1 ] + } + ] + instance_group [ + { + count: 4 + kind : KIND_CPU + } + ] + parameters: { + key: "max_beam_width" + value: { + string_value: "1" + } + } + parameters: { + key: "FORCE_CPU_ONLY_INPUT_TENSORS" + value: { + string_value: "no" + } + } + parameters: { + key: "gpt_model_type" + value: { + string_value: "inflight_fused_batching" + } + } + parameters: { + key: "gpt_model_path" + value: { + string_value: "/opt/all_models/inflight_batcher_llm/tensorrt_llm/1" + } + } + parameters: { + key: "encoder_model_path" + value: { + string_value: "${encoder_engine_dir}" + } + } + +
+ parameters: { + key: "max_tokens_in_paged_kv_cache" + value: { + string_value: "" + } + } + parameters: { + key: "max_attention_window_size" + value: { + string_value: "${max_attention_window_size}" + } + } + parameters: { + key: "sink_token_length" + value: { + string_value: "${sink_token_length}" + } + } + parameters: { + key: "batch_scheduler_policy" + value: { + string_value: "guaranteed_completion" + } + } + parameters: { + key: "kv_cache_free_gpu_mem_fraction" + value: { + string_value: "0.2" + } + } + parameters: { + key: "kv_cache_host_memory_bytes" + value: { + string_value: "${kv_cache_host_memory_bytes}" + } + } + parameters: { + key: "kv_cache_onboard_blocks" + value: { + string_value: "${kv_cache_onboard_blocks}" + } + } + # enable_trt_overlap is deprecated and doesn't have any effect on the runtime + # parameters: { + # key: "enable_trt_overlap" + # value: { + # string_value: "${enable_trt_overlap}" + # } + # } + parameters: { + key: "exclude_input_in_output" + value: { + string_value: "${exclude_input_in_output}" + } + } + parameters: { + key: "cancellation_check_period_ms" + value: { + string_value: "${cancellation_check_period_ms}" + } + } + parameters: { + key: "stats_check_period_ms" + value: { + string_value: "${stats_check_period_ms}" + } + } + parameters: { + key: "iter_stats_max_iterations" + value: { + string_value: "${iter_stats_max_iterations}" + } + } + parameters: { + key: "request_stats_max_iterations" + value: { + string_value: "${request_stats_max_iterations}" + } + } + parameters: { + key: "enable_kv_cache_reuse" + value: { + string_value: "${enable_kv_cache_reuse}" + } + } + parameters: { + key: "normalize_log_probs" + value: { + string_value: "${normalize_log_probs}" + } + } + parameters: { + key: "enable_chunked_context" + value: { + string_value: "${enable_chunked_context}" + } + } + parameters: { + key: "gpu_device_ids" + value: { + string_value: "${gpu_device_ids}" + } + } + parameters: { + key: "lora_cache_optimal_adapter_size" + value: { + string_value: "${lora_cache_optimal_adapter_size}" + } + } + parameters: { + key: "lora_cache_max_adapter_size" + value: { + string_value: "${lora_cache_max_adapter_size}" + } + } + parameters: { + key: "lora_cache_gpu_memory_fraction" + value: { + string_value: "${lora_cache_gpu_memory_fraction}" + } + } + parameters: { + key: "lora_cache_host_memory_bytes" + value: { + string_value: "${lora_cache_host_memory_bytes}" + } + } + parameters: { + key: "decoding_mode" + value: { + string_value: "${decoding_mode}" + } + } + parameters: { + key: "executor_worker_path" + value: { + string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" + } + } + parameters: { + key: "medusa_choices" + value: { + string_value: "${medusa_choices}" + } + } + parameters: { + key: "gpu_weights_percent" + value: { + string_value: "${gpu_weights_percent}" + } + } \ No newline at end of file diff --git a/docs/getting_started/quick_deployment_by_backend.rst b/docs/getting_started/quick_deployment_by_backend.rst new file mode 100644 index 0000000000..c8e461c00c --- /dev/null +++ b/docs/getting_started/quick_deployment_by_backend.rst @@ -0,0 +1,17 @@ +#### +Quick Deployment Guide by backend +#### + +.. include:: quick_start.rst + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + TRT-LLM + vLLM <../tutorials/Popular_Models_Guide/Llama2/vllm_guide.md> + Python with HuggingFace <../tutorials/Quick_Deploy/HuggingFaceTransformers/README.md> + PyTorch <../tutorials/Quick_Deploy/PyTorch/README.md> + ONNX <../tutorials/Quick_Deploy/ONNX/README.md> + TensorFlow <../tutorials/Quick_Deploy/TensorFlow/README.md> + Openvino <../tutorials/Quick_Deploy/OpenVINO/README.md> \ No newline at end of file diff --git a/docs/getting_started/quick_start.rst b/docs/getting_started/quick_start.rst new file mode 100644 index 0000000000..8af21534a3 --- /dev/null +++ b/docs/getting_started/quick_start.rst @@ -0,0 +1,175 @@ +.. raw:: html + + + +Quickstart +========== + +**New to Triton Inference Server and want do just deploy your model +quickly?** Make use of `these +tutorials <../tutorials/README.html#quick-deploy>`__ to begin your Triton +journey! + +The Triton Inference Server is available as `buildable source +code <../customization_guide/build.html>`__, but the easiest way to +install and run Triton is to use the pre-built Docker image available +from the `NVIDIA GPU Cloud (NGC) `__. + +Launching and maintaining Triton Inference Server revolves around the +use of building model repositories. This tutorial will cover: + +- Creating a Model Repository +- Launching Triton +- Send an Inference Request + +Create A Model Repository +------------------------- + +The `model repository <../user_guide/model_repository.html>`__ is the +directory where you place the models that you want Triton to serve. An +example model repository is included in the +`docs/examples/model_repository `__. +Before using the repository, you must fetch any missing model definition +files from their public model zoos via the provided script. + +:: + + $ cd docs/examples + $ ./fetch_models.sh + +Launch Triton +------------- + +Triton is optimized to provide the best inferencing performance by using +GPUs, but it can also work on CPU-only systems. In both cases you can +use the same Triton Docker image. + +Run on System with GPUs +~~~~~~~~~~~~~~~~~~~~~~~ + +Use the following command to run Triton with the example model +repository you just created. The `NVIDIA Container +Toolkit `__ must be installed +for Docker to recognize the GPU(s). The –gpus=1 flag indicates that 1 +system GPU should be made available to Triton for inferencing. + +:: + + $ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models + +Where is the version of Triton that you want to use (and pulled +above). After you start Triton you will see output on the console +showing the server starting up and loading the model. When you see +output like the following, Triton is ready to accept inference requests. + +:: + + +----------------------+---------+--------+ + | Model | Version | Status | + +----------------------+---------+--------+ + | | | READY | + | .. | . | .. | + | .. | . | .. | + +----------------------+---------+--------+ + ... + ... + ... + I1002 21:58:57.891440 62 grpc_server.cc:3914] Started GRPCInferenceService at 0.0.0.0:8001 + I1002 21:58:57.893177 62 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000 + I1002 21:58:57.935518 62 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002 + +All the models should show “READY” status to indicate that they loaded +correctly. If a model fails to load the status will report the failure +and a reason for the failure. 
If your model is not displayed in the +table check the path to the model repository and your CUDA drivers. + +Run on CPU-Only System +~~~~~~~~~~~~~~~~~~~~~~ + +On a system without GPUs, Triton should be run without using the –gpus +flag to Docker, but is otherwise identical to what is described above. + +:: + + $ docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models + +Because the –gpus flag is not used, a GPU is not available and Triton +will therefore be unable to load any model configuration that requires a +GPU. + +Verify Triton Is Running Correctly +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use Triton’s *ready* endpoint to verify that the server and the models +are ready for inference. From the host system use curl to access the +HTTP endpoint that indicates server status. + +:: + + $ curl -v localhost:8000/v2/health/ready + ... + < HTTP/1.1 200 OK + < Content-Length: 0 + < Content-Type: text/plain + +The HTTP request returns status 200 if Triton is ready and non-200 if it +is not ready. + +Send an Inference Request +------------------------- + +Use docker pull to get the client libraries and examples image from NGC. + +:: + + $ docker pull nvcr.io/nvidia/tritonserver:-py3-sdk + +Where is the version that you want to pull. Run the client +image. + +:: + + $ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:-py3-sdk + +From within the nvcr.io/nvidia/tritonserver:-py3-sdk image, run +the example image-client application to perform image classification +using the example densenet_onnx model. + +To send a request for the densenet_onnx model use an image from the +/workspace/images directory. In this case we ask for the top 3 +classifications. + +:: + + $ /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg + Request 0, batch size 1 + Image '/workspace/images/mug.jpg': + 15.346230 (504) = COFFEE MUG + 13.224326 (968) = CUP + 10.422965 (505) = COFFEEPOT \ No newline at end of file diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md index 1d475e771e..c016572535 100644 --- a/docs/getting_started/quickstart.md +++ b/docs/getting_started/quickstart.md @@ -1,5 +1,5 @@ -::::{grid} -:reverse: -:gutter: 2 1 1 1 -:margin: 4 4 1 1 +# NVIDIA Triton Inference Server -:::{grid-item} -:columns: 4 - -```{image} ./_static/nvidia-logo-vert-rgb-blk-for-screen.png -:width: 300px -``` -::: -:::{grid-item} -:columns: 8 -:class: sd-fs-3 - -NVIDIA Triton Inference Server - -::: -:::: - -Triton Inference Server is an open source inference serving software that streamlines AI inferencing. +Triton Inference Server is an open source inference serving software that streamlines +AI inferencing. Triton Inference Server enables teams to deploy any AI model from multiple deep +learning and machine learning frameworks, including TensorRT, TensorFlow, +PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference +across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM +CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance +for many query types, including real time, batched, ensembles and audio/video +streaming. 
Triton inference Server is part of +[NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/), +a software platform that accelerates the data science pipeline and streamlines +the development and deployment of production AI. + +[Please visit Deep Learning Framework (DLFW) website for the complete compatibility matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). + +# Release Compatibility Matrix + 1. [Container Name: trtllm-python-py3](#container-name-trtllm-python-py3) + 2. [Container Name: vllm-python-py3](#container-name-vllm-python-py3) + 3. [ONNX Versions](#onnx-versions) + +## Container Name: trtllm-python-py3 + +| Triton release version | NGC Tag | Python version | Torch version | TensorRT version | TensorRT-LLM version | CUDA version | CUDA Driver version | Size | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | +| 24.10 | nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 | Python 3.10.12 | 2.4.0a0%2B3bcc3cddb5.nv24.7 | 10.4.0 | 0.14.0 | 12.5.1.007 | 555.42.06 | 21G | +| 24.09 | nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3 | Python 3.10.12 | 2.4.0a0%2B3bcc3cddb5.nv24.7 | 10.4.0 | 0.13.0 | 12.5.1.007 | 555.42.06 | 21G | +| 24.08 | nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 | Python 3.10.12 | 2.4.0a0%2B3bcc3cddb5.nv24.7 | 10.3.0 | 0.12.0 | 12.5.1.007 | 555.42.06 | 21G | +| 24.07 | nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 | Python 3.10.12 | 2.4.0a0%2B07cecf4168.nv24.5 | 10.1.0 | 0.11.0 | 12.4.1.003 | 550.54.15 | 23G | +| 24.06 | nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 | Python 3.10.12 | 2.3.0a0%2B40ec155e58.nv24.3 | 10.0.1 | 0.10.0 | 12.4.0.041 | 550.54.14 | 31G | +| 24.05 | nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 | Python 3.10.12 | 2.3.0a0%2Bebedce2 | 10.0.1.6 | 0.9.0 | 12.3.2.001 | 545.23.08 | 34G | +| 24.04 | nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 | Python 3.10.12 | 2.3.0a0%2Bebedce2 | 9.3.0.post12.dev1 | 0.9.0 | 12.3.2.001 | 545.23.08 | 34G | + +## Container Name: vllm-python-py3 + +| Triton release version | NGC Tag | Python version | vLLM version | CUDA version | CUDA Driver version | Size | +| --- | --- | --- | --- | --- | --- | --- | +| 24.10 | nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3 | Python 3.10.12 | 0.5.5 | 12.6.2.004 | 560.35.03 | 19G | +| 24.09 | nvcr.io/nvidia/tritonserver:24.09-vllm-python-py3 | Python 3.10.12 | 0.5.3.post1 | 12.6.1.006 | 560.35.03 | 19G | +| 24.08 | nvcr.io/nvidia/tritonserver:24.08-vllm-python-py3 | Python 3.10.12 | 0.5.0 post1 | 12.6.0.022 | 560.35.03 | 19G | +| 24.07 | nvcr.io/nvidia/tritonserver:24.07-vllm-python-py3 | Python 3.10.12 | 0.5.0 post1 | 12.5.1 | 555.42.06 | 19G | +| 24.06 | nvcr.io/nvidia/tritonserver:24.06-vllm-python-py3 | Python 3.10.12 | 0.4.3 | 12.5.0.23 | 555.42.02 | 18G | +| 24.05 | nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 | Python 3.10.12 | 0.4.0 post1 | 12.4.1 | 550.54.15 | 18G | +| 24.04 | nvcr.io/nvidia/tritonserver:24.04-vllm-python-py3 | Python 3.10.12 | 0.4.0 post1 | 12.4.1 | 550.54.15 | 17G | + +## ONNX Versions + +| Triton release version | ONNX Runtime | +| --- | --- | +| 24.10 | 1.19.2 | +| 24.09 | 1.19.2 | +| 24.08 | 1.18.1 | +| 24.07 | 1.18.1 | +| 24.06 | 1.18.0 | +| 24.05 | 1.18.0 | +| 24.04 | 1.17.3 | diff --git a/docs/introduction/index.md b/docs/introduction/index.md new file mode 100644 index 0000000000..306c2082e7 --- /dev/null +++ b/docs/introduction/index.md @@ -0,0 +1,121 @@ + + +# NVIDIA Triton Inference Server + +Triton Inference Server is an open 
source inference serving software that streamlines +AI inferencing. Triton Inference Server enables teams to deploy any AI model from multiple deep +learning and machine learning frameworks, including TensorRT, TensorFlow, +PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference +across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM +CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance +for many query types, including real time, batched, ensembles and audio/video +streaming. Triton inference Server is part of +[NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/), +a software platform that accelerates the data science pipeline and streamlines +the development and deployment of production AI. + + + +
+ +
+ + + +## Triton Architecture + +The following figure shows the Triton Inference Server high-level +architecture. The [model repository](user_guide/model_repository.md) is a +file-system based repository of the models that Triton will make +available for inferencing. Inference requests arrive at the server via +either [HTTP/REST or GRPC](customization_guide/inference_protocols.md) or by the [C +API](customization_guide/inference_protocols.md) and are then routed to the appropriate per-model +scheduler. Triton implements [multiple scheduling and batching +algorithms](#models-and-schedulers) that can be configured on a +model-by-model basis. Each model's scheduler optionally performs +batching of inference requests and then passes the requests to the +[backend](https://github.com/triton-inference-server/backend/blob/main/README.md) +corresponding to the model type. The backend performs inferencing +using the inputs provided in the batched requests to produce the +requested outputs. The outputs are then returned. + +Triton supports a [backend C +API](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api) +that allows Triton to be extended with new functionality such as +custom pre- and post-processing operations or even a new deep-learning +framework. + +The models being served by Triton can be queried and controlled by a +dedicated [model management API](user_guide/model_management.md) that is +available by HTTP/REST or GRPC protocol, or by the C API. + +Readiness and liveness health endpoints and utilization, throughput +and latency metrics ease the integration of Triton into deployment +framework such as Kubernetes. + +![Triton Architecture Diagram](../user_guide/images/arch.jpg) + +## Triton major features + +Major features include: + +- [Supports multiple deep learning + frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton) +- [Supports multiple machine learning + frameworks](https://github.com/triton-inference-server/fil_backend) +- [Concurrent model + execution](user_guide/model_execution.md#concurrent-model-execution) +- [Dynamic batching](user_guide/batcher.md#dynamic-batcher) +- [Sequence batching](user_guide/batcher.md#sequence-batcher) and + [implicit state management](user_guide/implicit_state_management.md#implicit-state-management) + for stateful models +- Provides [Backend API](https://github.com/triton-inference-server/backend) that + allows adding custom backends and pre/post processing operations +- Model pipelines using + [Ensembling](user_guide/ensemble_models.md#ensemble-models) or [Business + Logic Scripting + (BLS)](user_guide/bls.md#business-logic-scripting) +- [HTTP/REST and GRPC inference + protocols](customization_guide/inference_protocols.md) based on the community + developed [KServe + protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) +- A [C API](customization_guide/inprocess_c_api.md) and + [Java API](customization_guide/inprocess_java_api.md) + allow Triton to link directly into your application for edge and other in-process use cases +- [Metrics](user_guide/metrics.md) indicating GPU utilization, server + throughput, server latency, and more + +Join the [Triton and TensorRT community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and stay current on the latest product updates, bug fixes, content, best +practices, and more. Need enterprise support? 
NVIDIA global support is available +for Triton Inference Server with the [NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/). + +See the [Latest Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/) for updates on the newest features and bug fixes. \ No newline at end of file diff --git a/docs/introduction/release_notes.md b/docs/introduction/release_notes.md new file mode 100644 index 0000000000..63f72e0c15 --- /dev/null +++ b/docs/introduction/release_notes.md @@ -0,0 +1,124 @@ + +# [Triton Inference Server Release 24.10](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-10.html#rel-24-10) + +The Triton Inference Server container image, release 24.10, is available on [NGC](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver) and is open source on [GitHub](https://github.com/triton-inference-server/server). + + +## **Contents of the Triton Inference Server container** + +The [Triton Inference Server](https://github.com/triton-inference-server/server) Docker image contains the inference server executable and related shared libraries in `/opt/tritonserver`. + +For a complete list of what the container includes, refer to [Deep Learning Frameworks Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). + +The container also includes the following: + +- [Ubuntu 22.04](http://releases.ubuntu.com/22.04/) including [Python 3.10](https://www.python.org/downloads/release/python-3100/) + +- [NVIDIA CUDA 12.6.2](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) + +- [NVIDIA cuBLAS 12.6.3.3](https://docs.nvidia.com/cuda/cublas/index.html) + +- [cuDNN 9.5.0.50](https://docs.nvidia.com/deeplearning/cudnn/release-notes/) + +- [NVIDIA NCCL 2.22.3](https://docs.nvidia.com/deeplearning/nccl/release-notes/) (optimized for [NVIDIA NVLink](http://www.nvidia.com/object/nvlink.html)®) + +- [NVIDIA TensorRT™ 10.5.0.18](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html) + +- OpenUCX 1.15.0 + +- GDRCopy 2.3 + +- NVIDIA HPC-X 2.20 + +- OpenMPI 4.1.7 + +- [FIL](https://github.com/triton-inference-server/fil_backend) + +- [NVIDIA DALI® 1.42](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) + +- [nvImageCodec 0.2.0.7](https://docs.nvidia.com/cuda/nvimagecodec/release_notes_v0.2.0.html) + +- ONNX Runtime 1.19.2 + +- Intel[ OpenVINO ](https://github.com/openvinotoolkit/openvino/tree/2022.1.0)2024.0.0 + +- DCGM 3.2.6 + +- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/) version [release/0.13.0](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.13.0) + +- [vLLM](https://github.com/vllm-project/vllm) version 0.5.3 post 1 + + +## **Driver Requirements** + +Release 24.10 is based on [CUDA 12.6.2](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) which requires [NVIDIA Driver](http://www.nvidia.com/Download/index.aspx?lang=en-us) release 560 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later R535), or 545.23 (or later R545). + +The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, R450, R460, R510, R520, R530, R545 and R555 drivers, which are not forward-compatible with CUDA 12.6. 
For a complete list of supported drivers, see the [CUDA Application Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package) topic. For more information, see [CUDA Compatibility and Upgrades](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades). + + +## **GPU Requirements** + +Release 24.10 supports CUDA compute capability 6.0 and later. This corresponds to GPUs in the NVIDIA Pascal, NVIDIA Volta™, NVIDIA Turing™, NVIDIA Ampere architecture, NVIDIA Hopper™, and NVIDIA Ada Lovelace architecture families. For a list of GPUs to which this compute capability corresponds, see [CUDA GPUs](https://developer.nvidia.com/cuda-gpus). For additional support details, see [Deep Learning Frameworks Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). + + +## **Key Features and Enhancements** + +This Inference Server release includes the following key features and enhancements. + +- Optimized vLLM performance with custom metrics. + +## **Known Issues** +- Numpy 2.x is not currently supported for Python Backend models and may cause them to return empty tensors unxpectedly, please use Numpy 1.x until support is added. +- To build the Llama 3.1 engine inside the 24.09-trtllm-python-py3 image, make sure to upgrade the transformer library to 4.43+ due to the bug in 4.43.x. One option to do so is to run `pip install -U transformers`. For more information, please refer to the discussion: https://github.com/NVIDIA/TensorRT-LLM/issues/2121. +- Triton vLLM container comes with the vLLM version, which has a known vulnerability: https://github.com/advisories/GHSA-w2r7-9579-27hf. Note, that the affected code is not invoked at runtime, therefore the Triton vLLM container is not affected by this issue. +- When running Torch TRT models, the output may differ from running the same model on a previous release. +- When using TensorRT models, if auto-complete configuration is disabled and is_non_linear_format_io:true for reformat-free tensors is not provided in the model configuration, the model may not load successfully. +- When using Python models indecoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly. +- Restart support was temporarily removed for Python models. +- Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.14.0 and built out of nvcr.io/nvidia/tritonserver:24.07-py3-min. Please refer to the Triton TRT-LLM - - Container Support Matrix section in the GitHub release note for more details. +- The Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and default "distributed_executor_backend" setting when using explicit model control mode. In attempt to load a vllm model (tp > 1) in explicit mode, users could potentially see failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, after server shutdown, vllm related sub-processes are not killed. Related vllm issue: https://github.com/vllm-project/vllm/issues/6766 . Please specify distributed_executor_backend:ray in the model.json when deploying vllm models with tensor parallelism > 1. 
+ +- When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter config : instead of custom configuration file in the following format: file:configs/.pbtxt : . +- Perf Analyzer no longer supports the --trace-file option. +- TensorRT-LLM backend provides limited support of Triton extensions and features. +- The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing. +- The Java CAPI is known to have intermittent segfaults. +- Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case. +- Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config. +- Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug:https://github.com/pytorch/pytorch/issues/38273. +- Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed. +- Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information. +- Triton cannot retrieve GPU metrics with MIG-enabled GPU devices. +- Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container. +- When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown. +- Python backend support for Windows is limited and does not currently support the following features: + - GPU tensors + - CPU and GPU-related metrics + - Custom execution environments + - The model load/unload APIs \ No newline at end of file diff --git a/docs/perf_benchmark/genai-perf-README.rst b/docs/perf_benchmark/genai-perf-README.rst new file mode 100644 index 0000000000..ea6a2d0d01 --- /dev/null +++ b/docs/perf_benchmark/genai-perf-README.rst @@ -0,0 +1,686 @@ +.. raw:: html + + + +GenAI-Perf +========== + +GenAI-Perf is a command line tool for measuring the throughput and +latency of generative AI models as served through an inference server. +For large language models (LLMs), GenAI-Perf provides metrics such as +`output token throughput <#output_token_throughput_metric>`__, `time to +first token <#time_to_first_token_metric>`__, `inter token +latency <#inter_token_latency_metric>`__, and `request +throughput <#request_throughput_metric>`__. For a full list of metrics +please see the `Metrics section <#metrics>`__. 
+ +Users specify a model name, an inference server URL, the type of inputs +to use (synthetic or from dataset), and the type of load to generate +(number of concurrent requests, request rate). + +GenAI-Perf generates the specified load, measures the performance of the +inference server and reports the metrics in a simple table as console +output. The tool also logs all results in a csv and json file that can +be used to derive additional metrics and visualizations. The inference +server must already be running when GenAI-Perf is run. + +You can use GenAI-Perf to run performance benchmarks on - `Large +Language Models `__ - `Vision Language +Models `__ - `Embedding +Models `__ - `Ranking Models `__ - +`Multiple LoRA Adapters `__ + + [!Note] GenAI-Perf is currently in early release and under rapid + development. While we will try to remain consistent, command line + options and functionality are subject to change as the tool matures. + +.. raw:: html + + + +Installation +------------ + +The easiest way to install GenAI-Perf is through `Triton Server SDK +container `__. +Install the latest release using the following command: + +.. code:: bash + + export RELEASE="yy.mm" # e.g. export RELEASE="24.06" + + docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + + # Check out genai_perf command inside the container: + genai-perf --help + +.. raw:: html + +
+ +Alternatively, to install from source: + +Since GenAI-Perf depends on Perf Analyzer, you’ll need to install the +Perf Analyzer binary: + +Install Perf Analyzer (Ubuntu, Python 3.8+) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**NOTE**: you must already have CUDA 12 installed (checkout the `CUDA +installation +guide `__). + +.. code:: bash + + pip install tritonclient + + apt update && apt install -y --no-install-recommends libb64-0d libcurl4 + +You can also build Perf Analyzer `from +source <../docs/install.md#build-from-source>`__ as well. + +Install GenAI-Perf from source +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: bash + + git clone https://github.com/triton-inference-server/perf_analyzer.git && cd perf_analyzer + + pip install -e genai-perf + +.. raw:: html + +
+ +.. raw:: html + + + +Quick Start +----------- + +In this quick start, we will use GenAI-Perf to run performance +benchmarking on the GPT-2 model running on Triton Inference Server with +a TensorRT-LLM engine. + +Serve GPT-2 TensorRT-LLM model using Triton CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can follow the `quickstart +guide `__ +on Triton CLI github repo to run GPT-2 model locally. The full +instructions are copied below for convenience: + +.. code:: bash + + # This container comes with all of the dependencies for building TRT-LLM engines + # and serving the engine with Triton Inference Server. + docker run -ti \ + --gpus all \ + --network=host \ + --shm-size=1g --ulimit memlock=-1 \ + -v /tmp:/tmp \ + -v ${HOME}/models:/root/models \ + -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \ + nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 + + # Install the Triton CLI + pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.8 + + # Build TRT LLM engine and generate a Triton model repository pointing at it + triton remove -m all + triton import -m gpt2 --backend tensorrtllm + + # Start Triton pointing at the default model repository + triton start + +Running GenAI-Perf +~~~~~~~~~~~~~~~~~~ + +Now we can run GenAI-Perf from Triton Inference Server SDK container: + +.. code:: bash + + export RELEASE="yy.mm" # e.g. export RELEASE="24.06" + + docker run -it --net=host --rm --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + + # Run GenAI-Perf in the container: + genai-perf profile \ + -m gpt2 \ + --service-kind triton \ + --backend tensorrtllm \ + --num-prompts 100 \ + --random-seed 123 \ + --synthetic-input-tokens-mean 200 \ + --synthetic-input-tokens-stddev 0 \ + --streaming \ + --output-tokens-mean 100 \ + --output-tokens-stddev 0 \ + --output-tokens-mean-deterministic \ + --tokenizer hf-internal-testing/llama-tokenizer \ + --concurrency 1 \ + --measurement-interval 4000 \ + --profile-export-file my_profile_export.json \ + --url localhost:8001 + +Example output: + +:: + + LLM Metrics + ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓ + ┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ + ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩ + │ Time to first token (ms) │ 11.70 │ 9.88 │ 17.21 │ 14.35 │ 12.01 │ 11.87 │ + │ Inter token latency (ms) │ 1.46 │ 1.08 │ 1.89 │ 1.87 │ 1.62 │ 1.52 │ + │ Request latency (ms) │ 161.24 │ 153.45 │ 200.74 │ 200.66 │ 179.43 │ 162.23 │ + │ Output sequence length │ 103.39 │ 95.00 │ 134.00 │ 120.08 │ 107.30 │ 105.00 │ + │ Input sequence length │ 200.01 │ 200.00 │ 201.00 │ 200.13 │ 200.00 │ 200.00 │ + └──────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘ + Output token throughput (per sec): 635.61 + Request throughput (per sec): 6.15 + +See `Tutorial `__ for additional examples. + +.. raw:: html + + + +Visualization +------------- + +GenAI-Perf can also generate various plots that visualize the +performance of the current profile run. This is disabled by default but +users can easily enable it by passing the ``--generate-plots`` option +when running the benchmark: + +.. 
code:: bash + + genai-perf profile \ + -m gpt2 \ + --service-kind triton \ + --backend tensorrtllm \ + --streaming \ + --concurrency 1 \ + --generate-plots + +This will generate a `set of default +plots `__ such as: - Time to first token +(TTFT) analysis - Request latency analysis - TTFT vs Input sequence +lengths - Inter token latencies vs Token positions - Input sequence +lengths vs Output sequence lengths + +Using ``compare`` Subcommand to Visualize Multiple Runs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``compare`` subcommand in GenAI-Perf facilitates users in comparing +multiple profile runs and visualizing the differences through plots. + +Usage +^^^^^ + +Assuming the user possesses two profile export JSON files, namely +``profile1.json`` and ``profile2.json``, they can execute the +``compare`` subcommand using the ``--files`` option: + +.. code:: bash + + genai-perf compare --files profile1.json profile2.json + +Executing the above command will perform the following actions under the +``compare`` directory: 1. Generate a YAML configuration file +(e.g. ``config.yaml``) containing the metadata for each plot generated +during the comparison process. 2. Automatically generate the `default +set of plots `__ (e.g. TTFT vs. Input +Sequence Lengths) that compare the two profile runs. + +:: + + compare + ├── config.yaml + ├── distribution_of_input_sequence_lengths_to_output_sequence_lengths.jpeg + ├── request_latency.jpeg + ├── time_to_first_token.jpeg + ├── time_to_first_token_vs_input_sequence_lengths.jpeg + ├── token-to-token_latency_vs_output_token_position.jpeg + └── ... + +Customization +^^^^^^^^^^^^^ + +Users have the flexibility to iteratively modify the generated YAML +configuration file to suit their specific requirements. They can make +alterations to the plots according to their preferences and execute the +command with the ``--config`` option followed by the path to the +modified configuration file: + +.. code:: bash + + genai-perf compare --config compare/config.yaml + +This command will regenerate the plots based on the updated +configuration settings, enabling users to refine the visual +representation of the comparison results as per their needs. + +See `Compare documentation `__ for more details. + +.. raw:: html + + + +Model Inputs +------------ + +GenAI-Perf supports model input prompts from either synthetically +generated inputs, or from the HuggingFace +`OpenOrca `__ or +`CNN_DailyMail `__ +datasets. This is specified using the ``--input-dataset`` CLI option. + +When the dataset is synthetic, you can specify the following options: \* +``--num-prompts ``: The number of unique prompts to generate as +stimulus, >= 1. \* ``--synthetic-input-tokens-mean ``: The mean of +number of tokens in the generated prompts when using synthetic data, >= +1. \* ``--synthetic-input-tokens-stddev ``: The standard deviation +of number of tokens in the generated prompts when using synthetic data, +>= 0. \* ``--random-seed ``: The seed used to generate random +values, >= 0. + +When the dataset is coming from HuggingFace, you can specify the +following options: \* ``--input-dataset {openorca,cnn_dailymail}``: +HuggingFace dataset to use for benchmarking. \* ``--num-prompts ``: +The number of unique prompts to generate as stimulus, >= 1. + +When the dataset is coming from a file, you can specify the following +options: \* ``--input-file ``: The input file containing the +prompts to use for benchmarking as JSON objects. 
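
As a minimal sketch of the file-based input described above, the short
script below writes a JSONL file in which each line is a JSON object with
a ``text_input`` field (the format documented for ``--input-file``). The
``prompts.jsonl`` filename and the prompt strings are placeholders chosen
for illustration, not part of GenAI-Perf itself:

.. code:: python

   # Build a small JSONL input file for use with --input-file.
   # Each line is a standalone JSON object with a "text_input" field.
   import json

   prompts = [
       "Summarize the plot of Hamlet in two sentences.",
       "Explain dynamic batching in one paragraph.",
       "Translate 'good morning' into French.",
   ]

   with open("prompts.jsonl", "w") as f:
       for prompt in prompts:
           f.write(json.dumps({"text_input": prompt}) + "\n")

The resulting file can then be passed to GenAI-Perf via
``--input-file prompts.jsonl``.
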
+ +For any dataset, you can specify the following options: \* +``--output-tokens-mean ``: The mean number of tokens in each +output. Ensure the ``--tokenizer`` value is set correctly, >= 1. \* +``--output-tokens-stddev ``: The standard deviation of the number +of tokens in each output. This is only used when output-tokens-mean is +provided, >= 1. \* ``--output-tokens-mean-deterministic``: When using +``--output-tokens-mean``, this flag can be set to improve precision by +setting the minimum number of tokens equal to the requested number of +tokens. This is currently supported with the Triton service-kind. Note +that there is still some variability in the requested number of output +tokens, but GenAi-Perf attempts its best effort with your model to get +the right number of output tokens. + +You can optionally set additional model inputs with the following +option: \* ``--extra-inputs :``: An additional input +for use with the model with a singular value, such as ``stream:true`` or +``max_tokens:5``. This flag can be repeated to supply multiple extra +inputs. + +For `Large Language Models `__, there is no batch size +(i.e. batch size is always ``1``). Each request includes the inputs for +one individual inference. Other modes such as the +`embeddings `__ and `rankings `__ +endpoints support client-side batching, where ``--batch-size N`` means +that each request sent will include the inputs for ``N`` separate +inferences, allowing them to be processed together. + +.. raw:: html + + + +Metrics +------- + +GenAI-Perf collects a diverse set of metrics that captures the +performance of the inference server. + ++-----------------------+-----------------------+-----------------------+ +| Metric | Description | Aggregations | ++=======================+=======================+=======================+ +| Time to First Token | Time between when a | Avg, min, max, p99, | +| | request is sent and | p90, p75 | +| | when its first | | +| | response is received, | | +| | one value per request | | +| | in benchmark | | ++-----------------------+-----------------------+-----------------------+ +| Inter Token Latency | Time between | Avg, min, max, p99, | +| | intermediate | p90, p75 | +| | responses for a | | +| | single request | | +| | divided by the number | | +| | of generated tokens | | +| | of the latter | | +| | response, one value | | +| | per response per | | +| | request in benchmark | | ++-----------------------+-----------------------+-----------------------+ +| Request Latency | Time between when a | Avg, min, max, p99, | +| | request is sent and | p90, p75 | +| | when its final | | +| | response is received, | | +| | one value per request | | +| | in benchmark | | ++-----------------------+-----------------------+-----------------------+ +| Output Sequence | Total number of | Avg, min, max, p99, | +| Length | output tokens of a | p90, p75 | +| | request, one value | | +| | per request in | | +| | benchmark | | ++-----------------------+-----------------------+-----------------------+ +| Input Sequence Length | Total number of input | Avg, min, max, p99, | +| | tokens of a request, | p90, p75 | +| | one value per request | | +| | in benchmark | | ++-----------------------+-----------------------+-----------------------+ +| Output Token | Total number of | None–one value per | +| Throughput | output tokens from | benchmark | +| | benchmark divided by | | +| | benchmark duration | | ++-----------------------+-----------------------+-----------------------+ +| Request Throughput | Number of final | 
None–one value per | +| | responses from | benchmark | +| | benchmark divided by | | +| | benchmark duration | | ++-----------------------+-----------------------+-----------------------+ + +.. raw:: html + + + +Command Line Options +-------------------- + +``-h`` +'''''' + +``--help`` +'''''''''' + +Show the help message and exit. + +Endpoint Options: +~~~~~~~~~~~~~~~~~ + +``-m `` +''''''''''''' + +``--model `` +'''''''''''''''''' + +The names of the models to benchmark. A single model is recommended, +unless you are `profiling multiple LoRA adapters `__. +(default: ``None``) + +``--model-selection-strategy {round_robin, random}`` +'''''''''''''''''''''''''''''''''''''''''''''''''''' + +When multiple models are specified, this is how a specific model is +assigned to a prompt. Round robin means that each model receives a +request in order. Random means that assignment is uniformly random +(default: ``round_robin``) + +``--backend {tensorrtllm,vllm}`` +'''''''''''''''''''''''''''''''' + +When using the “triton” service-kind, this is the backend of the model. +For the TRT-LLM backend, you currently must set +``exclude_input_in_output`` to true in the model config to not echo the +input tokens in the output. (default: tensorrtllm) + +``--endpoint `` +'''''''''''''''''''' + +Set a custom endpoint that differs from the OpenAI defaults. (default: +``None``) + +``--endpoint-type {chat,completions,embeddings,rankings}`` +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The endpoint-type to send requests to on the server. This is only used +with the ``openai`` service-kind. (default: ``None``) + +``--service-kind {triton,openai}`` +'''''''''''''''''''''''''''''''''' + +The kind of service perf_analyzer will generate load for. In order to +use ``openai``, you must specify an api via ``--endpoint-type``. +(default: ``triton``) + +``--streaming`` +''''''''''''''' + +An option to enable the use of the streaming API. (default: ``False``) + +``-u `` +'''''''''''' + +``--url `` +''''''''''''''' + +URL of the endpoint to target for benchmarking. (default: ``None``) + +Input Options +~~~~~~~~~~~~~ + +``-b `` +'''''''''''' + +``--batch-size `` +'''''''''''''''''''''' + +The batch size of the requests GenAI-Perf should send. This is currently +only supported with the `embeddings `__, +image_retrieval, and `rankings `__ endpoint types. +(default: ``1``) + +``--extra-inputs `` +'''''''''''''''''''''''' + +Provide additional inputs to include with every request. You can repeat +this flag for multiple inputs. Inputs should be in an input_name:value +format. Alternatively, a string representing a json formatted dict can +be provided. (default: ``None``) + +``--input-dataset {openorca,cnn_dailymail}`` +'''''''''''''''''''''''''''''''''''''''''''' + +The HuggingFace dataset to use for prompts. (default: ``openorca``) + +``--input-file `` +''''''''''''''''''''''' + +The input file containing the prompts to use for profiling. Each line +should be a JSON object with a ‘text_input’ field in JSONL format. +Example: {"text_input": "Your prompt here"}" + +``--num-prompts `` +''''''''''''''''''''''' + +The number of unique prompts to generate as stimulus. (default: ``100``) + +``--output-tokens-mean `` +'''''''''''''''''''''''''''''' + +The mean number of tokens in each output. Ensure the ``--tokenizer`` +value is set correctly. 
(default: ``-1``) + +``--output-tokens-mean-deterministic`` +'''''''''''''''''''''''''''''''''''''' + +When using ``--output-tokens-mean``, this flag can be set to improve +precision by setting the minimum number of tokens equal to the requested +number of tokens. This is currently supported with the Triton +service-kind. Note that there is still some variability in the requested +number of output tokens, but GenAi-Perf attempts its best effort with +your model to get the right number of output tokens. (default: +``False``) + +``--output-tokens-stddev `` +'''''''''''''''''''''''''''''''' + +The standard deviation of the number of tokens in each output. This is +only used when ``--output-tokens-mean`` is provided. (default: ``0``) + +``--random-seed `` +''''''''''''''''''''''' + +The seed used to generate random values. (default: ``0``) + +``--synthetic-input-tokens-mean `` +''''''''''''''''''''''''''''''''''''''' + +The mean of number of tokens in the generated prompts when using +synthetic data. (default: ``550``) + +``--synthetic-input-tokens-stddev `` +''''''''''''''''''''''''''''''''''''''''' + +The standard deviation of number of tokens in the generated prompts when +using synthetic data. (default: ``0``) + +Profiling Options +~~~~~~~~~~~~~~~~~ + +``--concurrency `` +''''''''''''''''''''''' + +The concurrency value to benchmark. (default: ``None``) + +``--measurement-interval `` +'''''''''''''''''''''''''''''''' + +``-p `` +'''''''''''' + +The time interval used for each measurement in milliseconds. Perf +Analyzer will sample a time interval specified and take measurement over +the requests completed within that time interval. (default: ``10000``) + +``--request-rate `` +'''''''''''''''''''''''''' + +Sets the request rate for the load generated by PA. (default: ``None``) + +``-s `` +'''''''''''''' + +``--stability-percentage `` +'''''''''''''''''''''''''''''''''' + +The allowed variation in latency measurements when determining if a +result is stable. The measurement is considered as stable if the ratio +of max / min from the recent 3 measurements is within (stability +percentage) in terms of both infer per second and latency. (default: +``999``) + +Output Options +~~~~~~~~~~~~~~ + +``--artifact-dir`` +'''''''''''''''''' + +The directory to store all the (output) artifacts generated by +GenAI-Perf and Perf Analyzer. (default: ``artifacts``) + +``--generate-plots`` +'''''''''''''''''''' + +An option to enable the generation of plots. (default: False) + +``--profile-export-file `` +'''''''''''''''''''''''''''''''' + +The path where the perf_analyzer profile export will be generated. By +default, the profile export will be to ``profile_export.json``. The +genai-perf files will be exported to +``_genai_perf.json`` and +``_genai_perf.csv``. For example, if the profile +export file is ``profile_export.json``, the genai-perf file will be +exported to ``profile_export_genai_perf.csv``. (default: +``profile_export.json``) + +Other Options +~~~~~~~~~~~~~ + +``--tokenizer `` +''''''''''''''''''''' + +The HuggingFace tokenizer to use to interpret token metrics from prompts +and responses. (default: ``hf-internal-testing/llama-tokenizer``) + +``-v`` +'''''' + +``--verbose`` +''''''''''''' + +An option to enable verbose mode. (default: ``False``) + +``--version`` +''''''''''''' + +An option to print the version and exit. + +.. 
raw:: html + + + +Known Issues +------------ + +- GenAI-Perf can be slow to finish if a high request-rate is provided +- Token counts may not be exact diff --git a/docs/perf_benchmark/genai_perf.rst b/docs/perf_benchmark/genai_perf.rst new file mode 100644 index 0000000000..d621431061 --- /dev/null +++ b/docs/perf_benchmark/genai_perf.rst @@ -0,0 +1,15 @@ +#### +GenAI Performance Analyzer +#### +.. include:: genai-perf-README.rst + + +.. toctree:: + :maxdepth: 1 + :hidden: + + Large language models <../perf_analyzer/genai-perf/docs/tutorial.md> + Visual language models <../perf_analyzer/genai-perf/docs/multi_modal.md> + Embedding models <../perf_analyzer/genai-perf/docs/embeddings.md> + Ranking models <../perf_analyzer/genai-perf/docs/rankings.md> + Multiple LoRA adapters <../perf_analyzer/genai-perf/docs/lora.md> \ No newline at end of file diff --git a/docs/perf_benchmark/model-analyzer-README.rst b/docs/perf_benchmark/model-analyzer-README.rst new file mode 100644 index 0000000000..1c31a578ff --- /dev/null +++ b/docs/perf_benchmark/model-analyzer-README.rst @@ -0,0 +1,191 @@ +.. raw:: html + + + +|License| + +Triton Model Analyzer +===================== + + [!Warning] + + .. rubric:: LATEST RELEASE + :name: latest-release + + You are currently on the ``main`` branch which tracks + under-development progress towards the next release. The latest + release of the Triton Model Analyzer is 1.42.0 and is available on + branch + `r24.07 `__. + +Triton Model Analyzer is a CLI tool which can help you find a more +optimal configuration, on a given piece of hardware, for single, +multiple, ensemble, or BLS models running on a `Triton Inference +Server `__. Model +Analyzer will also generate reports to help you better understand the +trade-offs of the different configurations along with their compute and +memory requirements. + +Features +======== + +Search Modes +~~~~~~~~~~~~ + +- `Optuna Search `__ **-ALPHA + RELEASE-** allows you to search for every parameter that can be + specified in the model configuration, using a hyperparameter + optimization framework. Please see the + `Optuna `__ website if you are interested in + specific details on how the algorithm functions. + +- `Quick Search `__ will + **sparsely** search the `Max Batch + Size `__, + `Dynamic + Batching `__, + and `Instance + Group `__ + spaces by utilizing a heuristic hill-climbing algorithm to help you + quickly find a more optimal configuration + +- `Automatic Brute + Search `__ will + **exhaustively** search the `Max Batch + Size `__, + `Dynamic + Batching `__, + and `Instance + Group `__ + parameters of your model configuration + +- `Manual Brute Search `__ + allows you to create manual sweeps for every parameter that can be + specified in the model configuration + +Model Types +~~~~~~~~~~~ + +- `Ensemble `__: Model Analyzer can help + you find the optimal settings when profiling an ensemble model + +- `BLS `__: Model Analyzer can help you find + the optimal settings when profiling a BLS model + +- `Multi-Model `__: Model Analyzer can + help you find the optimal settings when profiling multiple concurrent + models + +- `LLM `__: Model Analyzer can help you find + the optimal settings when profiling Large Language Models + +Other Features +~~~~~~~~~~~~~~ + +- `Detailed and summary reports `__: Model Analyzer is + able to generate summarized and detailed reports that can help you + better understand the trade-offs between different model + configurations that can be used for your model. 
+ +- `QoS Constraints `__: Constraints can help + you filter out the Model Analyzer results based on your QoS + requirements. For example, you can specify a latency budget to filter + out model configurations that do not satisfy the specified latency + threshold. + +Examples and Tutorials +====================== + +**Single Model** +~~~~~~~~~~~~~~~~ + +See the `Single Model Quick Start `__ for a guide +on how to use Model Analyzer to profile, analyze and report on a simple +PyTorch model. + +**Multi Model** +~~~~~~~~~~~~~~~ + +See the `Multi-model Quick Start `__ for a guide +on how to use Model Analyzer to profile, analyze and report on two +models running concurrently on the same GPU. + +**Ensemble Model** +~~~~~~~~~~~~~~~~~~ + +See the `Ensemble Model Quick Start `__ +for a guide on how to use Model Analyzer to profile, analyze and report +on a simple Ensemble model. + +**BLS Model** +~~~~~~~~~~~~~ + +See the `BLS Model Quick Start `__ for a guide +on how to use Model Analyzer to profile, analyze and report on a simple +BLS model. + +Documentation +============= + +- `Installation `__ +- `Model Analyzer CLI `__ +- `Launch Modes `__ +- `Configuring Model Analyzer `__ +- `Model Analyzer Metrics `__ +- `Model Config Search `__ +- `Model Types `__ +- `Checkpointing `__ +- `Model Analyzer Reports `__ +- `Deployment with Kubernetes `__ + +Terminology +=========== + +Below are definitions of some commonly used terms in Model Analyzer: + +- **Model Type** - Category of model being profiled. Examples of this + include single, multi, ensemble, BLS, etc.. +- **Search Mode** - How Model Analyzer explores the possible + configuration space when profiling. This is either exhaustive (brute) + or heuristic (quick/optuna). +- **Model Config Search** - The cross product of model type and search + mode. +- **Launch Mode** - How the Triton Server is deployed and used by Model + Analyzer. + +Reporting problems, asking questions +==================================== + +We appreciate any feedback, questions or bug reporting regarding this +project. When help with code is needed, follow the process outlined in +the Stack Overflow (https://stackoverflow.com/help/mcve) document. +Ensure posted examples are: + +- minimal – use as little code as possible that still produces the same + problem + +- complete – provide all parts needed to reproduce the problem. Check + if you can strip external dependency and still show the problem. The + less time we spend on reproducing problems the more time we have to + fix it + +- verifiable – test the code you’re about to provide to make sure it + reproduces the problem. Remove all other problems that are not + related to your request/question. + +.. |License| image:: https://img.shields.io/badge/License-Apache_2.0-lightgrey.svg + :target: https://opensource.org/licenses/Apache-2.0 diff --git a/docs/perf_benchmark/model_analyzer.rst b/docs/perf_benchmark/model_analyzer.rst new file mode 100644 index 0000000000..d66005c336 --- /dev/null +++ b/docs/perf_benchmark/model_analyzer.rst @@ -0,0 +1,18 @@ +#### +Model Analyzer +#### + +.. include:: model-analyzer-README.rst + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + ../model_analyzer/docs/cli.md + ../model_analyzer/docs/launch_modes.md + ../model_analyzer/docs/config.md + ../model_analyzer/docs/metrics.md + ../model_analyzer/docs/config_search.md + ../model_analyzer/docs/checkpoints.md + ../model_analyzer/docs/report.md + ../model_analyzer/docs/kubernetes_deploy.md \ No newline at end of file diff --git a/docs/perf_benchmark/perf-analyzer-README.rst b/docs/perf_benchmark/perf-analyzer-README.rst new file mode 100644 index 0000000000..f51d19deb9 --- /dev/null +++ b/docs/perf_benchmark/perf-analyzer-README.rst @@ -0,0 +1,180 @@ +.. raw:: html + + + +Triton Performance Analyzer +=========================== + +Triton Performance Analyzer is CLI tool which can help you optimize the +inference performance of models running on Triton Inference Server by +measuring changes in performance as you experiment with different +optimization strategies. + +Features +======== + +Inference Load Modes +~~~~~~~~~~~~~~~~~~~~ + +- `Concurrency Mode `__ + simlulates load by maintaining a specific concurrency of outgoing + requests to the server + +- `Request Rate + Mode `__ simulates + load by sending consecutive requests at a specific rate to the server + +- `Custom Interval + Mode `__ simulates + load by sending consecutive requests at specific intervals to the + server + +Performance Measurement Modes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- `Time Windows Mode `__ + measures model performance repeatedly over a specific time interval + until performance has stabilized + +- `Count Windows Mode `__ + measures model performance repeatedly over a specific number of + requests until performance has stabilized + +Other Features +~~~~~~~~~~~~~~ + +- `Sequence Models <../user_guide/architecture.md#stateful-models>`__, + `Ensemble Models <../user_guide/architecture.md#ensemble-models>`__, + and `Decoupled Models <../user_guide/decoupled_models.md>`__ can be + profiled in addition to standard/stateless/coupled models + +- `Input Data `__ to model inferences can be + auto-generated or specified as well as verifying output + +- `TensorFlow + Serving `__ and + `TorchServe `__ can be + used as the inference server in addition to the default Triton server + +Quick Start +=========== + +The steps below will guide you on how to start using Perf Analyzer. + +Step 1: Start Triton Container +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: bash + + export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + + docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3 + + docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3 + +Step 2: Download ``simple`` Model +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: bash + + # inside triton container + git clone --depth 1 https://github.com/triton-inference-server/server + + mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository + +Step 3: Start Triton Server +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: bash + + # inside triton container + tritonserver --model-repository $(pwd)/model_repository &> server.log & + + # confirm server is ready, look for 'HTTP/1.1 200 OK' + curl -v localhost:8000/v2/health/ready + + # detach (CTRL-p CTRL-q) + +Step 4: Start Triton SDK Container +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code:: bash + + docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + + docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +Step 5: Run Perf Analyzer +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: bash + + # inside sdk container + perf_analyzer -m simple + +See the full `quick start guide `__ for additional +tips on how to analyze output. + +Documentation +============= + +- `Installation `__ +- `Perf Analyzer CLI `__ +- `Inference Load Modes `__ +- `Input Data `__ +- `Measurements & Metrics `__ +- `Benchmarking `__ + +Contributing +============ + +Contributions to Triton Perf Analyzer are more than welcome. To +contribute please review the `contribution +guidelines `__, +then fork and create a pull request. + +Reporting problems, asking questions +==================================== + +We appreciate any feedback, questions or bug reporting regarding this +project. When help with code is needed, follow the process outlined in +the Stack Overflow (https://stackoverflow.com/help/mcve) document. +Ensure posted examples are: + +- minimal - use as little code as possible that still produces the same + problem + +- complete - provide all parts needed to reproduce the problem. Check + if you can strip external dependency and still show the problem. The + less time we spend on reproducing problems the more time we have to + fix it + +- verifiable - test the code you’re about to provide to make sure it + reproduces the problem. Remove all other problems that are not + related to your request/question. diff --git a/docs/perf_benchmark/perf_analyzer.rst b/docs/perf_benchmark/perf_analyzer.rst new file mode 100644 index 0000000000..0aa5172c88 --- /dev/null +++ b/docs/perf_benchmark/perf_analyzer.rst @@ -0,0 +1,15 @@ +#### +Performance Analyzer +#### + +.. include:: perf-analyzer-README.rst + +.. toctree:: + :maxdepth: 1 + :hidden: + + ../perf_analyzer/docs/install.md + ../perf_analyzer/docs/CLI.md + ../perf_analyzer/docs/inference_load_modes.md + ../perf_analyzer/docs/input_data.md + ../perf_analyzer/docs/measurements_metrics.md \ No newline at end of file diff --git a/docs/repositories.txt b/docs/repositories.txt new file mode 100644 index 0000000000..62ecc91db5 --- /dev/null +++ b/docs/repositories.txt @@ -0,0 +1,15 @@ +backend +client +dali_backend +fil_backend +model_analyzer +model_navigator +onnxruntime_backend +perf_analyzer +python_backend +pytorch_backend +tensorflow_backend +tensorrt_backend +tensorrtllm_backend +tutorials +vllm_backend diff --git a/docs/scaling_guide/scaling_guide.rst b/docs/scaling_guide/scaling_guide.rst new file mode 100644 index 0000000000..f4d252f77e --- /dev/null +++ b/docs/scaling_guide/scaling_guide.rst @@ -0,0 +1,11 @@ +######## +Scaling guide +######## + +.. toctree:: + :hidden: + :caption: Scaling guide + :maxdepth: 2 + + Multi-Node (AWS) <../tutorials/Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/README.md> + Multi-Instance <../tutorials/Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md> diff --git a/docs/server_guide/features.rst b/docs/server_guide/features.rst new file mode 100644 index 0000000000..a14fa711c2 --- /dev/null +++ b/docs/server_guide/features.rst @@ -0,0 +1,19 @@ +######## +Features +######## + +.. 
toctree:: + :hidden: + :caption: Features + :maxdepth: 2 + + Model_execution <../user_guide/model_execution.md> + Scheduler <../user_guide/scheduler.md> + Batcher <../user_guide/batcher.md> + model_pipelines + state_management + Request Cancellation <../user_guide/request_cancellation.md> + Rate Limiter <../user_guide/rate_limiter.md> + Caching <../user_guide/response_cache.md> + Metrics <../user_guide/metrics.md> + Tracing <../user_guide/trace.md> \ No newline at end of file diff --git a/docs/server_guide/model_pipelines.rst b/docs/server_guide/model_pipelines.rst new file mode 100644 index 0000000000..5f4dcffaaa --- /dev/null +++ b/docs/server_guide/model_pipelines.rst @@ -0,0 +1,11 @@ +######## +Model Pipelines +######## + +.. toctree:: + :hidden: + :caption: Model Pipelines + :maxdepth: 2 + + Ensemble <../user_guide/ensemble_models> + Business Logic Scripting <../user_guide/bls> \ No newline at end of file diff --git a/docs/server_guide/state_management.rst b/docs/server_guide/state_management.rst new file mode 100644 index 0000000000..75f6b44b23 --- /dev/null +++ b/docs/server_guide/state_management.rst @@ -0,0 +1,10 @@ +######## +State Management +######## + +.. toctree:: + :hidden: + :caption: State Management + :maxdepth: 2 + + Implicit State Management <../user_guide/implicit_state_management.md> \ No newline at end of file diff --git a/docs/user_guide/batcher.md b/docs/user_guide/batcher.md new file mode 100644 index 0000000000..556412e455 --- /dev/null +++ b/docs/user_guide/batcher.md @@ -0,0 +1,295 @@ + + + +# Batchers + +## Dynamic Batcher + +Dynamic batching is a feature of Triton that allows inference requests +to be combined by the server, so that a batch is created +dynamically. Creating a batch of requests typically results in +increased throughput. The dynamic batcher should be used for +[stateless models](architecture.md#stateless-models). The dynamically created +batches are distributed to all [model instances](model_configuration.md#instance-groups) +configured for the model. + +Dynamic batching is enabled and configured independently for each +model using the *ModelDynamicBatching* property in the model +configuration. These settings control the preferred size(s) of the +dynamically created batches, the maximum time that requests can be +delayed in the scheduler to allow other requests to join the dynamic +batch, and queue properties such a queue size, priorities, and +time-outs. Refer to +[this guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization#what-is-dynamic-batching) +for a more detailed example of dynamic batching. + +### Recommended Configuration Process + +The individual settings are described in detail below. The following +steps are the recommended process for tuning the dynamic batcher for +each model. It is also possible to use the [Model +Analyzer](model_analyzer.md) to automatically search across different +dynamic batcher configurations. + +* Decide on a [maximum batch size](#maximum-batch-size) for the model. + +* Add the following to the model configuration to enable the dynamic + batcher with all default settings. By default the dynamic batcher + will create batches as large as possible up to the maximum batch + size and will not [delay](#delayed-batching) when forming batches. 
+ +``` + dynamic_batching { } +``` + +* Use the + [Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) + to determine the latency and throughput provided by the default dynamic + batcher configuration. + +* If the default configuration results in latency values that are + within your latency budget, try one or both of the following to + trade off increased latency for increased throughput: + + * Increase maximum batch size. + + * Set [batch delay](#delayed-batching) to a non-zero value. Try + increasing delay values until the latency budget is exceeded to + see the impact on throughput. + +* [Preferred batch sizes](#preferred-batch-sizes) should not be used + for most models. A preferred batch size(s) should only be configured + if that batch size results in significantly higher performance than + other batch sizes. + +### Preferred Batch Sizes + +The *preferred_batch_size* property indicates the batch sizes that the +dynamic batcher should attempt to create. For most models, +*preferred_batch_size* should not be specified, as described in +[Recommended Configuration +Process](#recommended-configuration-process). An exception is TensorRT +models that specify multiple optimization profiles for different batch +sizes. In this case, because some optimization profiles may give +significant performance improvement compared to others, it may make +sense to use *preferred_batch_size* for the batch sizes supported by +those higher-performance optimization profiles. + +The following example shows the configuration that enables dynamic +batching with preferred batch sizes of 4 and 8. + +``` + dynamic_batching { + preferred_batch_size: [ 4, 8 ] + } +``` + +When a model instance becomes available for inferencing, the dynamic +batcher will attempt to create batches from the requests that are +available in the scheduler. Requests are added to the batch in the +order the requests were received. If the dynamic batcher can form a +batch of a preferred size(s) it will create a batch of the largest +possible preferred size and send it for inferencing. If the dynamic +batcher cannot form a batch of a preferred size (or if the dynamic +batcher is not configured with any preferred batch sizes), it will +send a batch of the largest size possible that is less than the +maximum batch size allowed by the model (but see the following section +for the delay option that changes this behavior). + +The size of generated batches can be examined in aggregate using +[count metrics](metrics.md#inference-request-metrics). + +### Delayed Batching + +The dynamic batcher can be configured to allow requests to be delayed +for a limited time in the scheduler to allow other requests to join +the dynamic batch. For example, the following configuration sets the +maximum delay time of 100 microseconds for a request. + +``` + dynamic_batching { + max_queue_delay_microseconds: 100 + } +``` + +The *max_queue_delay_microseconds* property setting changes the +dynamic batcher behavior when a maximum size (or preferred size) batch +cannot be created. When a batch of a maximum or preferred size cannot +be created from the available requests, the dynamic batcher will delay +sending the batch as long as no request is delayed longer than the +configured *max_queue_delay_microseconds* value. If a new request +arrives during this delay and allows the dynamic batcher to form a +batch of a maximum or preferred batch size, then that batch is sent +immediately for inferencing. 
If the delay expires the dynamic batcher +sends the batch as is, even though it is not a maximum or preferred +size. + +### Preserve Ordering + +The *preserve_ordering* property is used to force all responses to be +returned in the same order as requests were received. See the +[protobuf +documentation](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto) +for details. + +### Priority Levels + +By default the dynamic batcher maintains a single queue that holds all +inference requests for a model. The requests are processed and batched +in order. The *priority_levels* property can be used to create +multiple priority levels within the dynamic batcher so that requests +with higher priority are allowed to bypass requests with lower +priority. Requests at the same priority level are processed in +order. Inference requests that do not set a priority are scheduled +using the *default_priority_level* property. + +### Queue Policy + +The dynamic batcher provides several settings that control how +requests are queued for batching. + +When *priority_levels* is not defined, the *ModelQueuePolicy* for the +single queue can be set with *default_queue_policy*. When +*priority_levels* is defined, each priority level can have a different +*ModelQueuePolicy* as specified by *default_queue_policy* and *priority_queue_policy*. + +The *ModelQueuePolicy* property allows a maximum queue size to be set +using the *max_queue_size*. The *timeout_action*, +*default_timeout_microseconds* and *allow_timeout_override* settings +allow the queue to be configured so that individual requests are +rejected or deferred if their time in the queue exceeds a specified +timeout. + +## Custom Batching + +You can set custom batching rules that work _in addition to_ the specified behavior of the dynamic batcher. +To do so, you would implement five functions in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) +and create a shared library. These functions are described below. + +| Function | Description| +| :-- | :-- | +| TRITONBACKEND_ModelBatchIncludeRequest | Determines whether a request should be included in the current batch | +| TRITONBACKEND_ModelBatchInitialize | Initializes a record-keeping data structure for a new batch | +| TRITONBACKEND_ModelBatchFinalize | Deallocates the record-keeping data structure after a batch is formed | +| TRITONBACKEND_ModelBatcherInitialize | Initializes a read-only data structure for use with all batches | +| TRITONBACKEND_ModelBatcherFinalize | Deallocates the read-only data structure after the model is unloaded | + +The path to the shared library can be passed into the model configuration via the parameter +`TRITON_BATCH_STRATEGY_PATH`. If not provided, the dynamic batcher will look for a custom +batching strategy named batchstrategy.so in the model version, model, and backend directories, +in that order. If found, it will load it. This lets you easily share a custom batching strategy +among all models using the same backend. + +For a tutorial of how to create and use a custom batching library, please see the +[backend examples directory](https://github.com/triton-inference-server/backend/tree/main/examples#volume-batching). + +## Sequence Batcher + +Like the dynamic batcher, the sequence batcher combines non-batched +inference requests, so that a batch is created dynamically. 
Unlike the +dynamic batcher, the sequence batcher should be used for +[stateful models](architecture.md#stateful-models) where a sequence of +inference requests must be routed to the same model instance. The +dynamically created batches are distributed to all [model +instances](#instance-groups) configured for the model. + +Sequence batching is enabled and configured independently for each +model using the *ModelSequenceBatching* property in the model +configuration. These settings control the sequence timeout as well as +configuring how Triton will send control signals to the model +indicating sequence start, end, ready and correlation ID. See +[Stateful Models](architecture.md#stateful-models) for more +information and examples. + +## Iterative Sequences + +> [!NOTE] +> Iterative sequences are *provisional* and likely to change in future versions. +The sequence batcher supports stateful execution of "iterative +sequences" where a single request is processed over a number of +scheduling iterations. "Iterative sequences" enable the scheduler to +batch multiple inflight requests at each step and allow the model or +backend to complete a request at any iteration. + +For models and backends that support "iterative sequences", users can +enable support in the sequence batcher by specifying: + +``` + sequence_batching { + iterative_sequence: true + } +``` + +An "iterative sequence" refers to stateful models that iteratively +process a single request until a complete response is generated. When +iterative sequence is enabled, the sequence scheduler will expect a +single incoming request to initiate the sequence. Backends that +support iterative sequences can then yield back to the sequence +batcher to reschedule the request for further execution in a future +batch. + +Because only one request is used to represent the "iterative +sequence", the user doesn't need to set [control +inputs](architecture.md#control-inputs) mentioned in the previous +section. They will be filled internally by the scheduler. + +"Iterative sequences" can be [decoupled](architecture.md#decoupled) where more than +one response can be generated during execution or non-decoupled where +a single response is generated when the full response is complete. + +The main advantage of "iterative sequences" is the ability to use +Triton's native batching capabilities to form batches of requests at +different iteration stages without having to maintain additional state +in the backend. Typically batches executed by backends are completed +in the same execution which can waste resources if the execution of +one of the requests in the batch takes much longer than the rest. With +"iterative sequences", processing for each request in a batch can be +broken down into multiple iterations and a backend can start +processing new requests as soon as any request is complete. + +### Continuous/Inflight Batching with Iterative Sequences + +Continuous batching, iteration level batching, and inflight batching +are terms used in large language model (LLM) inferencing to describe +batching strategies that form batches of requests at each iteration +step. By forming batches "continuously" inference servers can increase +throughput by reusing batch slots as soon as they are free without +waiting for all requests in a batch to complete. + +As the number of steps required to process a request can vary +significantly, batching existing requests and new requests continuously +can have a significant improvement on throughput and latency. 
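
As a rough, hypothetical illustration of this batch-slot reuse (this is a toy
calculation, not Triton code; the per-request step counts and batch size are
made up), the snippet below compares how many scheduling steps it takes to
drain a queue when the whole batch is held until its longest request finishes
versus when a freed slot is immediately handed to the next request:

```python
import heapq

# Number of generation steps each request needs (made-up values).
step_counts = [3, 50, 4, 45, 6, 40, 5, 2]
batch_size = 4


def static_batching(steps, batch_size):
    # The whole batch is held until its longest request finishes,
    # so each group of `batch_size` requests costs max(steps in group).
    return sum(
        max(steps[i : i + batch_size]) for i in range(0, len(steps), batch_size)
    )


def continuous_batching(steps, batch_size):
    # A slot is handed to the next queued request as soon as it frees up.
    slots = [0] * batch_size  # step index at which each slot becomes free
    heapq.heapify(slots)
    for n in steps:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + n)
    return max(slots)


print("static batching:    ", static_batching(step_counts, batch_size))      # 90 steps
print("continuous batching:", continuous_batching(step_counts, batch_size))  # 50 steps
```

Even in this tiny example, reusing freed slots roughly halves the number of
steps needed to drain the queue, which is the effect continuous/inflight
batching exploits at scale.
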
+ +To achieve inflight batching with iterative sequences, the backend +should break request processing into a number of steps, where each +step corresponds to one Triton model instance execution. At the end of +each step, the model instance will release requests that have been +completed and reschedule requests that are still inflight. Triton will +then form and schedule the next batch of requests that mixes new and +rescheduled requests. \ No newline at end of file diff --git a/docs/user_guide/bls.md b/docs/user_guide/bls.md new file mode 100644 index 0000000000..a0c0eee87f --- /dev/null +++ b/docs/user_guide/bls.md @@ -0,0 +1,415 @@ + + +# Business Logic Scripting + +Triton's +[ensemble](ensemble_models.md#ensemble-models) +feature supports many use cases where multiple models are composed into a +pipeline (or more generally a DAG, directed acyclic graph). However, there are +many other use cases that are not supported because as part of the model +pipeline they require loops, conditionals (if-then-else), data-dependent +control-flow and other custom logic to be intermixed with model execution. We +call this combination of custom logic and model executions *Business Logic +Scripting (BLS)*. + +Starting from 21.08, you can implement BLS in your Python model. A new set of +utility functions allows you to execute inference requests on other models +being served by Triton as a part of executing your Python model. Note that BLS +should only be used inside the `execute` function and is not supported +in the `initialize` or `finalize` methods. Example below shows how to use this +feature: + +```python +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + ... + def execute(self, requests): + ... + # Create an InferenceRequest object. `model_name`, + # `requested_output_names`, and `inputs` are the required arguments and + # must be provided when constructing an InferenceRequest object. Make + # sure to replace `inputs` argument with a list of `pb_utils.Tensor` + # objects. + inference_request = pb_utils.InferenceRequest( + model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[]) + + # `pb_utils.InferenceRequest` supports request_id, correlation_id, + # model version, timeout and preferred_memory in addition to the + # arguments described above. + # Note: Starting from the 24.03 release, the `correlation_id` parameter + # supports both string and unsigned integer values. + # These arguments are optional. An example containing all the arguments: + # inference_request = pb_utils.InferenceRequest(model_name='model_name', + # requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + # inputs=[], + # request_id="1", correlation_id=4, model_version=1, flags=0, timeout=5, + # preferred_memory=pb_utils.PreferredMemory( + # pb_utils.TRITONSERVER_MEMORY_GPU, # or pb_utils.TRITONSERVER_MEMORY_CPU + # 0)) + + # Execute the inference_request and wait for the response + inference_response = inference_request.exec() + + # Check if the inference response has an error + if inference_response.has_error(): + raise pb_utils.TritonModelException( + inference_response.error().message()) + else: + # Extract the output tensors from the inference response. + output1 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_1') + output2 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_2') + + # Decide the next steps for model execution based on the received + # output tensors. 
It is possible to use the same output tensors + # to for the final inference response too. +``` + + +In addition to the `inference_request.exec` function that allows you to +execute blocking inference requests, `inference_request.async_exec` allows +you to perform async inference requests. This can be useful when you do not +need the result of the inference immediately. Using `async_exec` function, it +is possible to have multiple inflight inference requests and wait for the +responses only when needed. Example below shows how to use `async_exec`: + +```python +import triton_python_backend_utils as pb_utils +import asyncio + + +class TritonPythonModel: + ... + + # You must add the Python 'async' keyword to the beginning of `execute` + # function if you want to use `async_exec` function. + async def execute(self, requests): + ... + # Create an InferenceRequest object. `model_name`, + # `requested_output_names`, and `inputs` are the required arguments and + # must be provided when constructing an InferenceRequest object. Make + # sure to replace `inputs` argument with a list of `pb_utils.Tensor` + # objects. + inference_request = pb_utils.InferenceRequest( + model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[]) + + infer_response_awaits = [] + for i in range(4): + # async_exec function returns an + # [Awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) + # object. + infer_response_awaits.append(inference_request.async_exec()) + + # Wait for all of the inference requests to complete. + infer_responses = await asyncio.gather(*infer_response_awaits) + + for infer_response in infer_responses: + # Check if the inference response has an error + if inference_response.has_error(): + raise pb_utils.TritonModelException( + inference_response.error().message()) + else: + # Extract the output tensors from the inference response. + output1 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_1') + output2 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_2') + + # Decide the next steps for model execution based on the received + # output tensors. +``` + +A complete example for sync and async BLS in Python backend is included in the +[Examples](../python_backend/README.md#examples) section. + +## Using BLS with Decoupled Models + +Starting from 23.03 release, you can execute inference requests on decoupled +models in both [default mode](../python_backend/README.md#default-mode) and +[decoupled mode](../python_backend/README.md#decoupled-mode). By setting the `decoupled` parameter to +`True`, the `exec` and `async_exec` function will return an +[iterator](https://docs.python.org/3/glossary.html#term-iterator) of +inference responses returned by a decoupled model. If the `decoupled` parameter +is set to `False`, the `exec` and `async_exec` function will return a single +response as shown in the example above. Besides, you can set the timeout via +the parameter 'timeout' in microseconds within the constructor of +`InferenceRequest`. If the request times out, the request will respond with an +error. The default of 'timeout' is 0 which indicates that the request has no +timeout. + +Additionally, starting from the 23.04 release, you have the flexibility to +select a specific device to receive output tensors from BLS calls. This +can be achieved by setting the optional `preferred_memory` parameter within the +`InferenceRequest` constructor. 
To do this, you can create a `PreferredMemory` +object and specify the `preferred_memory_type` as either +`TRITONSERVER_MEMORY_GPU` or `TRITONSERVER_MEMORY_CPU`, as well as the +`preferred_device_id` as an integer to indicate the memory type and device ID +on which you wish to receive output tensors. If you do not specify the +`preferred_memory` parameter, the output tensors will be allocated on the +same device where the output tensors were received from the model to which the +BLS call is made. + +Example below shows how to use this feature: + +```python +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + ... + def execute(self, requests): + ... + # Create an InferenceRequest object. `model_name`, + # `requested_output_names`, and `inputs` are the required arguments and + # must be provided when constructing an InferenceRequest object. Make + # sure to replace `inputs` argument with a list of `pb_utils.Tensor` + # objects. + inference_request = pb_utils.InferenceRequest( + model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[]) + + # `pb_utils.InferenceRequest` supports request_id, correlation_id, + # model version, timeout and preferred_memory in addition to the + # arguments described above. + # Note: Starting from the 24.03 release, the `correlation_id` parameter + # supports both string and unsigned integer values. + # These arguments are optional. An example containing all the arguments: + # inference_request = pb_utils.InferenceRequest(model_name='model_name', + # requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + # inputs=[], + # request_id="1", correlation_id="ex-4", model_version=1, flags=0, timeout=5, + # preferred_memory=pb_utils.PreferredMemory( + # pb_utils.TRITONSERVER_MEMORY_GPU, # or pb_utils.TRITONSERVER_MEMORY_CPU + # 0)) + + # Execute the inference_request and wait for the response. Here we are + # running a BLS request on a decoupled model, hence setting the parameter + # 'decoupled' to 'True'. + inference_responses = inference_request.exec(decoupled=True) + + for inference_response in inference_responses: + # Check if the inference response has an error + if inference_response.has_error(): + raise pb_utils.TritonModelException( + inference_response.error().message()) + + # For some models, it is possible that the last response is empty + if len(infer_response.output_tensors()) > 0: + # Extract the output tensors from the inference response. + output1 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_1') + output2 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_2') + + # Decide the next steps for model execution based on the received + # output tensors. It is possible to use the same output tensors to + # for the final inference response too. +``` + + +In addition to the `inference_request.exec(decoupled=True)` function that +allows you to execute blocking inference requests on decoupled models, +`inference_request.async_exec(decoupled=True)` allows you to perform async +inference requests. This can be useful when you do not need the result of the +inference immediately. Using `async_exec` function, it is possible to have +multiple inflight inference requests and wait for the responses only when +needed. Example below shows how to use `async_exec`: + +```python +import triton_python_backend_utils as pb_utils +import asyncio + + +class TritonPythonModel: + ... 
+ + # You must add the Python 'async' keyword to the beginning of `execute` + # function if you want to use `async_exec` function. + async def execute(self, requests): + ... + # Create an InferenceRequest object. `model_name`, + # `requested_output_names`, and `inputs` are the required arguments and + # must be provided when constructing an InferenceRequest object. Make + # sure to replace `inputs` argument with a list of `pb_utils.Tensor` + # objects. + inference_request = pb_utils.InferenceRequest( + model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[]) + + infer_response_awaits = [] + for i in range(4): + # async_exec function returns an + # [Awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables) + # object. + infer_response_awaits.append( + inference_request.async_exec(decoupled=True)) + + # Wait for all of the inference requests to complete. + async_responses = await asyncio.gather(*infer_response_awaits) + + for infer_responses in async_responses: + for infer_response in infer_responses: + # Check if the inference response has an error + if inference_response.has_error(): + raise pb_utils.TritonModelException( + inference_response.error().message()) + + # For some models, it is possible that the last response is empty + if len(infer_response.output_tensors()) > 0: + # Extract the output tensors from the inference response. + output1 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_1') + output2 = pb_utils.get_output_tensor_by_name( + inference_response, 'REQUESTED_OUTPUT_2') + + # Decide the next steps for model execution based on the received + # output tensors. +``` + +A complete example for sync and async BLS for decoupled models is included in +the [Examples](../python_backend/README.md#examples) section. + +Starting from the 22.04 release, the lifetime of the BLS output tensors have +been improved such that if a tensor is no longer needed in your Python model it +will be automatically deallocated. This can increase the number of BLS requests +that you can execute in your model without running into the out of GPU or +shared memory error. + +Note: Async BLS is not supported on Python 3.6 or lower due to the `async` +keyword and `asyncio.run` being introduced in Python 3.7. + +## Model Loading API + +Starting from 23.07 release, you can use the model loading API to load models +required by your BLS model. The model loading API is equivalent to the Triton C +API for loading models which are documented in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). +Below is an example of how to use the model loading API: + +```python +import triton_python_backend_utils as pb_utils + +class TritonPythonModel: + def initialize(self, args): + self.model_name="onnx_model" + # Check if the model is ready, and load the model if it is not ready. + # You can specify the model version in string format. The version is + # optional, and if not provided, the server will choose a version based + # on the model and internal policy. + if not pb_utils.is_model_ready(model_name=self.model_name, + model_version="1"): + # Load the model from the model repository + pb_utils.load_model(model_name=self.model_name) + + # Load the model with an optional override model config in JSON + # representation. If provided, this config will be used for + # loading the model. 
+ config = "{\"backend\":\"onnxruntime\", \"version_policy\":{\"specific\":{\"versions\":[1]}}}" + pb_utils.load_model(model_name=self.model_name, config=config) + + # Load the mode with optional override files. The override files are + # specified as a dictionary where the key is the file path (with + # "file:" prefix) and the value is the file content as bytes. The + # files will form the model directory that the model will be loaded + # from. If specified, 'config' must be provided to be the model + # configuration of the override model directory. + with open('models/onnx_int32_int32_int32/1/model.onnx', 'rb') as file: + data = file.read() + files = {"file:1/model.onnx": data} + pb_utils.load_model(model_name=self.model_name, + config=config, files=files) + + def execute(self, requests): + # Execute the model + ... + # If the model is no longer needed, you can unload it. You can also + # specify whether the dependents of the model should also be unloaded by + # setting the 'unload_dependents' parameter to True. The default value + # is False. Need to be careful when unloading the model as it can affect + # other model instances or other models that depend on it. + pb_utils.unload_model(model_name=self.model_name, + unload_dependents=True) + +``` + +Note that the model loading API is only supported if the server is running in +[explicit model control mode](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit). +Additionally, the model loading API should only be used after the server has +been running, which means that the BLS model should not be loaded during server +startup. You can use different +[client endpoints](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md) +to load the model after the server has been started. The model loading API is +currently not supported during the `auto_complete_config` and `finalize` +functions. + +## Using BLS with Stateful Models + +[Stateful models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#stateful-models) +require setting additional flags in the inference request to indicate the +start and end of a sequence. The `flags` argument in the `pb_utils.InferenceRequest` +object can be used to indicate whether the request is the first or last request +in the sequence. An example indicating that the request is starting the +sequence: + +```python +inference_request = pb_utils.InferenceRequest(model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[], + request_id="1", correlation_id=4, + flags=pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START) +``` + +For indicating the ending of the sequence you can use the +`pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END` flag. If the request is both +starting and ending a sequence at the same time (i.e. the sequence has only a +single request), you can use the bitwise OR operator to enable both of the +flags: + +``` +flags = pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START | pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END +``` + +## Limitation + +- You need to make sure that the inference requests performed as a part of your +model do not create a circular dependency. For example, if model A performs an +inference request on itself and there are no more model instances ready to +execute the inference request, the model will block on the inference execution +forever. 
+ +- Async BLS is not supported when running a Python model in decoupled mode. \ No newline at end of file diff --git a/docs/user_guide/ensemble_models.md b/docs/user_guide/ensemble_models.md new file mode 100644 index 0000000000..4012ec60c7 --- /dev/null +++ b/docs/user_guide/ensemble_models.md @@ -0,0 +1,196 @@ + + +# Ensemble Models + +An ensemble model represents a *pipeline* of one or more models and +the connection of input and output tensors between those +models. Ensemble models are intended to be used to encapsulate a +procedure that involves multiple models, such as "data preprocessing +-> inference -> data postprocessing". Using ensemble models for this +purpose can avoid the overhead of transferring intermediate tensors +and minimize the number of requests that must be sent to Triton. + +The ensemble scheduler must be used for ensemble models, regardless of +the scheduler used by the models within the ensemble. With respect to +the ensemble scheduler, an *ensemble* model is not an actual +model. Instead, it specifies the dataflow between models within the +ensemble as *ModelEnsembling::Step* entries in the model +configuration. The scheduler collects the output tensors in each step, +provides them as input tensors for other steps according to the +specification. In spite of that, the ensemble model is still viewed as +a single model from an external view. + +Note that the ensemble models will inherit the characteristics of the +models involved, so the meta-data in the request header must comply +with the models within the ensemble. For instance, if one of the +models is stateful model, then the inference request for the ensemble +model should contain the information mentioned in [Stateful +Models](architecture.md#stateful-models), which will be provided to the stateful +model by the scheduler. + +As an example consider an ensemble model for image classification and +segmentation that has the following model configuration: + +``` +name: "ensemble_model" +platform: "ensemble" +max_batch_size: 1 +input [ + { + name: "IMAGE" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "CLASSIFICATION" + data_type: TYPE_FP32 + dims: [ 1000 ] + }, + { + name: "SEGMENTATION" + data_type: TYPE_FP32 + dims: [ 3, 224, 224 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "image_preprocess_model" + model_version: -1 + input_map { + key: "RAW_IMAGE" + value: "IMAGE" + } + output_map { + key: "PREPROCESSED_OUTPUT" + value: "preprocessed_image" + } + }, + { + model_name: "classification_model" + model_version: -1 + input_map { + key: "FORMATTED_IMAGE" + value: "preprocessed_image" + } + output_map { + key: "CLASSIFICATION_OUTPUT" + value: "CLASSIFICATION" + } + }, + { + model_name: "segmentation_model" + model_version: -1 + input_map { + key: "FORMATTED_IMAGE" + value: "preprocessed_image" + } + output_map { + key: "SEGMENTATION_OUTPUT" + value: "SEGMENTATION" + } + } + ] +} +``` + +The ensemble\_scheduling section indicates that the ensemble scheduler will be +used and that the ensemble model consists of three different models. Each +element in step section specifies the model to be used and how the inputs and +outputs of the model are mapped to tensor names recognized by the scheduler. 
For
+example, the first element in step specifies that the latest version of
+image\_preprocess\_model should be used, that the content of its input
+"RAW\_IMAGE" is provided by the "IMAGE" tensor, and that the content of its
+output "PREPROCESSED\_OUTPUT" will be mapped to the "preprocessed\_image"
+tensor for later use. The tensor names recognized by the scheduler are the
+ensemble inputs, the ensemble outputs, and all values in the input\_map and
+the output\_map.
+
+The models composing the ensemble may also have dynamic batching
+enabled. Since ensemble models only route data between the composing
+models, Triton can accept requests for an ensemble model and exploit the
+dynamic batching of the composing models without modifying the ensemble's
+configuration.
+
+Assuming that only the ensemble model, the preprocess model, the classification
+model, and the segmentation model are being served, client applications will
+see them as four different models which can process requests independently.
+However, the ensemble scheduler will view the ensemble model as the following.
+
+![Ensemble Example](images/ensemble_example0.png)
+
+When an inference request for the ensemble model is received, the ensemble
+scheduler will:
+
+1. Recognize that the "IMAGE" tensor in the request is mapped to input
+   "RAW\_IMAGE" in the preprocess model.
+
+2. Check models within the ensemble and send an internal request to the
+   preprocess model because all of its required input tensors are ready.
+
+3. Recognize the completion of the internal request, collect the output
+   tensor, and map the content to "preprocessed\_image", which is a unique
+   name known within the ensemble.
+
+4. Map the newly collected tensor to the inputs of the models within the
+   ensemble. In this case, the inputs of "classification\_model" and
+   "segmentation\_model" will be mapped and marked as ready.
+
+5. Check models that require the newly collected tensor and send internal
+   requests to models whose inputs are ready, in this case the classification
+   model and the segmentation model. Note that the responses will arrive in
+   arbitrary order depending on the load and computation time of the
+   individual models.
+
+6. Repeat steps 3-5 until no more internal requests should be sent, and then
+   respond to the inference request with the tensors mapped to the ensemble
+   output names.
+
+Unlike other models, ensemble models do not support the "instance_group" field
+in the model configuration. The reason is that the ensemble scheduler itself
+is mainly an event-driven scheduler with very minimal overhead, so it is
+almost never the bottleneck of the pipeline. The composing models
+within the ensemble can be individually scaled up or down with their
+respective `instance_group` settings. To optimize your model pipeline
+performance, you can use
+[Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
+to find the optimal model configurations.
+
+## Additional Resources
+
+You can find additional end-to-end ensemble examples in the links below:
+* [This guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
+explores the concept of ensembles with a running example.
+* [Preprocessing in Python Backend Using + Ensemble](https://github.com/triton-inference-server/python_backend#preprocessing) +* [Accelerating Inference with NVIDIA Triton Inference Server and NVIDIA + DALI](https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/) +* [Using RAPIDS AI with NVIDIA Triton Inference + Server](https://github.com/rapidsai/rapids-examples/tree/main/rapids_triton_example) \ No newline at end of file diff --git a/docs/user_guide/implicit_state_management.md b/docs/user_guide/implicit_state_management.md new file mode 100644 index 0000000000..3ab323962c --- /dev/null +++ b/docs/user_guide/implicit_state_management.md @@ -0,0 +1,419 @@ + + +# Implicit State Management + +Implicit state management allows a stateful model to store its state inside +Triton. When using implicit state, the stateful model does not need to store +the state required for inference inside the model. + +Below is a portion of the model configuration that indicates the model +is using implicit state. + +``` +sequence_batching { + state [ + { + input_name: "INPUT_STATE" + output_name: "OUTPUT_STATE" + data_type: TYPE_INT32 + dims: [ -1 ] + } + ] +} +``` + +The *state* section in the sequence_batching setting is used to indicate that +the model is using implicit state. The *input_name* field specifies the name of +the input tensor that will contain the input state. The *output_name* field +describes the name of the output tensor produced by the model that contains +output state. The output state provided by the model in the *ith* +request in the sequence will be used as the input state in the +*i+1th* request. The *dims* field specifies the dimensions of the +state tensors. When the *dims* field contains variable-sized dimensions, the +shape of the input state and output state does not have to match. + +For debugging purposes, the client can request the output state. In order to +allow the client to request the output state, the +[*output* section of the model configuration](./model_configuration.md#inputs-and-outputs) +must list the output state as one of the model outputs. Note that requesting the +output state from the client can increase the request latency because of the +additional tensors that have to be transferred. + +Implicit state management requires backend support. Currently, only +[onnxruntime_backend](https://github.com/triton-inference-server/onnxruntime_backend) +[tensorrt_backend](https://github.com/triton-inference-server/tensorrt_backend), +and [pytorch_backend](https://github.com/triton-inference-server/pytorch_backend) +support implicit state. + +## State Initialization + +By default, the starting request in the sequence contains uninitialized data for +the input state. The model can use the start flag in the request to detect the +beginning of a new sequence and initialize the model state by providing the +initial state in the model output. If the *dims* section in the *state* +description of the model contains variable-sized dimensions, Triton will use *1* +for every variable-sized dimension for the starting request. For other +non-starting requests in the sequence, the input state is the output state of +the previous request in the sequence. For an example ONNX model that uses +implicit state you can refer to this onnx model generated from the +`create_onnx_modelfile_wo_initial_state()` +[from this generation script](https://github.com/triton-inference-server/server/blob/main/qa/common/gen_qa_implicit_models.py). 
+This is a simple accumulator model that stores the partial sum of the requests +in a sequence in Triton using implicit state. For state initialization, if the +request is starting, the model sets the "OUTPUT\_STATE" to be equal to the +"INPUT" tensor. For non-starting requests, it sets the "OUTPUT\_STATE" tensor +to the sum of "INPUT" and "INPUT\_STATE" tensors. + +In addition to the default state initialization discussed above, Triton provides +two other mechanisms for initializing state. + +### Initializing State from Zero. + +Below is an example of initializing state from zero. + +``` +sequence_batching { + state [ + { + input_name: "INPUT_STATE" + output_name: "OUTPUT_STATE" + data_type: TYPE_INT32 + dims: [ -1 ] + initial_state: { + data_type: TYPE_INT32 + dims: [ 1 ] + zero_data: true + name: "initial state" + } + } + ] +} +``` + +Note that in the example above variable dimensions in the state description are +converted to fixed size dimensions. + +### Initializing State from File + +For initializing state from file, you need to create a directory named +"initial\_state" under the model directory. The file that contains the initial +state under this directory needs to be provided in the *data_file* field. +The data stored in this file will be used in row-major order as the initial +state. Below is an example state description initializing state from file. + +``` +sequence_batching { + state [ + { + input_name: "INPUT_STATE" + output_name: "OUTPUT_STATE" + data_type: TYPE_INT32 + dims: [ -1 ] + initial_state: { + data_type: TYPE_INT32 + dims: [ 1 ] + data_file: "initial_state_data" + name: "initial state" + } + } + ] +} +``` + +## Scheduling Strategies + +The sequence batcher can employ one of two scheduling strategies when +deciding how to batch the sequences that are routed to the same model +instance. These strategies are [direct](#direct) and [oldest](#oldest). + +### Direct + +With the Direct scheduling strategy the sequence batcher ensures not +only that all inference requests in a sequence are routed to the same +model instance, but also that each sequence is routed to a dedicated +batch slot within the model instance. This strategy is required when +the model maintains state for each batch slot, and is expecting all +inference requests for a given sequence to be routed to the same slot +so that the state is correctly updated. + +As an example of the sequence batcher using the Direct scheduling +strategy, assume a TensorRT stateful model that has the following +model configuration. + +``` +name: "direct_stateful_model" +platform: "tensorrt_plan" +max_batch_size: 2 +sequence_batching { + max_sequence_idle_microseconds: 5000000 + direct { } + control_input [ + { + name: "START" + control [ + { + kind: CONTROL_SEQUENCE_START + fp32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "READY" + control [ + { + kind: CONTROL_SEQUENCE_READY + fp32_false_true: [ 0, 1 ] + } + ] + } + ] +} +input [ + { + name: "INPUT" + data_type: TYPE_FP32 + dims: [ 100, 100 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_FP32 + dims: [ 10 ] + } +] +instance_group [ + { + count: 2 + } +] +``` + +The sequence_batching section indicates that the model should use the +sequence batcher and the Direct scheduling strategy. In this example +the model only requires a *start* and *ready* control input from the +sequence batcher so only those controls are listed. 
The instance_group +indicates two instances of the model should be instantiated and +max_batch_size indicates that each of those instances should perform +batch-size 2 inferences. The following figure shows a representation +of the sequence batcher and the inference resources specified by this +configuration. + +![Sequence Batching Example](images/sequence_example0.png) + +Each model instance is maintaining state for each batch slot, and is +expecting all inference requests for a given sequence to be routed to +the same slot so that the state is correctly updated. For this example +that means that Triton can simultaneously perform inference for up to +four sequences. + +Using the Direct scheduling strategy, the sequence batcher: + +* Recognizes when an inference request starts a new sequence and + allocates a batch slot for that sequence. If no batch slot is + available for the new sequence, Triton places the inference request + in a backlog. + +* Recognizes when an inference request is part of a sequence that has + an allocated batch slot and routes the request to that slot. + +* Recognizes when an inference request is part of a sequence that is + in the backlog and places the request in the backlog. + +* Recognizes when the last inference request in a sequence has been + completed. The batch slot occupied by that sequence is immediately + reallocated to a sequence in the backlog, or freed for a future + sequence if there is no backlog. + +The following figure shows how multiple sequences are scheduled onto +the model instances using the Direct scheduling strategy. On the left +the figure shows several sequences of requests arriving at +Triton. Each sequence could be made up of any number of inference +requests and those individual inference requests could arrive in any +order relative to inference requests in other sequences, except that +the execution order shown on the right assumes that the first +inference request of sequence 0 arrives before any inference request +in sequences 1-5, the first inference request of sequence 1 arrives +before any inference request in sequences 2-5, etc. + +The right of the figure shows how the inference request sequences are +scheduled onto the model instances over time. + +![Sequence Batcher Example](images/sequence_example1.png) + +The following figure shows the sequence batcher uses the control input +tensors to communicate with the model. The figure shows two sequences +assigned to the two batch slots in a model instance. Inference +requests for each sequence arrive over time. The START and READY rows +show the input tensor values used for each execution of the +model. Over time the following happens: + +* The first request arrives for the sequence in slot0. Assuming the + model instance is not already executing an inference, the sequence + scheduler immediately schedules the model instance to execute + because an inference request is available. + +* This is the first request in the sequence so the corresponding + element in the START tensor is set to 1. There is no request + available in slot1 so the READY tensor shows only slot0 as ready. + +* After the inference completes the sequence scheduler sees that there + are no requests available in any batch slot and so the model + instance sits idle. + +* Next, two inference requests arrive close together in time so that + the sequence scheduler sees them both available in their respective + batch slots. 
The scheduler immediately schedules the model instance + to perform a batch-size 2 inference and uses START and READY to show + that both slots have an inference request available but that only + slot1 is the start of a new sequence. + +* The processing continues in a similar manner for the other inference + requests. + +![Sequence Batcher Example](images/sequence_example2.png) + +### Oldest + +With the Oldest scheduling strategy the sequence batcher ensures that +all inference requests in a sequence are routed to the same model +instance and then uses the [dynamic +batcher](batcher.md#dynamic-batcher) to batch together +multiple inferences from different sequences into a batch that +inferences together. With this strategy the model must typically use +the CONTROL_SEQUENCE_CORRID control so that it knows which sequence +each inference request in the batch belongs to. The +CONTROL_SEQUENCE_READY control is typically not needed because all +inferences in the batch will always be ready for inference. + +As an example of the sequence batcher using the Oldest scheduling +strategy, assume a stateful model that has the following model +configuration: + +``` +name: "oldest_stateful_model" +platform: "tensorflow_savedmodel" +max_batch_size: 2 +sequence_batching { + max_sequence_idle_microseconds: 5000000 + oldest + { + max_candidate_sequences: 4 + } + control_input [ + { + name: "START" + control [ + { + kind: CONTROL_SEQUENCE_START + fp32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "END" + control [ + { + kind: CONTROL_SEQUENCE_END + fp32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "CORRID" + control [ + { + kind: CONTROL_SEQUENCE_CORRID + data_type: TYPE_UINT64 + } + ] + } + ] +} +input [ + { + name: "INPUT" + data_type: TYPE_FP32 + dims: [ 100, 100 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_FP32 + dims: [ 10 ] + } +] +``` + +The sequence_batching section indicates that the model should use the +sequence batcher and the Oldest scheduling strategy. The Oldest +strategy is configured so that the sequence batcher maintains up to 4 +active candidate sequences from which it prefers to form dynamic +batches of size 2. In this example the model requires a *start*, +*end*, and *correlation ID* control input from the sequence +batcher. The following figure shows a representation of the sequence +batcher and the inference resources specified by this configuration. + +![Sequence Batching Example](images/dyna_sequence_example0.png) + +Using the Oldest scheduling strategy, the sequence batcher: + +* Recognizes when an inference request starts a new sequence and + attempts to find a model instance that has room for a candidate + sequence. If no model instance has room for a new candidate + sequence, Triton places the inference request in a backlog. + +* Recognizes when an inference request is part of a sequence that is + already a candidate sequence in some model instance and routes the + request to that model instance. + +* Recognizes when an inference request is part of a sequence that is + in the backlog and places the request in the backlog. + +* Recognizes when the last inference request in a sequence has been + completed. The model instance immediately removes a sequence from + the backlog and makes it a candidate sequence in the model instance, + or records that the model instance can handle a future sequence if + there is no backlog. + +The following figure shows how multiple sequences are scheduled onto +the model instance specified by the above example configuration. 
On +the left the figure shows four sequences of requests arriving at +Triton. Each sequence is composed of multiple inference requests as +shown in the figure. The center of the figure shows how the inference +request sequences are batched onto the model instance over time, +assuming that the inference requests for each sequence arrive at the +same rate with sequence A arriving just before B, which arrives just +before C, etc. The Oldest strategy forms a dynamic batch from the +oldest requests but never includes more than one request from a given +sequence in a batch (for example, the last two inferences in sequence +D are not batched together). + +![Sequence Batcher Example](images/dyna_sequence_example1.png) \ No newline at end of file diff --git a/docs/user_guide/model_configuration.md b/docs/user_guide/model_configuration.md index 1b0e64a533..c549e4a297 100644 --- a/docs/user_guide/model_configuration.md +++ b/docs/user_guide/model_configuration.md @@ -872,307 +872,6 @@ cc_model_filenames [ ] ``` -## Scheduling And Batching - -Triton supports batch inferencing by allowing individual inference -requests to specify a batch of inputs. The inferencing for a batch of -inputs is performed at the same time which is especially important for -GPUs since it can greatly increase inferencing throughput. In many use -cases the individual inference requests are not batched, therefore, -they do not benefit from the throughput benefits of batching. - -The inference server contains multiple scheduling and batching -algorithms that support many different model types and use-cases. More -information about model types and schedulers can be found in [Models -And Schedulers](architecture.md#models-and-schedulers). - -### Default Scheduler - -The default scheduler is used for a model if none of the -*scheduling_choice* properties are specified in the model -configuration. The default scheduler simply distributes inference -requests to all [model instances](#instance-groups) configured for the -model. - -### Dynamic Batcher - -Dynamic batching is a feature of Triton that allows inference requests -to be combined by the server, so that a batch is created -dynamically. Creating a batch of requests typically results in -increased throughput. The dynamic batcher should be used for -[stateless models](architecture.md#stateless-models). The dynamically created -batches are distributed to all [model instances](#instance-groups) -configured for the model. - -Dynamic batching is enabled and configured independently for each -model using the *ModelDynamicBatching* property in the model -configuration. These settings control the preferred size(s) of the -dynamically created batches, the maximum time that requests can be -delayed in the scheduler to allow other requests to join the dynamic -batch, and queue properties such a queue size, priorities, and -time-outs. Refer to -[this guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization#what-is-dynamic-batching) -for a more detailed example of dynamic batching. - -#### Recommended Configuration Process - -The individual settings are described in detail below. The following -steps are the recommended process for tuning the dynamic batcher for -each model. It is also possible to use the [Model -Analyzer](model_analyzer.md) to automatically search across different -dynamic batcher configurations. - -* Decide on a [maximum batch size](#maximum-batch-size) for the model. 
- -* Add the following to the model configuration to enable the dynamic - batcher with all default settings. By default the dynamic batcher - will create batches as large as possible up to the maximum batch - size and will not [delay](#delayed-batching) when forming batches. - -``` - dynamic_batching { } -``` - -* Use the - [Performance Analyzer](https://github.com/triton-inference-server/perf_analyzer/blob/main/README.md) - to determine the latency and throughput provided by the default dynamic - batcher configuration. - -* If the default configuration results in latency values that are - within your latency budget, try one or both of the following to - trade off increased latency for increased throughput: - - * Increase maximum batch size. - - * Set [batch delay](#delayed-batching) to a non-zero value. Try - increasing delay values until the latency budget is exceeded to - see the impact on throughput. - -* [Preferred batch sizes](#preferred-batch-sizes) should not be used - for most models. A preferred batch size(s) should only be configured - if that batch size results in significantly higher performance than - other batch sizes. - -#### Preferred Batch Sizes - -The *preferred_batch_size* property indicates the batch sizes that the -dynamic batcher should attempt to create. For most models, -*preferred_batch_size* should not be specified, as described in -[Recommended Configuration -Process](#recommended-configuration-process). An exception is TensorRT -models that specify multiple optimization profiles for different batch -sizes. In this case, because some optimization profiles may give -significant performance improvement compared to others, it may make -sense to use *preferred_batch_size* for the batch sizes supported by -those higher-performance optimization profiles. - -The following example shows the configuration that enables dynamic -batching with preferred batch sizes of 4 and 8. - -``` - dynamic_batching { - preferred_batch_size: [ 4, 8 ] - } -``` - -When a model instance becomes available for inferencing, the dynamic -batcher will attempt to create batches from the requests that are -available in the scheduler. Requests are added to the batch in the -order the requests were received. If the dynamic batcher can form a -batch of a preferred size(s) it will create a batch of the largest -possible preferred size and send it for inferencing. If the dynamic -batcher cannot form a batch of a preferred size (or if the dynamic -batcher is not configured with any preferred batch sizes), it will -send a batch of the largest size possible that is less than the -maximum batch size allowed by the model (but see the following section -for the delay option that changes this behavior). - -The size of generated batches can be examined in aggregate using -[count metrics](metrics.md#inference-request-metrics). - -#### Delayed Batching - -The dynamic batcher can be configured to allow requests to be delayed -for a limited time in the scheduler to allow other requests to join -the dynamic batch. For example, the following configuration sets the -maximum delay time of 100 microseconds for a request. - -``` - dynamic_batching { - max_queue_delay_microseconds: 100 - } -``` - -The *max_queue_delay_microseconds* property setting changes the -dynamic batcher behavior when a maximum size (or preferred size) batch -cannot be created. 
When a batch of a maximum or preferred size cannot -be created from the available requests, the dynamic batcher will delay -sending the batch as long as no request is delayed longer than the -configured *max_queue_delay_microseconds* value. If a new request -arrives during this delay and allows the dynamic batcher to form a -batch of a maximum or preferred batch size, then that batch is sent -immediately for inferencing. If the delay expires the dynamic batcher -sends the batch as is, even though it is not a maximum or preferred -size. - -#### Preserve Ordering - -The *preserve_ordering* property is used to force all responses to be -returned in the same order as requests were received. See the -[protobuf -documentation](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto) -for details. - -#### Priority Levels - -By default the dynamic batcher maintains a single queue that holds all -inference requests for a model. The requests are processed and batched -in order. The *priority_levels* property can be used to create -multiple priority levels within the dynamic batcher so that requests -with higher priority are allowed to bypass requests with lower -priority. Requests at the same priority level are processed in -order. Inference requests that do not set a priority are scheduled -using the *default_priority_level* property. - -#### Queue Policy - -The dynamic batcher provides several settings that control how -requests are queued for batching. - -When *priority_levels* is not defined, the *ModelQueuePolicy* for the -single queue can be set with *default_queue_policy*. When -*priority_levels* is defined, each priority level can have a different -*ModelQueuePolicy* as specified by *default_queue_policy* and *priority_queue_policy*. - -The *ModelQueuePolicy* property allows a maximum queue size to be set -using the *max_queue_size*. The *timeout_action*, -*default_timeout_microseconds* and *allow_timeout_override* settings -allow the queue to be configured so that individual requests are -rejected or deferred if their time in the queue exceeds a specified -timeout. - -#### Custom Batching - -You can set custom batching rules that work _in addition to_ the specified behavior of the dynamic batcher. -To do so, you would implement five functions in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) -and create a shared library. These functions are described below. - -| Function | Description| -| :-- | :-- | -| TRITONBACKEND_ModelBatchIncludeRequest | Determines whether a request should be included in the current batch | -| TRITONBACKEND_ModelBatchInitialize | Initializes a record-keeping data structure for a new batch | -| TRITONBACKEND_ModelBatchFinalize | Deallocates the record-keeping data structure after a batch is formed | -| TRITONBACKEND_ModelBatcherInitialize | Initializes a read-only data structure for use with all batches | -| TRITONBACKEND_ModelBatcherFinalize | Deallocates the read-only data structure after the model is unloaded | - -The path to the shared library can be passed into the model configuration via the parameter -`TRITON_BATCH_STRATEGY_PATH`. If not provided, the dynamic batcher will look for a custom -batching strategy named batchstrategy.so in the model version, model, and backend directories, -in that order. If found, it will load it. This lets you easily share a custom batching strategy -among all models using the same backend. 
- -For a tutorial of how to create and use a custom batching library, please see the -[backend examples directory](https://github.com/triton-inference-server/backend/tree/main/examples#volume-batching). - -### Sequence Batcher - -Like the dynamic batcher, the sequence batcher combines non-batched -inference requests, so that a batch is created dynamically. Unlike the -dynamic batcher, the sequence batcher should be used for -[stateful models](architecture.md#stateful-models) where a sequence of -inference requests must be routed to the same model instance. The -dynamically created batches are distributed to all [model -instances](#instance-groups) configured for the model. - -Sequence batching is enabled and configured independently for each -model using the *ModelSequenceBatching* property in the model -configuration. These settings control the sequence timeout as well as -configuring how Triton will send control signals to the model -indicating sequence start, end, ready and correlation ID. See -[Stateful Models](architecture.md#stateful-models) for more -information and examples. - -#### Iterative Sequences - -> [!NOTE] -> Iterative sequences are *provisional* and likely to change in future versions. - -The sequence batcher supports stateful execution of "iterative -sequences" where a single request is processed over a number of -scheduling iterations. "Iterative sequences" enable the scheduler to -batch multiple inflight requests at each step and allow the model or -backend to complete a request at any iteration. - -For models and backends that support "iterative sequences", users can -enable support in the sequence batcher by specifying: - -``` - sequence_batching { - iterative_sequence: true - } -``` - -An "iterative sequence" refers to stateful models that iteratively -process a single request until a complete response is generated. When -iterative sequence is enabled, the sequence scheduler will expect a -single incoming request to initiate the sequence. Backends that -support iterative sequences can then yield back to the sequence -batcher to reschedule the request for further execution in a future -batch. - -Because only one request is used to represent the "iterative -sequence", the user doesn't need to set [control -inputs](architecture.md#control-inputs) mentioned in the previous -section. They will be filled internally by the scheduler. - -"Iterative sequences" can be [decoupled](#decoupled) where more than -one response can be generated during execution or non-decoupled where -a single response is generated when the full response is complete. - -The main advantage of "iterative sequences" is the ability to use -Triton's native batching capabilities to form batches of requests at -different iteration stages without having to maintain additional state -in the backend. Typically batches executed by backends are completed -in the same execution which can waste resources if the execution of -one of the requests in the batch takes much longer than the rest. With -"iterative sequences", processing for each request in a batch can be -broken down into multiple iterations and a backend can start -processing new requests as soon as any request is complete. - -##### Continuous/Inflight Batching with Iterative Sequences - -Continuous batching, iteration level batching, and inflight batching -are terms used in large language model (LLM) inferencing to describe -batching strategies that form batches of requests at each iteration -step. 
By forming batches "continuously" inference servers can increase -throughput by reusing batch slots as soon as they are free without -waiting for all requests in a batch to complete. - -As the number of steps required to process a request can vary -significantly, batching existing requests and new requests continuously -can have a significant improvement on throughput and latency. - -To achieve inflight batching with iterative sequences, the backend -should break request processing into a number of steps, where each -step corresponds to one Triton model instance execution. At the end of -each step, the model instance will release requests that have been -completed and reschedule requests that are still inflight. Triton will -then form and schedule the next batch of requests that mixes new and -rescheduled requests. - -### Ensemble Scheduler - -The ensemble scheduler must be used for [ensemble - models](architecture.md#ensemble-models) and cannot be used for any - other type of model. - -The ensemble scheduler is enabled and configured independently for -each model using the *ModelEnsembleScheduling* property in the model -configuration. The settings describe the models that are included in -the ensemble and the flow of tensor values between the models. See -[Ensemble Models](architecture.md#ensemble-models) for more -information and examples. - ## Optimization Policy The model configuration *ModelOptimizationPolicy* property is used to diff --git a/docs/user_guide/model_execution.md b/docs/user_guide/model_execution.md new file mode 100644 index 0000000000..be89206ca7 --- /dev/null +++ b/docs/user_guide/model_execution.md @@ -0,0 +1,228 @@ + + +# Concurrent Model Execution + +The Triton architecture allows multiple models and/or multiple +instances of the same model to execute in parallel on the same +system. The system may have zero, one, or many GPUs. The following +figure shows an example with two models; model0 and model1. Assuming +Triton is not currently processing any request, when two requests +arrive simultaneously, one for each model, Triton immediately +schedules both of them onto the GPU and the GPU’s hardware scheduler +begins working on both computations in parallel. Models executing on +the system's CPU are handled similarly by Triton except that the +scheduling of the CPU threads execution each model is handled by the +system's OS. + +![Triton Mult-Model Execution Diagram](images/multi_model_exec.png) + +By default, if multiple requests for the same model arrive at the same +time, Triton will serialize their execution by scheduling only one at +a time on the GPU, as shown in the following figure. + +![Triton Mult-Model Serial Execution +Diagram](images/multi_model_serial_exec.png) + +Triton provides a [model configuration option called +instance-group](model_configuration.md#instance-groups) that allows +each model to specify how many parallel executions of that model +should be allowed. Each such enabled parallel execution is referred to +as an *instance*. By default, Triton gives each model a single +instance for each available GPU in the system. By +using the instance_group field in the model configuration, the number +of execution instances for a model can +be changed. The following figure shows model execution when model1 +is configured to allow three instances. As shown in the figure, the +first three model1 inference requests are immediately executed in +parallel. 
The fourth model1 inference request must wait until one of +the first three executions completes before beginning. + +![Triton Mult-Model Parallel Execution +Diagram](images/multi_model_parallel_exec.png) + +# Models And Schedulers + +Triton supports multiple scheduling and batching algorithms that can +be selected independently for each model. This section describes +*stateless* and *stateful* models and how Triton provides +schedulers to support those model types. For a given model, the +selection and configuration of the scheduler is done with the [model's +configuration file](model_configuration.md). + +## Stateless Models + +With respect to Triton's schedulers, a *stateless* model does not +maintain state between inference requests. Each inference performed on +a stateless model is independent of all other inferences using that +model. + +Examples of stateless models are CNNs such as image classification and +object detection. The [default +scheduler](scheduler.md#default-scheduler) or [dynamic +batcher](batcher.md#dynamic-batcher) can be used as the +scheduler for these stateless models. + +RNNs and similar models which do have internal memory can be stateless +as long as the state they maintain does not span inference +requests. For example, an RNN that iterates over all elements in a +batch is considered stateless by Triton if the internal state is not +carried between batches of inference requests. The [default +scheduler](scheduler.md#default-scheduler) can be used for +these stateless models. The [dynamic +batcher](batcher.md#dynamic-batcher) cannot be used since +the model is typically not expecting the batch to represent multiple +inference requests. + +## Stateful Models + +With respect to Triton's schedulers, a *stateful* model does maintain +state between inference requests. The model is expecting multiple +inference requests that together form a sequence of inferences that +must be routed to the same model instance so that the state being +maintained by the model is correctly updated. Moreover, the model may +require that Triton provide *control* signals indicating, for example, +the start and end of the sequence. + +The [sequence batcher](batcher.md#sequence-batcher) must +be used for these stateful models. As explained below, the sequence +batcher ensures that all inference requests in a sequence get routed +to the same model instance so that the model can maintain state +correctly. The sequence batcher also communicates with the model to +indicate when a sequence is starting, when a sequence is ending, when +a sequence has an inference request ready for execution, and the +*correlation ID* of the sequence. + +When making inference requests for a stateful model, the client +application must provide the same correlation ID to all requests in a +sequence, and must also mark the start and end of the sequence. The +correlation ID allows Triton to identify that the requests belong to +the same sequence. + +### Control Inputs + +For a stateful model to operate correctly with the sequence batcher, +the model must typically accept one or more *control* input tensors +that Triton uses to communicate with the model. The +*ModelSequenceBatching::Control* section of the [model +configuration](model_configuration.md) indicates how the model exposes +the tensors that the sequence batcher should use for these +controls. All controls are optional. Below is portion of a model +configuration that shows an example configuration for all the +available control signals. 
+ +``` +sequence_batching { + control_input [ + { + name: "START" + control [ + { + kind: CONTROL_SEQUENCE_START + fp32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "END" + control [ + { + kind: CONTROL_SEQUENCE_END + fp32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "READY" + control [ + { + kind: CONTROL_SEQUENCE_READY + fp32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "CORRID" + control [ + { + kind: CONTROL_SEQUENCE_CORRID + data_type: TYPE_UINT64 + } + ] + } + ] +} +``` + +* **Start**: The start input tensor is specified using + CONTROL_SEQUENCE_START in the configuration. The example + configuration indicates that the model has an input tensor called + START with a 32-bit floating point data-type. The sequence batcher + will define this tensor when executing an inference on the + model. The START tensor must be 1-dimensional with size equal to the + batch-size. Each element in the tensor indicates if the sequence in + the corresponding batch slot is starting or not. In the example + configuration, fp32_false_true indicates that a sequence start is + indicated by tensor element equal to 1, and non-start is indicated + by tensor element equal to 0. + +* **End**: The end input tensor is specified using + CONTROL_SEQUENCE_END in the configuration. The example configuration + indicates that the model has an input tensor called END with a + 32-bit floating point data-type. The sequence batcher will define + this tensor when executing an inference on the model. The END tensor + must be 1-dimensional with size equal to the batch-size. Each + element in the tensor indicates if the sequence in the corresponding + batch slot is ending or not. In the example configuration, + fp32_false_true indicates that a sequence end is indicated by tensor + element equal to 1, and non-end is indicated by tensor element equal + to 0. + +* **Ready**: The ready input tensor is specified using + CONTROL_SEQUENCE_READY in the configuration. The example + configuration indicates that the model has an input tensor called + READY with a 32-bit floating point data-type. The sequence batcher + will define this tensor when executing an inference on the + model. The READY tensor must be 1-dimensional with size equal to the + batch-size. Each element in the tensor indicates if the sequence in + the corresponding batch slot has an inference request ready for + inference. In the example configuration, fp32_false_true indicates + that a sequence ready is indicated by tensor element equal to 1, and + non-ready is indicated by tensor element equal to 0. + +* **Correlation ID**: The correlation ID input tensor is specified + using CONTROL_SEQUENCE_CORRID in the configuration. The example + configuration indicates that the model has an input tensor called + CORRID with a unsigned 64-bit integer data-type. The sequence + batcher will define this tensor when executing an inference on the + model. The CORRID tensor must be 1-dimensional with size equal to + the batch-size. Each element in the tensor indicates the correlation + ID of the sequence in the corresponding batch slot. 
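+
+Below is a minimal client-side sketch, not taken from the Triton examples,
+of how these controls are typically driven in practice using the Python
+`tritonclient` package (installable with `pip install tritonclient[grpc]`).
+The model name "my_stateful_model", the "INPUT"/"OUTPUT" tensor names, and
+the [ 100, 100 ] input shape are assumptions for illustration; substitute the
+names and shapes from your own model configuration. The `sequence_id`
+supplied by the client becomes the CORRID control value, while
+`sequence_start` and `sequence_end` cause Triton to populate the START and
+END controls.
+
+```python
+import numpy as np
+import tritonclient.grpc as grpcclient
+
+# Hypothetical model and tensor names; adjust to match your configuration.
+MODEL_NAME = "my_stateful_model"
+
+client = grpcclient.InferenceServerClient(url="localhost:8001")
+sequence_id = 42  # same correlation ID for every request in the sequence
+
+sequence_data = [np.random.rand(1, 100, 100).astype(np.float32)
+                 for _ in range(3)]
+for i, data in enumerate(sequence_data):
+    infer_input = grpcclient.InferInput("INPUT", list(data.shape), "FP32")
+    infer_input.set_data_from_numpy(data)
+    result = client.infer(
+        model_name=MODEL_NAME,
+        inputs=[infer_input],
+        sequence_id=sequence_id,
+        # Mark the first and last requests so the sequence batcher can fill
+        # in the START and END control tensors for the model.
+        sequence_start=(i == 0),
+        sequence_end=(i == len(sequence_data) - 1),
+    )
+    print(result.as_numpy("OUTPUT"))
+```
+
+Because all requests share the same `sequence_id`, the sequence batcher routes
+them to the same model instance, and the model sees the corresponding control
+tensor values on every execution.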
+ +### State Management for Stateful Models +[Implicit State Management](implicit_state_management.md#implicit-state-management) \ No newline at end of file diff --git a/docs/user_guide/scheduler.md b/docs/user_guide/scheduler.md new file mode 100644 index 0000000000..de45fa4687 --- /dev/null +++ b/docs/user_guide/scheduler.md @@ -0,0 +1,62 @@ + + +# Schedulers + +Triton supports batch inferencing by allowing individual inference +requests to specify a batch of inputs. The inferencing for a batch of +inputs is performed at the same time which is especially important for +GPUs since it can greatly increase inferencing throughput. In many use +cases the individual inference requests are not batched, therefore, +they do not benefit from the throughput benefits of batching. + +The inference server contains multiple scheduling and batching +algorithms that support many different model types and use-cases. More +information about model types and schedulers can be found in [Models +And Schedulers](architecture.md#models-and-schedulers). + +## Default Scheduler + +The default scheduler is used for a model if none of the +*scheduling_choice* properties are specified in the model +configuration. The default scheduler simply distributes inference +requests to all [model instances](model_configuration.md#instance-groups) configured for the +model. + +## Ensemble Scheduler + +The ensemble scheduler must be used for [ensemble + models](architecture.md#ensemble-models) and cannot be used for any + other type of model. + +The ensemble scheduler is enabled and configured independently for +each model using the *ModelEnsembleScheduling* property in the model +configuration. The settings describe the models that are included in +the ensemble and the flow of tensor values between the models. See +[Ensemble Models](architecture.md#ensemble-models) for more +information and examples. \ No newline at end of file