Tensor parallel documentation #3359
Open
apbose wants to merge 6 commits into main from nccl_ops_documentation
Changes from all commits

Commits (6)
6394d79  Tensor parallel Llama3 tutorial illustrating use of torch.distributed… (apbose)
d2f83de  tensor_parallel_llama location change (apbose)
67115d5  chore: Fix docs and example (peri044)
61f57c0  README.rst for examples/distributed_inference/tensor_parallel_llama3 (apbose)
fb0ba7f  documentation changes to include examples/distributed_inference in re… (apbose)
58a4bb3  adding README instructions (apbose)
@@ -0,0 +1,83 @@
.. _tensor_parallel_llama3:

Torch-TensorRT Parallelism for Distributed Inference
=====================================================

Examples in this folder demonstrate distributed inference on multiple devices with the Torch-TensorRT backend.

Data Parallel Distributed Inference based on `Accelerate <https://huggingface.co/docs/accelerate/usage_guides/distributed_inference>`_
---------------------------------------------------------------------------------------------------------------------------------------

Using Accelerate, users can achieve data parallel distributed inference with the Torch-TensorRT backend.
In this case, the entire model is loaded onto each GPU, and different chunks of the batch input are processed on each device.

See the following examples for more details; a minimal sketch of the pattern follows the list.

- `data_parallel_gpt2.py <https://github.com/pytorch/TensorRT/blob/main/examples/distributed_inference/data_parallel_gpt2.py>`_
- `data_parallel_stable_diffusion.py <https://github.com/pytorch/TensorRT/blob/main/examples/distributed_inference/data_parallel_stable_diffusion.py>`_
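
The sketch below assumes two or more GPUs and the Hugging Face ``transformers`` package; the ``gpt2`` checkpoint, prompts, and compile options are illustrative placeholders rather than an excerpt from the examples above.

.. code-block:: python

    import torch
    import torch_tensorrt  # registers the "torch_tensorrt" compile backend
    from accelerate import PartialState
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Each process (one per GPU, e.g. launched via `accelerate launch`)
    # loads the full model onto its own device.
    state = PartialState()
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval().to(state.device)

    # Compile the forward pass with the Torch-TensorRT backend.
    model.forward = torch.compile(
        model.forward,
        backend="torch_tensorrt",
        options={"enabled_precisions": {torch.float16}},
        dynamic=False,
    )

    # Accelerate assigns a different chunk of the prompt batch to each process.
    prompts = ["GPT2 is a model developed by", "The capital of France is"]
    with state.split_between_processes(prompts) as prompt:
        inputs = tokenizer(prompt, return_tensors="pt").to(state.device)
        outputs = model.generate(**inputs, max_new_tokens=20)
        print(f"rank {state.process_index}:", tokenizer.batch_decode(outputs))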

Tensor Parallel Distributed Inference
-------------------------------------

Here, we use ``torch.distributed`` as an example, but compilation with tensor parallelism is agnostic to the implementation framework as long as the module is properly sharded (a sketch of such sharding follows the command below).

.. code-block:: bash

    torchrun --nproc_per_node=2 tensor_parallel_llama2.py
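
As a minimal sketch of what such a script does before compilation, the following shards a toy MLP with the ``torch.distributed`` tensor parallel APIs; the module, dimensions, and compile options are illustrative assumptions, not an excerpt from ``tensor_parallel_llama2.py``.

.. code-block:: python

    import os

    import torch
    import torch.nn as nn
    import torch_tensorrt  # registers the "torch_tensorrt" compile backend
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    # torchrun sets WORLD_SIZE/LOCAL_RANK; one process runs per GPU.
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    mesh = init_device_mesh("cuda", (world_size,))

    class ToyMLP(nn.Module):
        def __init__(self, dim: int = 4096):
            super().__init__()
            self.in_proj = nn.Linear(dim, 4 * dim, bias=False)
            self.out_proj = nn.Linear(4 * dim, dim, bias=False)

        def forward(self, x):
            return self.out_proj(torch.relu(self.in_proj(x)))

    # Column-shard the first projection and row-shard the second; the
    # all_reduce this introduces is what the next section expresses as a
    # TensorRT-LLM plugin to avoid graph breaks.
    model = parallelize_module(
        ToyMLP().cuda().eval(),
        mesh,
        {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()},
    )

    x = torch.randn(2, 4096, device="cuda")
    compiled = torch.compile(model, backend="torch_tensorrt", dynamic=False)
    print(compiled(x).shape)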

Tensor Parallel Distributed Inference on a Simple Model using NCCL Ops Plugin
------------------------------------------------------------------------------

We use `torch.distributed <https://pytorch.org/docs/stable/distributed.html>`_ to shard the model with tensor parallelism.
The distributed operations (``all_gather`` and ``all_reduce``) are then expressed as TensorRT-LLM plugins to avoid graph breaks during Torch-TensorRT compilation.
The `converters for these operators <https://github.com/pytorch/TensorRT/blob/main/py/torch_tensorrt/dynamo/conversion/custom_ops_converters.py#L25-L55>`_ are already available in Torch-TensorRT.
The functional implementations of these ops are imported from the ``tensorrt_llm`` package (specifically, ``libnvinfer_plugin_tensorrt_llm.so`` is required).
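
To make this concrete, the following sketch shows one way such a plugin library can be loaded and its plugins registered with TensorRT; the path is a placeholder, and in the examples this loading is handled by the initialization helper rather than user code.

.. code-block:: python

    import ctypes

    # Placeholder path; in the examples this typically comes from the
    # TRTLLM_PLUGINS_PATH environment variable (see Option 2 below).
    plugin_lib = ctypes.CDLL("/path/to/libnvinfer_plugin_tensorrt_llm.so")

    # Registering the plugins makes the NCCL all_gather/all_reduce plugin
    # creators visible to TensorRT's plugin registry.
    plugin_lib.initTrtLlmPlugins.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
    plugin_lib.initTrtLlmPlugins.restype = ctypes.c_bool
    assert plugin_lib.initTrtLlmPlugins(None, b"tensorrt_llm")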

There are two options for making this library available:

Option 1: Install TensorRT-LLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Follow the instructions to `install TensorRT-LLM <https://nvidia.github.io/TensorRT-LLM/installation/linux.html>`_.
Note that before installing TensorRT-LLM, you need to install the MPI development packages:

.. code-block:: bash

    apt install libmpich-dev
    apt install libopenmpi-dev

If the default installation fails due to issues like library version mismatches or Python incompatibilities, consider using Option 2.
After a successful installation, verify that the following import runs without errors:

.. code-block:: python

    import torch_tensorrt

The import might fail if ``tensorrt_llm`` overrides the ``torch_tensorrt`` dependencies.
Option 2 is preferable if you do not wish to install ``tensorrt_llm`` and its dependencies.

Option 2: Link the TensorRT-LLM Plugin Library Directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Alternatively, you can load ``libnvinfer_plugin_tensorrt_llm.so`` manually, as sketched after the steps below:

1. Download the `tensorrt_llm-0.16.0 <https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.16.0-cp310-cp310-linux_x86_64.whl#sha256=f86c6b89647802f49b26b4f6e40824701da14c0f053dbda3e1e7a8709d6939c7>`_ wheel file from NVIDIA's Python index.
2. Extract the wheel file and locate ``libnvinfer_plugin_tensorrt_llm.so`` under the ``tensorrt_llm/libs`` directory.
3. Set the environment variable ``TRTLLM_PLUGINS_PATH`` to the extracted path so that it is picked up by the `initialize_distributed_env() <https://github.com/pytorch/TensorRT/blob/54e36dbafe567c75f36b3edb22d6f49d4278c12a/examples/distributed_inference/tensor_parallel_initialize_dist.py#L45>`_ call.
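
A minimal sketch of these steps as shell commands, assuming a Python 3.10 Linux x86_64 environment (a wheel is a zip archive, so it can be unpacked directly; whether ``TRTLLM_PLUGINS_PATH`` should point at the ``.so`` file or its directory depends on the helper's implementation, and the ``.so`` path is assumed here):

.. code-block:: bash

    wget https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.16.0-cp310-cp310-linux_x86_64.whl
    unzip tensorrt_llm-0.16.0-cp310-cp310-linux_x86_64.whl -d trtllm_unpacked
    export TRTLLM_PLUGINS_PATH=$(pwd)/trtllm_unpacked/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so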

After configuring TensorRT-LLM or the TensorRT-LLM plugin library path, run the following command to illustrate tensor parallelism of a simple model and its compilation with Torch-TensorRT:
||
.. code-block:: bash | ||
|
||
mpirun -n 2 --allow-run-as-root python tensor_parallel_simple_example.py | ||
|
||
We also provide a tensor parallelism compilation example on a more advanced model like `Llama-3`. Run the following command: | ||
|
||
.. code-block:: bash | ||
|
||
mpirun -n 2 --allow-run-as-root python tensor_parallel_llama3.py | ||
|
||
Tutorials | ||
----------------------------------------- | ||
* :ref:`tensor_parallel_llama3`: Illustration of distributed inference on multiple devices with the Torch-TensorRT backend. |
examples/distributed_inference/tensor_parallel_initialize_dist.py
6 changes: 6 additions & 0 deletions
Review comment: Let's only recommend Option 2 at this point, with the fetching tool you are making.