[docker] added rdma support (#3619)

FrankLeeeee authored Feb 17, 2025
1 parent d03c4c2 commit c9565e4
Showing 7 changed files with 39 additions and 11 deletions.
22 changes: 18 additions & 4 deletions benchmark/deepseek_v3/README.md
@@ -7,6 +7,7 @@ Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/references/deepseek.html).

## Hardware Recommendation

- 8 x NVIDIA H200 GPUs

If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example of serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below.
@@ -18,19 +19,26 @@ For running on AMD MI300X, use this as a reference. [Running DeepSeek-R1 on a si
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded.

### Using Docker (Recommended)

```bash
# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest

# Launch
-docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
+docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```

If you are using RDMA, please note that:

1. `--network host` and `--privileged` are required for RDMA. If you don't need RDMA, you can remove them.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3` (see the sketch below).
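
One way to apply the RoCE setting is to pass it into the container at launch. This is a sketch that mirrors the Docker command above; the GID index `3` is only an example and depends on your NIC configuration.

```bash
# Same launch as above, with the RoCE GID index forwarded into the container.
# NCCL_IB_GID_INDEX=3 is an example value; use the index that matches your fabric.
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host --network=host --privileged \
  -e NCCL_IB_GID_INDEX=3 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```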

Add [performance optimization options](#performance-optimization-options) as needed.

### Using pip

```bash
# Installation
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
@@ -42,7 +50,9 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-r
Add [performance optimization options](#performance-optimization-options) as needed.

<a id="option_args"></a>

### Performance Optimization Options

[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.

- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
@@ -68,7 +78,8 @@ response = client.chat.completions.create(
print(response)
```
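
If you prefer to test without the Python client, the same kind of request can be sent to the server's OpenAI-compatible REST endpoint. This is a minimal sketch assuming the server from the launch command above is listening locally on port 30000.

```bash
# Hypothetical smoke test against the OpenAI-compatible chat endpoint.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "List three features of DeepSeek-V3."}]
  }'
```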

-### Example: Serving with two H20*8 nodes
+### Example: Serving with two H20\*8 nodes

In this example, there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Please **use the first node's IP** for both commands.

If the command fails, try setting the `GLOO_SOCKET_IFNAME` environment variable. For more information, see [Common Environment Variables](https://pytorch.org/docs/stable/distributed.html#common-environment-variables).
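
As a rough sketch of what the two launch commands look like (assuming SGLang's multi-node flags `--dist-init-addr`, `--nnodes`, and `--node-rank`; the rendezvous port `5000` and the interface name are placeholders):

```bash
# If the bootstrap step cannot pick the right NIC, pin it explicitly
# (eth0 is a placeholder; use your actual interface name):
# export GLOO_SOCKET_IFNAME=eth0

# Node 0 (10.0.0.1)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# Node 1 (10.0.0.2)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
  --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
```
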
@@ -85,7 +96,8 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
-### Example: Serving with two H200*8 nodes and docker
+### Example: Serving with two H200\*8 nodes and docker

There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
A single H200 node with 8 GPUs can run DeepSeek V3; the dual-node setup here just demonstrates multi-node usage.

@@ -120,6 +132,7 @@ docker run --gpus all \
```

To ensure functionality, we include a test from a client Docker container.

```bash
docker run --gpus all \
--shm-size 32g \
@@ -136,7 +149,8 @@ docker run --gpus all \

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
-### Example: Serving with four A100*8 nodes
+### Example: Serving with four A100\*8 nodes

To serve DeepSeek-V3 with A100 GPUs, we first need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with the [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).

Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assuming the first node's IP is `10.0.0.1` and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can use the following commands to launch the server.
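
As a sketch of the conversion step (the argument names `--input-fp8-hf-path` and `--output-bf16-hf-path` are taken from DeepSeek's script and should be checked against the linked source):

```bash
# Convert the downloaded FP8 checkpoint to BF16 before serving on A100s.
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/DeepSeek-V3 \
  --output-bf16-hf-path /path/to/DeepSeek-V3-BF16
```
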
1 change: 1 addition & 0 deletions docker/Dockerfile
@@ -14,6 +14,7 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 \
&& update-alternatives --set python3 /usr/bin/python3.10 && apt install python3.10-distutils -y \
&& apt install curl git sudo libibverbs-dev -y \
&& apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
&& curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py \
&& python3 --version \
&& python3 -m pip --version \
1 change: 1 addition & 0 deletions docker/Dockerfile.dev
@@ -21,6 +21,7 @@ RUN apt-get update && apt-get install -y \
pkg-config \
libssl-dev \
bear \
&& apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean

2 changes: 2 additions & 0 deletions docker/Dockerfile.rocm
@@ -20,6 +20,8 @@ ARG TRITON_COMMIT="improve_fa_decode_3.0.0"
ARG ATER_REPO="https://github.com/HaiShaw/ater"
ARG CK_COMMITS="fa05ae"

RUN apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1

RUN git clone ${SGL_REPO} \
&& cd sglang \
&& if [ "${SGL_BRANCH}" = ${SGL_DEFAULT} ]; then \
8 changes: 4 additions & 4 deletions docker/compose.yaml
@@ -7,7 +7,8 @@ services:
# If you use modelscope, you need to mount this directory
# - ${HOME}/.cache/modelscope:/root/.cache/modelscope
restart: always
-network_mode: host
+network_mode: host # required by RDMA
+privileged: true # required by RDMA
# Or you can publish only port 30000
# ports:
# - 30000:30000
@@ -16,8 +17,7 @@
# if you use modelscope to download the model, you need to set this environment variable
# - SGLANG_USE_MODELSCOPE: true
entrypoint: python3 -m sglang.launch_server
-command:
---model-path meta-llama/Llama-3.1-8B-Instruct
+command: --model-path meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 30000
ulimits:
@@ -31,5 +31,5 @@
reservations:
devices:
- driver: nvidia
-device_ids: ['0']
+device_ids: ["0"]
capabilities: [gpu]
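
A minimal way to use this compose file is from the repository's `docker/` directory (a sketch; adjust for your Compose version):

```bash
docker compose up -d     # start the sglang service in the background
docker compose logs -f   # follow the server logs
docker compose down      # stop and remove the container when done
```
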
9 changes: 7 additions & 2 deletions docs/developer/development_guide_using_docker.md
@@ -16,18 +16,23 @@ tar xf vscode_cli_alpine_x64_cli.tar.gz

The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers.

❗️ **Note on RDMA**

1. `--network host` and `--privileged` are required for RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm, so we enable these two flags by default in the commands below.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`. A quick in-container check for RDMA devices is sketched below.
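
One way to check that RDMA is usable from inside the container, using the diagnostics that the Dockerfiles now install (`infiniband-diags` and `perftest`); whether any devices show up depends on your host and fabric:

```bash
# List InfiniBand/RoCE devices and port state (from infiniband-diags).
ibstat

# Optional loopback bandwidth smoke test (from perftest):
# terminal 1 starts the server, terminal 2 connects to it.
ib_write_bw            # terminal 1
ib_write_bw localhost  # terminal 2
```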

### H100

```bash
# Change the name to yours
-docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
+docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```

### H200

```bash
-docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
+docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```

7 changes: 6 additions & 1 deletion docs/references/amd.md
@@ -63,13 +63,18 @@ docker build -t sglang_image -f Dockerfile.rocm .
2. Create a convenient alias.

```bash
-alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri \
+alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx \
-v /data:/data'
```
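
As a quick smoke test of the alias (a sketch; `sglang_image` is the tag built above), start an interactive container and, if you plan to use RDMA, confirm the IB/RoCE devices are visible:

```bash
drun sglang_image bash
# inside the container:
ibstat   # provided by the infiniband-diags package installed via Dockerfile.rocm
```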

If you are using RDMA, please note that:

1. `--network host` and `--privileged` are required for RDMA. If you don't need RDMA, you can remove them.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

3. Launch the server.

**NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
