From c9565e49e71d2fb5ddf439da34cafdcd2f317656 Mon Sep 17 00:00:00 2001
From: Shenggui Li
Date: Mon, 17 Feb 2025 15:36:16 +0800
Subject: [PATCH] [docker] added rdma support (#3619)

---
 benchmark/deepseek_v3/README.md | 22 +++++++++++++++----
 docker/Dockerfile | 1 +
 docker/Dockerfile.dev | 1 +
 docker/Dockerfile.rocm | 2 ++
 docker/compose.yaml | 8 +++----
 .../development_guide_using_docker.md | 9 ++++++--
 docs/references/amd.md | 7 +++++-
 7 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/benchmark/deepseek_v3/README.md b/benchmark/deepseek_v3/README.md
index ddd716560e9..62e0ec48238 100644
--- a/benchmark/deepseek_v3/README.md
+++ b/benchmark/deepseek_v3/README.md
@@ -7,6 +7,7 @@ Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/references/deepseek.html).

## Hardware Recommendation
+
- 8 x NVIDIA H200 GPUs

If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below.
@@ -18,19 +19,26 @@ For running on AMD MI300X, use this as a reference. [Running DeepSeek-R1 on a si

If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded.

### Using Docker (Recommended)
+
```bash
# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest

# Launch
-docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
+docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```

+If you are using RDMA, please note that:
+
+1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
+2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
+
Add [performance optimization options](#performance-optimization-options) as needed.

### Using pip
+
```bash
# Installation
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
@@ -42,7 +50,9 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-r

Add [performance optimization options](#performance-optimization-options) as needed.

+
### Performance Optimization Options
+
[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.

- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
@@ -68,7 +78,8 @@ response = client.chat.completions.create(
print(response)
```

-### Example: Serving with two H20*8 nodes
+### Example: Serving with two H20\*8 nodes
+
For example, there are two H20 nodes, each with 8 GPUs.
The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Please **use the first node's IP** for both commands.

If the command fails, try setting the `GLOO_SOCKET_IFNAME` parameter. For more information, see [Common Environment Variables](https://pytorch.org/docs/stable/distributed.html#common-environment-variables).
@@ -85,7 +96,8 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

-### Example: Serving with two H200*8 nodes
+### Example: Serving with two H200\*8 nodes
+
There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`. A single H200 node with 8 devices can run DeepSeek V3; the dual-H200 setup here is only meant to demonstrate multi-node usage.
@@ -120,6 +132,7 @@ docker run --gpus all \
```

To ensure functionality, we include a test from a client Docker container.
+
```bash
docker run --gpus all \
    --shm-size 32g \
@@ -136,7 +149,8 @@ docker run --gpus all \

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

-### Example: Serving with four A100*8 nodes
+### Example: Serving with four A100\*8 nodes
+
To serve DeepSeek-V3 with A100 GPUs, we first need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16, using the [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) from the DeepSeek-V3 repository. Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with eight 80 GB GPUs. Assuming the first node's IP is `10.0.0.1` and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can launch the server with the following commands.
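The per-node launch commands themselves fall outside the context captured by this hunk. As a rough, illustrative sketch only (assuming 4 nodes, `--tp 32`, and an arbitrary rendezvous port `5000` on the first node; the exact flags in the README may differ), each node would run something like:

```bash
# Illustrative sketch only: multi-node launch of the converted BF16 checkpoint.
# Run on every node, changing --node-rank to 0, 1, 2, 3; --dist-init-addr points at the first node.
python3 -m sglang.launch_server \
    --model-path /path/to/DeepSeek-V3-BF16 \
    --tp 32 \
    --dist-init-addr 10.0.0.1:5000 \
    --nnodes 4 \
    --node-rank 0 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```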
diff --git a/docker/Dockerfile b/docker/Dockerfile
index ba0ee5bd3f9..3ae74a8cccb 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -14,6 +14,7 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
 && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 \
 && update-alternatives --set python3 /usr/bin/python3.10 && apt install python3.10-distutils -y \
 && apt install curl git sudo libibverbs-dev -y \
+ && apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
 && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py \
 && python3 --version \
 && python3 -m pip --version \
diff --git a/docker/Dockerfile.dev b/docker/Dockerfile.dev
index 5ff1fa7a51a..6faa6ffe28b 100644
--- a/docker/Dockerfile.dev
+++ b/docker/Dockerfile.dev
@@ -21,6 +21,7 @@ RUN apt-get update && apt-get install -y \
    pkg-config \
    libssl-dev \
    bear \
+    && apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean
diff --git a/docker/Dockerfile.rocm b/docker/Dockerfile.rocm
index 1c55bd31de2..ff637a0a2f5 100644
--- a/docker/Dockerfile.rocm
+++ b/docker/Dockerfile.rocm
@@ -20,6 +20,8 @@ ARG TRITON_COMMIT="improve_fa_decode_3.0.0"
ARG ATER_REPO="https://github.com/HaiShaw/ater"
ARG CK_COMMITS="fa05ae"

+RUN apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
+
RUN git clone ${SGL_REPO} \
 && cd sglang \
 && if [ "${SGL_BRANCH}" = ${SGL_DEFAULT} ]; then \
diff --git a/docker/compose.yaml b/docker/compose.yaml
index c49d5c5bba5..f7ff1fbd565 100644
--- a/docker/compose.yaml
+++ b/docker/compose.yaml
@@ -7,7 +7,8 @@ services:
      # If you use modelscope, you need mount this directory
      # - ${HOME}/.cache/modelscope:/root/.cache/modelscope
    restart: always
-    network_mode: host
+    network_mode: host # required by RDMA
+    privileged: true # required by RDMA
    # Or you can only publish port 30000
    # ports:
    #   - 30000:30000
@@ -16,8 +17,7 @@ services:
      # if you use modelscope to download model, you need set this environment
      # - SGLANG_USE_MODELSCOPE: true
    entrypoint: python3 -m sglang.launch_server
-    command:
-      --model-path meta-llama/Llama-3.1-8B-Instruct
+    command: --model-path meta-llama/Llama-3.1-8B-Instruct
      --host 0.0.0.0
      --port 30000
    ulimits:
@@ -31,5 +31,5 @@ services:
      reservations:
        devices:
          - driver: nvidia
-            device_ids: ['0']
+            device_ids: ["0"]
            capabilities: [gpu]
diff --git a/docs/developer/development_guide_using_docker.md b/docs/developer/development_guide_using_docker.md
index 918057d0e96..963d04fb82e 100644
--- a/docs/developer/development_guide_using_docker.md
+++ b/docs/developer/development_guide_using_docker.md
@@ -16,18 +16,23 @@ tar xf vscode_cli_alpine_x64_cli.tar.gz

The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers.

+❗️ **Note on RDMA**
+
+ 1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm, so we enable these two flags by default in the commands below.
+ 2.
You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`. + ### H100 ```bash # Change the name to yours -docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh +docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh docker exec -it sglang_zhyncs /bin/zsh ``` ### H200 ```bash -docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh +docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh docker exec -it sglang_zhyncs /bin/zsh ``` diff --git a/docs/references/amd.md b/docs/references/amd.md index 9109129d5c7..4b1c8230d06 100644 --- a/docs/references/amd.md +++ b/docs/references/amd.md @@ -63,13 +63,18 @@ docker build -t sglang_image -f Dockerfile.rocm . 2. Create a convenient alias. ```bash -alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri \ +alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \ --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \ --security-opt seccomp=unconfined \ -v $HOME/dockerx:/dockerx \ -v /data:/data' ``` +If you are using RDMA, please note that: + +1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them. +2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`. + 3. Launch the server. **NOTE:** Replace `` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
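As a minimal, illustrative sketch of how these notes combine (the image name `sglang_image` comes from the build step above, while the model and the token placeholder are assumptions rather than part of the documented command):

```bash
# Illustrative sketch: RoCE-aware launch via the drun alias defined above.
# NCCL_IB_GID_INDEX is only needed on RoCE fabrics; drop it on plain InfiniBand or TCP.
drun -e HF_TOKEN=<your_hf_token> -e NCCL_IB_GID_INDEX=3 sglang_image \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 30000
```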