microsoft · saeedmaleki · Nov 22, 2023 · Sep 5, 2023 · Nov 16, 2023 · Nov 16, 2023
diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,47 @@
+cff-version: 1.2.0
+title: "MSCCL++: A GPU-driven communication stack for scalable AI applications"
+version: 0.4.0
+message: >-
+  If you use this project in your research, please cite it as below.
+authors:
+  - given-names: Peng
+    family-names: Cheng
+    affiliation: Microsoft Research
+  - given-names: Changho
+    family-names: Hwang
+    affiliation: Microsoft Research
+  - given-names: Abhinav
+    family-names: Jangda
+    affiliation: Microsoft Research
+  - given-names: Suriya
+    family-names: Kalivardhan
+    affiliation: Microsoft Azure
+  - given-names: Binyang
+    family-names: Li
+    affiliation: Microsoft Azure
+  - given-names: Shuguang
+    family-names: Liu
+    affiliation: Microsoft Azure
+  - given-names: Saeed
+    family-names: Maleki
+    affiliation: Microsoft Research
+  - given-names: Madan
+    family-names: Musuvathi
+    affiliation: Microsoft Research
+  - given-names: Olli
+    family-names: Saarikivi
+    affiliation: Microsoft Research
+  - given-names: Wei
+    family-names: Tsui
+    affiliation: Microsoft Research
+  - given-names: Ziyue
+    family-names: Yang
+    affiliation: Microsoft Research
+
+repository-code: 'https://github.com/microsoft/mscclpp'
+abstract: >-
+  MSCCL++ redefines the interface for inter-GPU communication, thereby
+  delivering a highly efficient and customizable communication stack
+  tailored for distributed GPU applications.
+license: MIT
+license-url: https://github.com/microsoft/mscclpp/blob/main/LICENSE
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -2,7 +2,7 @@
 # Licensed under the MIT license.
 
 set(MSCCLPP_MAJOR "0")
-set(MSCCLPP_MINOR "3")
+set(MSCCLPP_MINOR "4")
 set(MSCCLPP_PATCH "0")
 
 set(MSCCLPP_SOVERSION ${MSCCLPP_MAJOR})

diff --git a/README.md b/README.md
@@ -1,32 +1,46 @@
 # MSCCL++
 
-GPU-driven computation & communication stack.
+[![Latest Release](https://img.shields.io/github/release/microsoft/mscclpp.svg)](https://github.com/microsoft/mscclpp/releases/latest)
+[![License](https://img.shields.io/github/license/microsoft/mscclpp.svg)](LICENSE)
+[![CodeQL](https://github.com/microsoft/mscclpp/actions/workflows/codeql-analysis.yml/badge.svg?branch=main)](https://github.com/microsoft/mscclpp/actions/workflows/codeql-analysis.yml)
 
-See [Quick Start](docs/quickstart.md) to quickly get started.
+| Pipelines                | Build Status      |
+|--------------------------|-------------------|
+| Unit Tests (CUDA)        | [![Build Status](https://dev.azure.com/binyli/HPC/_apis/build/status%2Fmscclpp-ut?branchName=main)](https://dev.azure.com/binyli/HPC/_build/latest?definitionId=4&branchName=main) |
+| Integration Tests (CUDA) | [![Build Status](https://dev.azure.com/binyli/HPC/_apis/build/status%2Fmscclpp-test?branchName=main)](https://dev.azure.com/binyli/HPC/_build/latest?definitionId=3&branchName=main) |
 
-See the latest performance evaluation on Azure [NDmv4](docs/performance-ndmv4.md).
+A GPU-driven communication stack for scalable AI applications.
 
-Build our Doxygen document by running `doxygen` in [`docs/`](docs/) directory. Run `python3 -m http.server <PORT>` in `docs/doxygen/html/` directory to serve the generated HTML files.
+See [Quick Start](docs/quickstart.md) to quickly get started.
 
 ## Overview
 
-MSCCL++ is a development kit for implementing highly optimized distributed GPU applications, in terms of both inter-GPU communication and GPU computation. MSCCL++ is specially designed for developers who want to fine-tune inter-GPU communication of their applications at the GPU kernel level, without awareness of detailed communication mechanisms. The key underlying concept of MSCCL++ is GPU-driven execution, where both communication and computation tasks are initiated by GPU not by CPU. That is, the communication and computation interfaces of MSCCL++ are provided as device-side APIs (called inside a GPU kernel), while the host-side APIs of MSCCL++ are for bootstrapping, initial connection setups, or background host threads for inter-GPU DMA and RDMA (called proxies). By using MSCCL++, we expect:
+MSCCL++ redefines inter-GPU communication interfaces, thereby delivering a highly efficient and customizable communication stack for distributed GPU applications. Its design is specifically tailored to accommodate diverse performance optimization scenarios often encountered in state-of-the-art AI applications. The followings highlight the key features of MSCCL++.
+
+* **On-GPU Interfaces.** MSCCL++ provides communication interfaces to be called by a **GPU thread**. Users can easily implement highly optimized communication logics inside a GPU kernel, without awareness of detailed communication mechanisms. This enables users to implement highly fine-grained system pipelining (i.e., hiding communication delays by overlapping with computation), which has been difficult for CPU-based interfaces.
+
+* **Fine-grained Abstracts.** MSCCL++ provides fine-grained abstracts for communication primitives, such as `put()`, `get()`, `signal()`, `flush()`, and `wait()`. This enables users to easily implement flexible communication logics, such as overlapping communication with computation, or implementing customized collective communication algorithms.
+
+* **Converged Interfaces.** MSCCL++ provides consistent interfaces regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink/xGMI or InfiniBand). This simplifies the code for inter-GPU communication, which is often complex and error-prone.
 
-* **Holistic Optimization for High GPU Utilization.** As both communication and computation are scheduled inside a GPU kernel at the same time, we can optimize end-to-end performance of distributed GPU applications from a global view. For example, we can minimize the GPU resource contention between communication and computation, which is known to often substantially degrade throughput of distributed deep learning applications.
+## Performance
 
-* **Fully Pipelined System to Reduce Overhead from the Control Plane.** We can eliminate control overhead from CPU by allowing GPU to autonomously schedule both communication and computation. This significantly reduces GPU scheduling overhead and CPU-GPU synchronization overhead. For example, this allows us to implement a highly fine-grained system pipelining (i.e., hiding communication delays by overlapping with computation), which has been difficult for CPU-controlled applications due to the large control/scheduling overhead.
+While the power of MSCCL++ is fully realized with application-specific optimization, it still delivers performance benefits even in pure-communication scenarios. The following figures provide a comparison of the AllReduce throughput of MSCCL++ against that of the latest version of NCCL. Tested over two [Azure NDmv4 SKUs](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series) (8 A100-80G GPUs per node).
 
-* **Runtime Performance Optimization for Dynamic Workload.** As we can easily implement flexible communication logics, we can optimize communication performance even during runtime. For example, we can implement the system to automatically choose different communication paths or different collective communication algorithms depending on the dynamic workload at runtime.
+| <center>Single-node AllReduce | <center>Two-node AllReduce |
+|-------------------------------|----------------------------|
+| <img src="./docs/figs/mscclpp_vs_nccl_comparison_num_nodes_1.jpeg" alt="MSCCL++ vs NCCL AllReduce (Single-node)" style="width: 400px;"/> | <img src="./docs/figs/mscclpp_vs_nccl_comparison_num_nodes_2.jpeg" alt="MSCCL++ vs NCCL AllReduce (Two-node)" style="width: 400px;"/> |
 
-## Key Features (v0.3)
+## Key Concepts
 
-MSCCL++ v0.3 supports the following features.
+The following highlights key concepts of MSCCL++.
 
-### In-Kernel Communication Interfaces
+### On-GPU Communication Interfaces: Channels
 
-MSCCL++ provides inter-GPU communication interfaces to be called by a GPU thread. For example, the `put()` method in the following example copies 1KB data from the local GPU to a remote GPU. `channel` is a peer-to-peer communication channel between two GPUs, which consists of information on send/receive buffers. `channel` is initialized from the host side before the kernel execution.
+MSCCL++ provides peer-to-peer communication methods between GPUs. A peer-to-peer connection between two GPUs is called a *Channel*. Channels are constructed by MSCCL++ host-side interfaces and copied to GPUs during initialization. Channels provide *GPU-side interfaces*, which means that all communication methods are defined as a device function to be called from a GPU kernel code. For example, the `put()` method in the following example copies 1KB data from the local GPU to a remote GPU.
 
 ```cpp
+// `ProxyChannel` will be explained in the following section.
 __device__ mscclpp::DeviceHandle<mscclpp::SimpleProxyChannel> channel;
 __global__ void gpuKernel() {
   ...
@@ -53,11 +67,17 @@ __device__ void barrier() {
 }
 ```
 
-MSCCL++ provides consistent in-kernel interfaces, i.e., the above interfaces are used regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink or InfiniBand).
+MSCCL++ provides consistent interfaces, i.e., the above interfaces are used regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink or InfiniBand).
+
+### ProxyChannel and SmChannel
+
+MSCCL++ delivers two types of channels, **ProxyChannel** and **SmChannel**. `ProxyChannel` provides (R)DMA-based data copy and synchronization methods. When called, these methods send/receive a signal to/from a host-side proxy (hence the name `ProxyChannel`), which will trigger (R)DMA (such as `cudaMemcpy*` or `ibv_post_send`) or regarding synchronization methods (such as `cudaStreamSynchronize` or `ibv_poll_cq`). Since the key functionalities are run by the proxy, `ProxyChannel` requires only a single GPU thread to call its methods. See all `ProxyChannel` methods from [here](./include/mscclpp/proxy_channel_device.hpp).
+
+On the other hand, `SmChannel` provides memory-mapping-based copy and synchronization methods. When called, these methods will directly use GPU threads to read/write from/to the remote GPU's memory space. Comparing against `ProxyChannel`, `SmChannel` is especially performant for low-latency scenarios, while it may need many GPU threads to call copying methods at the same time to achieve high copying bandwidth. See all `SmChannel` methods from [here](./include/mscclpp/sm_channel_device.hpp).
 
 ### Host-Side Communication Proxy
 
-Some in-kernel communication interfaces of MSCCL++ send requests (called triggers) to a GPU-external helper that conducts key functionalities such as DMA or RDMA. This helper is called a proxy service or a proxy in short. MSCCL++ provides a default implementation of a proxy, which is a background host thread that busy polls triggers from GPUs and conducts functionalities accordingly. For example, the following is a typical host-side code for MSCCL++.
+MSCCL++ provides a default implementation of a host-side proxy for ProxyChannels, which is a background host thread that busy polls triggers from GPUs and conducts functionalities accordingly. For example, the following is a typical host-side code for MSCCL++.
 
 ```cpp
 // Bootstrap: initialize control-plane connections between all ranks
@@ -120,19 +140,9 @@ public:
 
 Customized proxies can be used for conducting a series of pre-defined data transfers within only a single trigger from GPU at runtime. This would be more efficient than sending a trigger for each data transfer one by one.
 
-### Flexible Customization
-
-Most of key components of MSCCL++ are designed to be easily customized. This enables MSCCL++ to easily adopt a new software / hardware technology and lets users implement algorithms optimized for their own use cases.
-
-### New in MSCCL++ v0.3 (Latest Release)
-* Updated interfaces
-* Add Python bindings and interfaces
-* Add Python unit tests
-* Add more configurable parameters
-* Add a new single-node AllReduce kernel
-* Fix bugs
+### Python Interfaces
 
-See details from https://github.com/microsoft/mscclpp/issues/89.
+MSCCL++ provides Python bindings and interfaces, which simplifies integration with Python applications.
 
 ## Contributing
 

diff --git a/cmake/AddFormatTargets.cmake b/cmake/AddFormatTargets.cmake
@@ -26,11 +26,11 @@ find_program(BLACK black)
 if (BLACK)
     message(STATUS "Found black: ${BLACK}")
     add_custom_target(check-format-py
-        COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml --check ${PROJECT_SOURCE_DIR}/python ${PROJECT_SOURCE_DIR}/test
+        COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml --check ${PROJECT_SOURCE_DIR}
     )
     add_dependencies(check-format check-format-py)
     add_custom_target(format-py
-        COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml ${PROJECT_SOURCE_DIR}/python ${PROJECT_SOURCE_DIR}/test
+        COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml ${PROJECT_SOURCE_DIR}
     )
     add_dependencies(format format-py)
 else()

diff --git a/docs/.gitignore b/docs/.gitignore
@@ -1 +1,3 @@
 doxygen/
+_build/
+sphinx/
diff --git a/docs/Doxyfile b/docs/Doxyfile
@@ -2043,7 +2043,7 @@ MAN_LINKS              = NO
 # captures the structure of the code including all documentation.
 # The default value is: NO.
 
-GENERATE_XML           = NO
+GENERATE_XML           = YES
 
 # The XML_OUTPUT tag is used to specify where the XML pages will be put. If a
 # relative path is entered the value of OUTPUT_DIRECTORY will be put in front of

diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,27 @@
+## How to build docs
+
+1. Install `doxygen`.
+
+    ```bash
+    $ sudo apt-get install doxygen
+    ```
+
+2. Install Python packages below. If you install them on the user's local, you need to include `~/.local/bin` to `$PATH` (to use `sphinx-build`).
+
+    ```bash
+    $ sudo python3 -m pip install sphinx sphinx_rtd_theme breathe
+    ```
+
+3. Create Doxygen documents.
+
+    ```bash
+    $ doxygen
+    ```
+
+4. Create Sphinx documents.
+
+    ```bash
+    $ sphinx-build -b html -Dbreathe_projects.mscclpp=$PWD/doxygen/xml $PWD $PWD/sphinx
+    ```
+
+5. Done. The HTML files will be on `sphinx/` directory.
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,29 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = "mscclpp"
+copyright = "2023, MSCCL++ Team"
+author = "MSCCL++ Team"
+release = "v0.4.0"
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = ["breathe"]
+
+templates_path = ["_templates"]
+exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+
+# Breathe configuration
+breathe_default_project = "mscclpp"
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = "sphinx_rtd_theme"
+html_static_path = ["_static"]
diff --git a/docs/figs/mscclpp_vs_nccl_comparison_num_nodes_1.jpeg b/docs/figs/mscclpp_vs_nccl_comparison_num_nodes_1.jpeg
diff --git a/docs/figs/mscclpp_vs_nccl_comparison_num_nodes_2.jpeg b/docs/figs/mscclpp_vs_nccl_comparison_num_nodes_2.jpeg
diff --git a/docs/index.rst b/docs/index.rst
@@ -0,0 +1,26 @@
+.. MSCCL++ documentation master file, created by
+   sphinx-quickstart on Tue Sep  5 13:03:46 2023.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to MSCCL++'s documentation!
+===================================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
+
+Docs
+====
+
+.. doxygennamespace:: mscclpp
+   :members:
diff --git a/docs/make.bat b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd