The RxDM process passively reserves ~4GB of memory per GPU by default to ensure the TCPX feature works for GPU-to-GPU networking. This is overkill, and it limits users' ability to maximize throughput by cutting the effective usable memory of an H100's 80GB HBM down to 76GB.

This can be manually lowered to ~1.5GB with no impact on performance (benchmarked at 512-GPU scale running Llama 2 pre-training on the NeMo framework) by adding --rx_pool_size 1073741824 to the RxDM launch command within https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/master/tools/prologs-epilogs/receive-data-path-manager
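For illustration, the change would look roughly like the sketch below. The docker run invocation, image name/tag, and surrounding flags are placeholders modeled on typical TCPX RxDM prolog scripts, not the exact contents of receive-data-path-manager; the only flag proposed here is --rx_pool_size 1073741824.

```bash
# Hypothetical excerpt of the receive-data-path-manager prolog.
# Everything except --rx_pool_size is an illustrative placeholder;
# the real script's image tag, mounts, and flags may differ.
docker run --rm --detach \
  --name receive-datapath-manager \
  --privileged --network=host --cap-add=NET_ADMIN \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:latest \
  --gpu_nic_preset a3vm \
  --gpu_shmem_type fd \
  --uds_path "/run/tcpx-${SLURM_JOB_ID}" \
  --rx_pool_size 1073741824   # shrink the rx buffer pool to 1 GiB (1073741824 bytes)
```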