The RxDM process passively reserves ~4GB of memory per GPU by default to ensure the TCPX feature works for GPU-to-GPU networking. This is overkill, and it limits users' ability to maximize throughput by cutting the effective usable memory of an H100's 80GB HBM down to 76GB.

This can be manually lowered to ~1.5GB with no impact on performance (benchmarked at 512-GPU scale running Llama 2 pre-training on the NeMo framework) by adding --rx_pool_size 1073741824 to the RxDM launch command within https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/master/tools/prologs-epilogs/receive-data-path-manager
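For illustration, the change would look roughly like the sketch below. The docker run invocation, image name/tag, and surrounding flags are placeholders modeled on typical TCPX RxDM prolog scripts, not the exact contents of receive-data-path-manager; the only flag proposed here is --rx_pool_size 1073741824.

```bash
# Hypothetical excerpt of the receive-data-path-manager prolog.
# Everything except --rx_pool_size is an illustrative placeholder;
# the real script's image tag, mounts, and flags may differ.
docker run --rm --detach \
  --name receive-datapath-manager \
  --privileged --network=host --cap-add=NET_ADMIN \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:latest \
  --gpu_nic_preset a3vm \
  --gpu_shmem_type fd \
  --uds_path "/run/tcpx-${SLURM_JOB_ID}" \
  --rx_pool_size 1073741824   # shrink the rx buffer pool to 1 GiB (1073741824 bytes)
```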