[2024-04-30 17:58:40,124] torch.distributed.run: [WARNING]
[2024-04-30 17:58:40,124] torch.distributed.run: [WARNING] *****************************************
[2024-04-30 17:58:40,124] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-30 17:58:40,124] torch.distributed.run: [WARNING] *****************************************
10 24
12 24
14 24
8 24
15 24
11 24
13 24
9 24
tensor([1, 2], device='cuda:2')
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:763: UserWarning: Running all_reduce on global rank 10 which does not belong to the given group.
  warnings.warn(
tensor([1, 2], device='cuda:4')
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:763: UserWarning: Running all_reduce on global rank 12 which does not belong to the given group.
  warnings.warn(
tensor([1, 2], device='cuda:6')
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:763: UserWarning: Running all_reduce on global rank 14 which does not belong to the given group.
  warnings.warn(
tensor([1, 2], device='cuda:0')
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:763: UserWarning: Running all_reduce on global rank 8 which does not belong to the given group.
  warnings.warn(
tensor([1, 2], device='cuda:3')
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:763: UserWarning: Running all_reduce on global rank 11 which does not belong to the given group.
  warnings.warn(
tensor([1, 2], device='cuda:7')
tensor([1, 2], device='cuda:5')
tensor([1, 2], device='cuda:1')
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:763: UserWarning: Running all_reduce on global rank 9 which does not belong to the given group.
  warnings.warn(
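The script that produced this output is not shown here; the [rank15] traceback at the end of the log points at /workspace/test.py, line 15, calling dist.all_reduce(a,group=group,op=dist.ReduceOp.SUM). A minimal sketch of what such a script might look like follows. Everything other than that all_reduce call (the subgroup's rank list, the prints, the device selection) is an assumption chosen to match the "rank world_size" pairs and tensor([1, 2], device='cuda:N') lines above.

# Hypothetical reconstruction of /workspace/test.py (not the author's actual file).
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
print(rank, world_size)                      # produces the "10 24", "12 24", ... pairs above

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

# Assumed subgroup: the NCCL logs below show a communicator with nranks 5;
# the exact membership here is a guess, not taken from the log.
group = dist.new_group(ranks=[5, 7, 13, 15, 21])

a = torch.tensor([1, 2], device=f"cuda:{local_rank}")
print(a)                                     # produces the tensor([1, 2], device='cuda:N') lines

# Ranks outside the subgroup emit the UserWarning above and skip the collective;
# ranks inside it go on to create the NCCL communicator seen in the logs below.
dist.all_reduce(a, group=group, op=dist.ReduceOp.SUM)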
c01n38:26523:26523 [0] NCCL INFO cudaDriverVersion 12040
c01n38:26523:26523 [0] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26525:26525 [2] NCCL INFO cudaDriverVersion 12040
c01n38:26525:26525 [2] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26527:26527 [4] NCCL INFO cudaDriverVersion 12040
c01n38:26527:26527 [4] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26529:26529 [6] NCCL INFO cudaDriverVersion 12040
c01n38:26529:26529 [6] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26523:26582 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26523:26582 [0] NCCL INFO P2P plugin IBext_v8
c01n38:26526:26526 [3] NCCL INFO cudaDriverVersion 12040
c01n38:26526:26526 [3] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26528:26528 [0] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26528:26528 [5] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
c01n38:26524:26524 [1] NCCL INFO cudaDriverVersion 12040
c01n38:26524:26524 [1] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26530:26530 [7] NCCL INFO cudaDriverVersion 12040
c01n38:26530:26530 [7] NCCL INFO Bootstrap : Using ibs255:11.1.1.112<0>
c01n38:26523:26582 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26523:26582 [0] NCCL INFO Using non-device net plugin version 0
c01n38:26523:26582 [0] NCCL INFO Using network IBext_v8
c01n38:26525:26583 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26525:26583 [2] NCCL INFO P2P plugin IBext_v8
c01n38:26527:26584 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26527:26584 [4] NCCL INFO P2P plugin IBext_v8
c01n38:26525:26583 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26525:26583 [2] NCCL INFO Using non-device net plugin version 0
c01n38:26525:26583 [2] NCCL INFO Using network IBext_v8
c01n38:26529:26586 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26529:26586 [6] NCCL INFO P2P plugin IBext_v8
c01n38:26526:26588 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26526:26588 [3] NCCL INFO P2P plugin IBext_v8
c01n38:26528:26589 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26528:26589 [5] NCCL INFO P2P plugin IBext_v8
c01n38:26527:26584 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26527:26584 [4] NCCL INFO Using non-device net plugin version 0
c01n38:26527:26584 [4] NCCL INFO Using network IBext_v8
c01n38:26524:26591 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26524:26591 [1] NCCL INFO P2P plugin IBext_v8
c01n38:26530:26593 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c01n38:26530:26593 [7] NCCL INFO P2P plugin IBext_v8
c01n38:26529:26586 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26529:26586 [6] NCCL INFO Using non-device net plugin version 0
c01n38:26529:26586 [6] NCCL INFO Using network IBext_v8
c01n38:26526:26588 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26526:26588 [3] NCCL INFO Using non-device net plugin version 0
c01n38:26526:26588 [3] NCCL INFO Using network IBext_v8
c01n38:26528:26589 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26528:26589 [5] NCCL INFO Using non-device net plugin version 0
c01n38:26528:26589 [5] NCCL INFO Using network IBext_v8
c01n38:26524:26591 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26524:26591 [1] NCCL INFO Using non-device net plugin version 0
c01n38:26524:26591 [1] NCCL INFO Using network IBext_v8
c01n38:26530:26593 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibs255:11.1.1.112<0>
c01n38:26530:26593 [7] NCCL INFO Using non-device net plugin version 0
c01n38:26530:26593 [7] NCCL INFO Using network IBext_v8
c01n38:26530:26593 [7] NCCL INFO comm 0x560c7407fa40 rank 1 nranks 5 cudaDev 7 nvmlDev 7 busId d7000 commId 0x36a93d2535b07f63 - Init START
c01n38:26528:26589 [5] NCCL INFO comm 0x560ea1082fc0 rank 0 nranks 5 cudaDev 5 nvmlDev 5 busId 91000 commId 0x36a93d2535b07f63 - Init START
c01n38:26528:26589 [5] NCCL INFO === System : maxBw 24.0 totalBw 164.8 ===
c01n38:26528:26589 [5] NCCL INFO CPU/1 (1/1/2)
c01n38:26528:26589 [5] NCCL INFO + PCI[48.0] - PCI/8C000 (1000c0301000ffff)
c01n38:26528:26589 [5] NCCL INFO + PCI[48.0] - GPU/91000 (0)
c01n38:26528:26589 [5] NCCL INFO + NVL[164.8] - NVS/0
c01n38:26528:26589 [5] NCCL INFO + PCI[24.0] - NIC/96000
c01n38:26528:26589 [5] NCCL INFO + NET[25.0] - NET/2 (d659e80003ae6d94/1/25.000000)
c01n38:26528:26589 [5] NCCL INFO + PCI[48.0] - PCI/D0000 (1000c0301000ffff)
c01n38:26528:26589 [5] NCCL INFO + PCI[48.0] - GPU/D7000 (1)
c01n38:26528:26589 [5] NCCL INFO + NVL[164.8] - NVS/0
c01n38:26528:26589 [5] NCCL INFO + PCI[24.0] - NIC/DA000
c01n38:26528:26589 [5] NCCL INFO + NET[25.0] - NET/3 (ba59e80003ae6d94/1/25.000000)
c01n38:26528:26589 [5] NCCL INFO + SYS[10.0] - CPU/0
c01n38:26528:26589 [5] NCCL INFO CPU/0 (1/1/2)
c01n38:26528:26589 [5] NCCL INFO + PCI[48.0] - PCI/17000 (1000c0301000ffff)
c01n38:26528:26589 [5] NCCL INFO + PCI[24.0] - NIC/21000
c01n38:26528:26589 [5] NCCL INFO + NET[25.0] - NET/0 (9a59e80003ae6d94/1/25.000000)
c01n38:26528:26589 [5] NCCL INFO + PCI[48.0] - PCI/3D000 (1000c0301000ffff)
c01n38:26528:26589 [5] NCCL INFO + PCI[24.0] - NIC/47000
c01n38:26528:26589 [5] NCCL INFO + NET[25.0] - NET/1 (b659e80003ae6d94/1/25.000000)
c01n38:26528:26589 [5] NCCL INFO + SYS[10.0] - CPU/1
c01n38:26528:26589 [5] NCCL INFO ==========================================
c01n38:26528:26589 [5] NCCL INFO GPU/91000 :GPU/91000 (0/5000.000000/LOC) GPU/D7000 (2/164.800003/NVL) NVS/0 (1/164.800003/NVL) CPU/1 (2/48.000000/PHB) CPU/0 (3/10.000000/SYS) NET/2 (3/24.000000/PIX) NET/3 (5/24.000000/PXN) NET/0 (6/10.000000/SYS) NET/1 (6/10.000000/SYS)
c01n38:26528:26589 [5] NCCL INFO GPU/D7000 :GPU/91000 (2/164.800003/NVL) GPU/D7000 (0/5000.000000/LOC) NVS/0 (1/164.800003/NVL) CPU/1 (2/48.000000/PHB) CPU/0 (3/10.000000/SYS) NET/2 (5/24.000000/PXN) NET/3 (3/24.000000/PIX) NET/0 (6/10.000000/SYS) NET/1 (6/10.000000/SYS)
c01n38:26528:26589 [5] NCCL INFO NET/2 :GPU/91000 (3/24.000000/PIX) GPU/D7000 (5/24.000000/PHB) CPU/1 (3/24.000000/PHB) CPU/0 (4/10.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (6/24.000000/PHB) NET/0 (7/10.000000/SYS) NET/1 (7/10.000000/SYS)
c01n38:26528:26589 [5] NCCL INFO NET/3 :GPU/91000 (5/24.000000/PHB) GPU/D7000 (3/24.000000/PIX) CPU/1 (3/24.000000/PHB) CPU/0 (4/10.000000/SYS) NET/2 (6/24.000000/PHB) NET/3 (0/5000.000000/LOC) NET/0 (7/10.000000/SYS) NET/1 (7/10.000000/SYS)
c01n38:26528:26589 [5] NCCL INFO NET/0 :GPU/91000 (6/10.000000/SYS) GPU/D7000 (6/10.000000/SYS) CPU/1 (4/10.000000/SYS) CPU/0 (3/24.000000/PHB) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (6/24.000000/PHB)
c01n38:26528:26589 [5] NCCL INFO NET/1 :GPU/91000 (6/10.000000/SYS) GPU/D7000 (6/10.000000/SYS) CPU/1 (4/10.000000/SYS) CPU/0 (3/24.000000/PHB) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/0 (6/24.000000/PHB) NET/1 (0/5000.000000/LOC)
c01n38:26528:26589 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
c01n38:26530:26593 [7] NCCL INFO === System : maxBw 24.0 totalBw 164.8 ===
c01n38:26530:26593 [7] NCCL INFO CPU/1 (1/1/2)
c01n38:26530:26593 [7] NCCL INFO + PCI[48.0] - PCI/8C000 (1000c0301000ffff)
c01n38:26530:26593 [7] NCCL INFO + PCI[48.0] - GPU/91000 (0)
c01n38:26530:26593 [7] NCCL INFO + NVL[164.8] - NVS/0
c01n38:26530:26593 [7] NCCL INFO + PCI[24.0] - NIC/96000
c01n38:26530:26593 [7] NCCL INFO + NET[25.0] - NET/2 (d659e80003ae6d94/1/25.000000)
c01n38:26530:26593 [7] NCCL INFO + PCI[48.0] - PCI/D0000 (1000c0301000ffff)
c01n38:26530:26593 [7] NCCL INFO + PCI[48.0] - GPU/D7000 (1)
c01n38:26530:26593 [7] NCCL INFO + NVL[164.8] - NVS/0
c01n38:26530:26593 [7] NCCL INFO + PCI[24.0] - NIC/DA000
c01n38:26530:26593 [7] NCCL INFO + NET[25.0] - NET/3 (ba59e80003ae6d94/1/25.000000)
c01n38:26530:26593 [7] NCCL INFO + SYS[10.0] - CPU/0
c01n38:26530:26593 [7] NCCL INFO CPU/0 (1/1/2)
c01n38:26530:26593 [7] NCCL INFO + PCI[48.0] - PCI/17000 (1000c0301000ffff)
c01n38:26530:26593 [7] NCCL INFO + PCI[24.0] - NIC/21000
c01n38:26530:26593 [7] NCCL INFO + NET[25.0] - NET/0 (9a59e80003ae6d94/1/25.000000)
c01n38:26530:26593 [7] NCCL INFO + PCI[48.0] - PCI/3D000 (1000c0301000ffff)
c01n38:26530:26593 [7] NCCL INFO + PCI[24.0] - NIC/47000
c01n38:26530:26593 [7] NCCL INFO + NET[25.0] - NET/1 (b659e80003ae6d94/1/25.000000)
c01n38:26530:26593 [7] NCCL INFO + SYS[10.0] - CPU/1
c01n38:26530:26593 [7] NCCL INFO ==========================================
c01n38:26530:26593 [7] NCCL INFO GPU/91000 :GPU/91000 (0/5000.000000/LOC) GPU/D7000 (2/164.800003/NVL) NVS/0 (1/164.800003/NVL) CPU/1 (2/48.000000/PHB) CPU/0 (3/10.000000/SYS) NET/2 (3/24.000000/PIX) NET/3 (5/24.000000/PXN) NET/0 (6/10.000000/SYS) NET/1 (6/10.000000/SYS)
c01n38:26530:26593 [7] NCCL INFO GPU/D7000 :GPU/91000 (2/164.800003/NVL) GPU/D7000 (0/5000.000000/LOC) NVS/0 (1/164.800003/NVL) CPU/1 (2/48.000000/PHB) CPU/0 (3/10.000000/SYS) NET/2 (5/24.000000/PXN) NET/3 (3/24.000000/PIX) NET/0 (6/10.000000/SYS) NET/1 (6/10.000000/SYS)
c01n38:26530:26593 [7] NCCL INFO NET/2 :GPU/91000 (3/24.000000/PIX) GPU/D7000 (5/24.000000/PHB) CPU/1 (3/24.000000/PHB) CPU/0 (4/10.000000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (6/24.000000/PHB) NET/0 (7/10.000000/SYS) NET/1 (7/10.000000/SYS)
c01n38:26530:26593 [7] NCCL INFO NET/3 :GPU/91000 (5/24.000000/PHB) GPU/D7000 (3/24.000000/PIX) CPU/1 (3/24.000000/PHB) CPU/0 (4/10.000000/SYS) NET/2 (6/24.000000/PHB) NET/3 (0/5000.000000/LOC) NET/0 (7/10.000000/SYS) NET/1 (7/10.000000/SYS)
c01n38:26530:26593 [7] NCCL INFO NET/0 :GPU/91000 (6/10.000000/SYS) GPU/D7000 (6/10.000000/SYS) CPU/1 (4/10.000000/SYS) CPU/0 (3/24.000000/PHB) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (6/24.000000/PHB)
c01n38:26530:26593 [7] NCCL INFO NET/1 :GPU/91000 (6/10.000000/SYS) GPU/D7000 (6/10.000000/SYS) CPU/1 (4/10.000000/SYS) CPU/0 (3/24.000000/PHB) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/0 (6/24.000000/PHB) NET/1 (0/5000.000000/LOC)
c01n38:26530:26593 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
c01n38:26530:26593 [7] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 24.000000/24.000000, type NVL/PXN, sameChannels 0
c01n38:26530:26593 [7] NCCL INFO 0 : NET/2 GPU/0 GPU/1 NET/2
c01n38:26530:26593 [7] NCCL INFO 1 : NET/3 GPU/1 GPU/0 NET/3
c01n38:26530:26593 [7] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 48.000000/24.000000, type NVL/PIX, sameChannels 0
c01n38:26530:26593 [7] NCCL INFO 0 : NET/2 GPU/0 GPU/1 NET/2
c01n38:26530:26593 [7] NCCL INFO 1 : NET/3 GPU/1 GPU/0 NET/3
c01n38:26528:26589 [5] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 24.000000/24.000000, type NVL/PXN, sameChannels 0
c01n38:26528:26589 [5] NCCL INFO 0 : NET/2 GPU/0 GPU/1 NET/2
c01n38:26528:26589 [5] NCCL INFO 1 : NET/3 GPU/1 GPU/0 NET/3
c01n38:26528:26589 [5] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 48.000000/24.000000, type NVL/PIX, sameChannels 0
c01n38:26528:26589 [5] NCCL INFO 0 : NET/2 GPU/0 GPU/1 NET/2
c01n38:26528:26589 [5] NCCL INFO 1 : NET/3 GPU/1 GPU/0 NET/3
c01n38:26530:26593 [7] NCCL INFO comm 0x560c7407fa40 rank 1 nRanks 5 nNodes 2 localRanks 2 localRank 1 MNNVL 0
c01n38:26528:26589 [5] NCCL INFO comm 0x560ea1082fc0 rank 0 nRanks 5 nNodes 2 localRanks 2 localRank 0 MNNVL 0
c01n38:26530:26593 [7] NCCL INFO Tree 1 : -1 -> 1 -> 0/3/-1
c01n38:26528:26589 [5] NCCL INFO Tree 0 : -1 -> 0 -> 1/2/-1
c01n38:26530:26593 [7] NCCL INFO Tree 3 : 3 -> 1 -> 0/-1/-1
c01n38:26528:26589 [5] NCCL INFO Tree 2 : 2 -> 0 -> 1/-1/-1
c01n38:26530:26593 [7] NCCL INFO Ring 00 : 0 -> 1 -> 2
c01n38:26530:26593 [7] NCCL INFO Ring 01 : 4 -> 1 -> 0
c01n38:26530:26593 [7] NCCL INFO Ring 02 : 0 -> 1 -> 2
c01n38:26530:26593 [7] NCCL INFO Ring 03 : 4 -> 1 -> 0
c01n38:26530:26593 [7] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/3/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->3
c01n38:26530:26593 [7] NCCL INFO P2P Chunksize set to 131072
c01n38:26528:26589 [5] NCCL INFO Channel 00/04 : 0 1 2 4 3
c01n38:26528:26589 [5] NCCL INFO Channel 01/04 : 0 3 2 4 1
c01n38:26528:26589 [5] NCCL INFO Channel 02/04 : 0 1 2 4 3
c01n38:26528:26589 [5] NCCL INFO Channel 03/04 : 0 3 2 4 1
c01n38:26528:26589 [5] NCCL INFO Ring 00 : 3 -> 0 -> 1
c01n38:26528:26589 [5] NCCL INFO Ring 01 : 1 -> 0 -> 3
c01n38:26528:26589 [5] NCCL INFO Ring 02 : 3 -> 0 -> 1
c01n38:26528:26589 [5] NCCL INFO Ring 03 : 1 -> 0 -> 3
c01n38:26528:26589 [5] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->2 [3] -1/-1/-1->0->1
c01n38:26528:26589 [5] NCCL INFO P2P Chunksize set to 131072
c01n38:26528:26589 [5] NCCL INFO Channel 00/0 : 0[5] -> 1[7] via P2P/CUMEM
c01n38:26528:26589 [5] NCCL INFO Channel 02/0 : 0[5] -> 1[7] via P2P/CUMEM
c01n38:26530:26593 [7] NCCL INFO Channel 00/0 : 1[7] -> 2[1] [send] via NET/IBext_v8/2(0)/GDRDMA
c01n38:26530:26593 [7] NCCL INFO Channel 02/0 : 1[7] -> 2[1] [send] via NET/IBext_v8/2(0)/GDRDMA
c01n38:26530:26593 [7] bootstrap.cc:77 NCCL WARN Message truncated : received 1024 bytes instead of 256
c01n38:26528:26589 [5] NCCL INFO Channel 00/0 : 3[3] -> 0[5] [receive] via NET/IBext_v8/2/GDRDMA
c01n38:26530:26593 [7] NCCL INFO bootstrap.cc:567 -> 3
c01n38:26530:26593 [7] NCCL INFO transport.cc:139 -> 3
c01n38:26530:26593 [7] NCCL INFO init.cc:1222 -> 3
c01n38:26530:26593 [7] NCCL INFO init.cc:1501 -> 3
c01n38:26530:26593 [7] NCCL INFO group.cc:64 -> 3 [Async thread]
c01n38:26528:26589 [5] NCCL INFO Channel 02/0 : 3[3] -> 0[5] [receive] via NET/IBext_v8/2/GDRDMA
c01n38:26530:26530 [7] NCCL INFO group.cc:418 -> 3
c01n38:26530:26530 [7] NCCL INFO group.cc:95 -> 3
c01n38:26528:26589 [5] bootstrap.cc:77 NCCL WARN Message truncated : received 1024 bytes instead of 256
c01n38:26528:26589 [5] NCCL INFO bootstrap.cc:567 -> 3
c01n38:26528:26589 [5] NCCL INFO transport.cc:140 -> 3
c01n38:26528:26589 [5] NCCL INFO init.cc:1222 -> 3
c01n38:26528:26589 [5] NCCL INFO init.cc:1501 -> 3
c01n38:26528:26589 [5] NCCL INFO group.cc:64 -> 3 [Async thread]
c01n38:26528:26528 [5] NCCL INFO group.cc:418 -> 3
c01n38:26528:26528 [5] NCCL INFO group.cc:95 -> 3
[rank15]: Traceback (most recent call last):
[rank15]:   File "/workspace/test.py", line 15, in <module>
[rank15]:     dist.all_reduce(a,group=group,op=dist.ReduceOp.SUM)
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank15]:     return func(*args, **kwargs)
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2128, in all_reduce
[rank15]:     work = group.allreduce([tensor], opts)
[rank15]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2006, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank15]: ncclInternalError: Internal check failed.
[rank15]: Last error:
[rank15]: Message truncated : received 1024 bytes instead of 256
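The failure itself is the NCCL bootstrap check "Message truncated : received 1024 bytes instead of 256", which surfaces in Python as ncclInternalError on rank 15 while the 5-rank communicator is being set up. One common way to end up with both the earlier UserWarnings and a bootstrap size mismatch, consistent with (though not proven by) this log, is subgroups being created inconsistently across processes: torch.distributed.new_group must be called by every process in the default group, with the same rank lists in the same order, even by processes that will not be members of the resulting group. A hedged sketch of that pattern follows; the rank lists are placeholders, not recovered from the log.

# Sketch of a consistent subgroup setup; the rank lists below are placeholders.
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Every process executes the same new_group() calls in the same order,
# including processes that do not belong to a given subgroup.
subgroup_rank_lists = [[0, 1, 2, 3, 4], [5, 7, 13, 15, 21]]
groups = [dist.new_group(ranks=r) for r in subgroup_rank_lists]

# Find the subgroup (if any) this rank belongs to, and only then run the collective.
my_group = next((g for r, g in zip(subgroup_rank_lists, groups) if rank in r), None)

a = torch.tensor([1, 2], device="cuda")
if my_group is not None:
    dist.all_reduce(a, group=my_group, op=dist.ReduceOp.SUM)

Guarding the collective with the membership check also silences the "Running all_reduce on global rank ... which does not belong to the given group" UserWarning, since only member ranks enter the call at all.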