[BUG] NCCL2.20.5 meets "Message truncated : received 1024 bytes instead of 256" error while 2.18.5 not #1273
Comments
Can you provide the log with |
Additional information:
|
Sure, I'll run with your environment variables and provide the logs, thanks. |
@sjeaugey here's the log from node rank 1 (ranks 8-15). |
We've fixed a similar-looking bug in NCCL 2.21; can you try with the latest version? |
@kiskra-nvidia Thanks for the information. We may try ngctorch:2404 or some other way to upgrade to NCCL 2.21+. By the way, are there any publicly disclosable details about the NCCL 2.20 bug? (Not about my problem itself, just out of curiosity about the technology involved.) My guess is that 2.20 may pick wrong MPI paths in some cases (drivers + nnodes + topo)? Looking forward to your reply, and thanks! |
Actually I'm not sure upgrading will help. The bug was a mixup of the connect with the following barrier and the barrier size was 8 bytes. Here all your sizes are more than 8. The log you provided only shows one node. Could it be your environment was not forwarded to the other node? That would also explain the crash, as the other node might have a different configuration ending up in a mismatch and a discrepancy in sizes we're trying to exchange. |
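One way to check the "environment not forwarded" hypothesis is to fingerprint the `NCCL_*` variables each process actually sees and compare the results across nodes. The sketch below is only an illustration: the helper names `nccl_env_fingerprint` and `find_mismatched_nodes` are hypothetical, and how the per-node hashes get collected in one place (logs, a shared file, a gather) is left as an assumption.

```python
import hashlib
import os
from collections import Counter


def nccl_env_fingerprint(environ=None):
    """Return (sorted NCCL_* settings, short hash) for the current process.

    Hypothetical helper: run it on every rank and print the result,
    then compare the hashes by eye or with find_mismatched_nodes().
    """
    environ = os.environ if environ is None else environ
    settings = sorted((k, v) for k, v in environ.items() if k.startswith("NCCL_"))
    blob = "\n".join(f"{k}={v}" for k, v in settings).encode()
    return settings, hashlib.sha256(blob).hexdigest()[:12]


def find_mismatched_nodes(fingerprints):
    """Given {node_name: hash}, return nodes whose hash differs from the most common one."""
    if not fingerprints:
        return []
    majority, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(node for node, h in fingerprints.items() if h != majority)


if __name__ == "__main__":
    settings, digest = nccl_env_fingerprint()
    print(f"host={os.uname().nodename} nccl_env_hash={digest} settings={settings}")
```

If one node prints a different hash, its `NCCL_*` settings differ from the others, which would be consistent with the size mismatch described above but does not prove it is the cause.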
@sjeaugey Hi, my 3 nodes have the same bare-metal config (8 H100 + 4 activated (8 in all) HDR NICs + 2 CPUs + PCIe 5), with containers run from the same image (ngctorch2403 + megatroncore0.6.0). If your guess is right, can my bug be reproduced by testing P2P between every pair of ranks (C_{24}^2 = 24*23/2 = 276 cases)? By the way, if all 276 P2P communications are OK, could a bug still appear when using a specific 5 ranks to do an all-reduce? |
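For reference, the 276 pairs do not have to be tested one at a time. A minimal sketch, assuming an even world size of 24, is a round-robin (circle method) schedule in which each round contains 12 disjoint pairs, so all pairs are covered in 23 rounds. The function name `p2p_rounds` is hypothetical, and the actual send/recv test for each pair is not shown.

```python
def p2p_rounds(world_size):
    """Build a round-robin schedule covering every unordered rank pair once.

    Uses the circle method: rank 0 stays fixed while the other ranks rotate.
    Each round is a list of disjoint (low_rank, high_rank) pairs, so all
    pairs in a round can be tested concurrently. Requires an even world size.
    """
    if world_size < 2 or world_size % 2 != 0:
        raise ValueError("world_size must be an even number >= 2")
    ranks = list(range(world_size))
    rounds = []
    for _ in range(world_size - 1):
        pairs = [
            tuple(sorted((ranks[i], ranks[world_size - 1 - i])))
            for i in range(world_size // 2)
        ]
        rounds.append(pairs)
        # Keep ranks[0] fixed and rotate the remaining ranks by one position.
        ranks = [ranks[0], ranks[-1]] + ranks[1:-1]
    return rounds


if __name__ == "__main__":
    schedule = p2p_rounds(24)
    total = sum(len(r) for r in schedule)
    print(f"{len(schedule)} rounds, {total} pairs in total")
```

This only enumerates the pairs. It says nothing about whether an all-reduce over a specific 5-rank subgroup would behave the same way, since that goes through its own communicator setup.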
env
test code