Zero update #421

ngc92 · 2024-05-16T11:16:53Z

First, clean up the multi-GPU code a bit. We only need to #ifdef away the sections that would otherwise lead to compile errors; anything else can be handled with a regular plain if, which makes the code more readable. This also gets rid of the two copies of the update function; let's keep them unified as long as we can.

The current ZERO-1 implementation performs an all-reduce on the gradients, but each rank actually needs only one shard.
As per the ZeRO paper:

State-of-art implementation of all-reduce uses a two-step approach, where the first step is a reduce-scatter operation, which reduces different part of the data on different process. The next step is an all-gather operation where each process gathers the reduced data on all the process. The result of these two steps is an all-reduce.

The claim that ZERO-1 does not introduce communication overhead hinges on the fact that we just insert the Adam update between these two ops. So we go from reduce-scatter -> all-gather -> Adam to reduce-scatter -> Adam -> all-gather, where the final all-gather now collects the updated parameter shards instead of the reduced gradients.
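To make the reordering concrete, here is a toy single-process sketch in Python. It is not the llm.c code: each "rank" is simulated as a plain list, and all function names (`reduce_scatter`, `all_gather`, `adam_step`, `zero1_step`, `baseline_step`) are made up for illustration. It assumes gradients are summed across ranks, that the parameter count divides evenly by the number of ranks, and that we are taking the very first Adam step (zero-initialized moments). Under those assumptions, it shows that updating only the local shard after a reduce-scatter and then all-gathering the parameters gives the same result as all-reducing the gradients and updating everything.

```python
# Toy simulation: each "rank" is a Python list; no real NCCL/MPI involved.

def reduce_scatter(grads_per_rank):
    """Rank r receives only shard r of the element-wise sum of all gradients."""
    world = len(grads_per_rank)
    n = len(grads_per_rank[0])
    shard = n // world  # assumes n is divisible by world
    summed = [sum(g[i] for g in grads_per_rank) for i in range(n)]
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]

def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

def all_reduce(grads_per_rank):
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(grads_per_rank))

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Plain element-wise Adam; updates m and v in place, returns new params."""
    out = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        out.append(p - lr * m_hat / (v_hat ** 0.5 + eps))
    return out

def baseline_step(params, grads_per_rank, t=1):
    """reduce-scatter -> all-gather -> Adam on the full parameter vector."""
    full_grad = all_reduce(grads_per_rank)[0]
    m, v = [0.0] * len(params), [0.0] * len(params)
    return adam_step(params, full_grad, m, v, t)

def zero1_step(params, grads_per_rank, t=1):
    """reduce-scatter -> Adam on the local shard only -> all-gather of params."""
    world = len(grads_per_rank)
    shard = len(params) // world
    grad_shards = reduce_scatter(grads_per_rank)
    new_shards = []
    for r in range(world):
        local_params = params[r * shard:(r + 1) * shard]
        m, v = [0.0] * shard, [0.0] * shard  # optimizer state exists per shard only
        new_shards.append(adam_step(local_params, grad_shards[r], m, v, t))
    return all_gather(new_shards)[0]

if __name__ == "__main__":
    params = [0.5, -1.0, 2.0, 0.0]
    grads = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]  # two simulated ranks
    print(zero1_step(params, grads) == baseline_step(params, grads))
```

Because Adam is element-wise, each rank can update its shard without ever seeing the other shards' gradients, so both orderings issue one reduce-scatter and one all-gather per step.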

karpathy · 2024-05-16T19:48:10Z

damn, good catch, we were doing a lot more communication than we needed to. merging

karpathy · 2024-05-16T19:49:21Z

also nice to merge the two updates into one function 👍

ngc92 added 3 commits May 16, 2024 14:12

simplify multi-gpu logic by reducing #ifdefs

57f70ea

reduce communication overhead for ZERO stage 1

8b57cf6

fixup profiling

fbd8f03

karpathy merged commit 6cfc7c5 into karpathy:master May 16, 2024
8 checks passed

chinthysl mentioned this pull request May 17, 2024

NCCL only multi-gpu multi-node training without MPI #426

Open

ngc92 deleted the zero-update branch May 19, 2024 08:37
