Speed up data loading / batching for ONE BILLION WORD experiment #169
Conversation
Can you comment a little on how the new partitioning scheme speeds up the data loading, please? I see that the new training approach means that each node only sees a subset of the data, and it is always the same subset (because of the removal of the shuffle, and the modulus over the number of nodes). How does this speed up data loading? Does each node cache data it has loaded previously in some way? Does the reduction in randomness mean that convergence on a per-step basis takes a little longer?
@hpasapp so the 1B dataset is partitioned into 100 files, each containing randomly shuffled sentences. Let's say we have 8 GPUs / training processes and we'd like to distribute the data for multi-GPU training. There are two options, for example:
1. Assign each process a disjoint subset of the 100 files (e.g. file i goes to process i % 8), so a process only ever reads its own files.
2. Have every process read every file but keep only every 8th sentence.
The previous implementation is option 2, but as you can already see, this option introduces a lot more I/O because a process has to read an entire file but only use 1/8 of its data (specifically, I believe it has to read a new file every 100-200 batches). The PR uses option 1, which indeed increases the computation intensity per read and reduces the training time by a lot.
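For concreteness, here is a minimal sketch of the two options in Python. The function names and the file-list argument are hypothetical, not the repository's actual API:

```python
def shard_by_file(all_files, rank, world_size):
    """Option 1 (the PR's approach): each process reads only the files
    assigned to it, e.g. with 100 files and 8 processes, rank 0 reads
    files 0, 8, 16, ..."""
    return [f for i, f in enumerate(all_files) if i % world_size == rank]

def shard_by_sentence(all_files, rank, world_size):
    """Option 2 (the previous behavior): every process reads every file
    in full, but keeps only every world_size-th sentence."""
    for path in all_files:
        with open(path) as fp:               # the whole file is read by every rank
            for i, line in enumerate(fp):
                if i % world_size == rank:   # ...but only 1/world_size is kept
                    yield line
```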
Ok. A couple of questions:
I can do that. Do you think it is a must-have given that sentences are all shuffled already?
It is the latter. Note that the sentences within a file will be shuffled differently in different epochs. By the way, I made the figure a bit confusing: the two graphs are not comparable on PPL because one uses a dropout of 0.1 and the other uses 0.05. The new version is actually converging faster.
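A per-epoch shuffle along these lines would give each epoch a different sentence order within a shard; the helper below is a hypothetical sketch, not the repository's actual loader:

```python
import random

def load_shard(all_files, rank, world_size, epoch, seed=1234):
    """Read this rank's files and shuffle their sentences with an
    epoch-dependent seed, so the order differs from epoch to epoch."""
    rng = random.Random(seed + epoch)        # hypothetical seeding scheme
    my_files = [f for i, f in enumerate(all_files) if i % world_size == rank]
    sentences = []
    for path in my_files:
        with open(path) as fp:
            sentences.extend(fp.readlines())
    rng.shuffle(sentences)                   # different permutation each epoch
    return sentences
```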
Ah, got it. Nice! :) Ok, ignore my thoughts on shuffling the files then please :)
🎉
Data loading was inefficient and turned out to be the bottleneck of the BILLION WORD training.
This PR rewrites the sharding (i.e. which data goes to which GPU / training process) and improves the training speed significantly.
The figure compares a previous run and a new test run: we see a 40% reduction in training time.

This means our reported training efficiency will be much stronger: from 59 GPU days down to 36 GPU days, which is 4x more efficient than the FairSeq Transformer results.