Speed up data loading / batching for ONE BILLION WORD experiment #169

Merged — 2 commits merged into 3.0.0-dev on Mar 5, 2021

Conversation

@taoleicn (Contributor) commented on Mar 4, 2021

The data loading was inefficient and turned out to be the bottleneck of BILLION WORD training. This PR rewrites the sharding (i.e. which data goes to which GPU / training process) and improves the training speed significantly.

The figure compares a previous run with a new test run. We see a 40% reduction in training time.

This means our reported training efficiency will be much stronger: from 59 GPU days down to 36 GPU days, which is 4x more efficient than the FairSeq Transformer results.

@taoleicn requested a review from @hpasapp on Mar 4, 2021, 14:56
@hpasapp (Collaborator) commented on Mar 4, 2021

Can you comment a little on how the new partitioning scheme speeds up the data loading, please? I see that with the new approach each node only sees a subset of the data, and it is always the same subset (because of the removal of the shuffle and the modulus over the number of nodes). How does this speed up data loading? Does each node cache data it has loaded previously in some way? Does the reduction in randomness mean that convergence on a per-step basis takes a little longer?

@taoleicn (Contributor, Author) commented on Mar 4, 2021

@hpasapp The 1B dataset is partitioned into 100 files, each containing randomly shuffled sentences. Say we have 8 GPUs / training processes and we'd like to distribute the data for multi-GPU training. There are two options, for example:

  1. distribute the 100 files across the 8 training processes;
  2. each training process still reads all 100 files sequentially, but within a file the sentences are distributed across the 8 processes.

The previous implementation was option 2, but as you can see, it introduces a lot more I/O because a process has to read an entire file while only using 1/8 of its data (specifically, I believe it has to read a new file every 100-200 batches).

The PR uses option 1, which indeed increases compute intensity and reduces the training time by a lot. A minimal sketch of the two options follows below.
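
To make the difference in I/O concrete, here is a rough sketch of the two options. The function names, the one-sentence-per-line file format, and the rank/world-size arguments are assumptions for illustration only, not the actual code in data_utils.py:

```python
# Hypothetical sketch of the two sharding options (not the real data_utils.py code).
# Assumes each shard file stores one sentence per line.

def option_1_file_level(shard_files, rank, world_size):
    """Option 1 (this PR): each process opens only every world_size-th file,
    so it touches roughly 1/world_size of the total bytes on disk."""
    my_files = [f for i, f in enumerate(shard_files) if i % world_size == rank]
    for path in my_files:
        with open(path) as fh:
            for line in fh:
                yield line.rstrip("\n")

def option_2_sentence_level(shard_files, rank, world_size):
    """Option 2 (previous behavior): every process reads every file,
    but keeps only every world_size-th sentence -- 8x more I/O with 8 GPUs."""
    for path in shard_files:
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % world_size == rank:
                    yield line.rstrip("\n")
```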

@hpasapp (Collaborator) commented on Mar 4, 2021

OK. A couple of questions:

  1. If I've understood correctly, in the earlier code the files would be loaded in random order, but in the current PR the files are always loaded in the same order? Could it be worth adding a shuffle after the files have been partitioned over nodes, i.e. somewhere around line 436?
  2. Is the slowness of the earlier version:
    • mostly because copying all files to all nodes is slow, or
    • because reading every 8th record (or similar) within each file is slow?
      If it is the latter, could it be worth shuffling the files deterministically, using a per-epoch seed, before partitioning over nodes, so that the data is a bit more random each epoch? (I notice in the two graphs above that the graph for the newer version appears to converge slightly more slowly. Not sure if this is variance, or whether the data are being presented less randomly?)

@taoleicn (Contributor, Author) commented on Mar 4, 2021

@hpasapp

Could it be worth adding a shuffle after the files have been partitioned over nodes, ie somewhere around line 436?

I can do that. Do you think it is a must-have given that sentences are all shuffled already?

Is the slowness of the earlier version

It is the latter. Note that the sentences within a file are re-shuffled in different epochs:
https://github.com/asappresearch/sru/blob/3.0.0-dev/experiments/srupp_experiments/data_utils.py#L425-L426
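
For reference, the behavior at the linked lines is roughly the following. This is only a hedged sketch of the idea, with assumed names and seed handling, not the code at the link:

```python
import random

def load_shard(path, epoch, base_seed=1234):
    """Hypothetical illustration: re-shuffle the sentences of one shard file
    with an epoch-dependent seed, so every epoch sees a different order while
    the result stays deterministic for a given (seed, epoch) pair."""
    with open(path) as fh:
        sentences = fh.read().splitlines()
    random.Random(base_seed + epoch).shuffle(sentences)
    return sentences
```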

By the way, I made the figure a bit confusing: the two graphs are not comparable on PPL because one run uses a dropout of 0.1 and the other uses 0.05. The new version actually converges faster.

@hpasapp (Collaborator) commented on Mar 4, 2021

By the way, I made the figure a bit confusing: the two graphs are not comparable on PPL because one run uses a dropout of 0.1 and the other uses 0.05. The new version actually converges faster.

Ah, got it. Nice! :) OK, ignore my thoughts on shuffling the files then, please :)

@taoleicn (Contributor, Author) commented on Mar 5, 2021

🎉

@taoleicn merged commit 81a657b into 3.0.0-dev on Mar 5, 2021