Speed up data loading / batching for ONE BILLION WORD experiment #169

Merged — 2 commits merged into 3.0.0-dev on Mar 5, 2021

Conversation

@taoleicn (Contributor) commented on Mar 4, 2021

The data loading was inefficient and turned out to be the bottleneck of BILLION WORD training. This PR rewrites the sharding (i.e. which data goes to which GPU / training process) and improves the training speed significantly.

The figure compares a previous run with a new test run. We see a 40% reduction in training time.

This means our reported training efficiency will be much stronger: from 59 GPU days down to 36 GPU days, which is 4x more efficient than the FairSeq Transformer results.

@taoleicn requested a review from @hpasapp on Mar 4, 2021, 14:56
@hpasapp (Collaborator) commented on Mar 4, 2021

Can you comment a little on how the new partitioning scheme speeds up the data loading, please? I see that with the new approach each node only sees a subset of the data, and it is always the same subset (because of the removal of the shuffle and the modulus over the number of nodes). How does this speed up data loading? Does each node cache data it has loaded previously in some way? Does the reduction in randomness mean that convergence on a per-step basis takes a little longer?

@taoleicn (Contributor, Author) commented on Mar 4, 2021

@hpasapp The 1B dataset is partitioned into 100 files, each containing randomly shuffled sentences. Say we have 8 GPUs / training processes and we'd like to distribute the data for multi-GPU training. There are two options, for example:

  1. distribute the 100 files across the 8 training processes;
  2. each training process still reads all 100 files sequentially, but within a file the sentences are distributed across the 8 processes.

The previous implementation was option 2, but as you can see, it introduces a lot more I/O because a process has to read an entire file while only using 1/8 of its data (specifically, I believe it has to read a new file every 100-200 batches).

The PR uses option 1, which indeed increases compute intensity and reduces the training time by a lot. A minimal sketch of the two options follows below.
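
To make the difference in I/O concrete, here is a rough sketch of the two options. The function names, the one-sentence-per-line file format, and the rank/world-size arguments are assumptions for illustration only, not the actual code in data_utils.py:

```python
# Hypothetical sketch of the two sharding options (not the real data_utils.py code).
# Assumes each shard file stores one sentence per line.

def option_1_file_level(shard_files, rank, world_size):
    """Option 1 (this PR): each process opens only every world_size-th file,
    so it touches roughly 1/world_size of the total bytes on disk."""
    my_files = [f for i, f in enumerate(shard_files) if i % world_size == rank]
    for path in my_files:
        with open(path) as fh:
            for line in fh:
                yield line.rstrip("\n")

def option_2_sentence_level(shard_files, rank, world_size):
    """Option 2 (previous behavior): every process reads every file,
    but keeps only every world_size-th sentence -- 8x more I/O with 8 GPUs."""
    for path in shard_files:
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % world_size == rank:
                    yield line.rstrip("\n")
```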

@hpasapp (Collaborator) commented on Mar 4, 2021

OK. A couple of questions:

  1. If I've understood correctly, in the earlier code the files would be loaded in random order, but in the current PR the files are always loaded in the same order? Could it be worth adding a shuffle after the files have been partitioned over nodes, i.e. somewhere around line 436?
  2. Is the slowness of the earlier version:
    • mostly because copying all files to all nodes is slow, or
    • because reading every 8th record (or similar) within each file is slow?
      If it is the latter, could it be worth shuffling the files deterministically, using a per-epoch seed, before partitioning over nodes, so that the data is a bit more random each epoch? (I notice in the two graphs above that the graph for the newer version appears to converge slightly more slowly. Not sure if this is variance, or whether the data are being presented less randomly?)

@taoleicn (Contributor, Author) commented on Mar 4, 2021

@hpasapp

Could it be worth adding a shuffle after the files have been partitioned over nodes, ie somewhere around line 436?

I can do that. Do you think it is a must-have given that sentences are all shuffled already?

Is the slowness of the earlier version

It is the latter. Note that the sentences within a file are re-shuffled in different epochs:
https://github.com/asappresearch/sru/blob/3.0.0-dev/experiments/srupp_experiments/data_utils.py#L425-L426
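
For reference, the behavior at the linked lines is roughly the following. This is only a hedged sketch of the idea, with assumed names and seed handling, not the code at the link:

```python
import random

def load_shard(path, epoch, base_seed=1234):
    """Hypothetical illustration: re-shuffle the sentences of one shard file
    with an epoch-dependent seed, so every epoch sees a different order while
    the result stays deterministic for a given (seed, epoch) pair."""
    with open(path) as fh:
        sentences = fh.read().splitlines()
    random.Random(base_seed + epoch).shuffle(sentences)
    return sentences
```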

By the way, I made the figure a bit confusing: the two graphs are not comparable on PPL because one run uses a dropout of 0.1 and the other uses 0.05. The new version actually converges faster.

@hpasapp (Collaborator) commented on Mar 4, 2021

By the way, I made the figure a bit confusing: the two graphs are not comparable on PPL because one run uses a dropout of 0.1 and the other uses 0.05. The new version actually converges faster.

Ah, got it. Nice! :) OK, ignore my thoughts on shuffling the files then, please :)

@taoleicn (Contributor, Author) commented on Mar 5, 2021

🎉

@taoleicn merged commit 81a657b into 3.0.0-dev on Mar 5, 2021