Experience from "Accurate, Large Minibatch SGD" #22
According to the Facebook paper, there are several implementation details to be adjusted (rough sketches for each item follow the list):
1. Momentum correction. In our implementation, we used equation (10) without momentum correction. We should either add momentum correction or switch to equation (9).
2. Gradient aggregation. In our implementation, we use either weight averaging (avg) or gradient summing (cdd), neither of which normalizes the per-worker loss by the total minibatch size kn; both normalize by the per-worker size n only. We should consider averaging the gradients across workers and scaling up the lr.
3. Learning rate gradual warm-up and linear scaling. The reason we didn't scale up the lr was that, when I tried it, gradients exploded at the beginning of training for VGG16, even with a small number of workers. Note that gradual warm-up increases the lr on every iteration rather than every epoch.
4. Batch Normalization parameters. According to the paper, "the BN statistics should not be computed across all workers". We should explicitly exclude those BN statistics from parameter exchanging.
5. Use HeNormal initialization for ConvLayers and Normal for the last FCLayer. Set gamma to 0 for the last BN of each Residual Block.
6. Do multiple trials to report random variation: take the median error of the final 5 epochs of each run, and report the mean and standard deviation of that error over 5 independent runs. Each run is 90 epochs, with the lr divided by 10 at epochs 30, 60 and 80.
7. Use scale and aspect-ratio data augmentation, and normalize images by the per-color mean and std.
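For item 1, a minimal NumPy sketch of what momentum correction would look like, assuming the equation (10) form of the update (lr folded into the velocity); the function and variable names are illustrative, not Theano-MPI code:

```python
import numpy as np

def sgd_step_with_momentum_correction(w, v, grad, lr, prev_lr, momentum=0.9):
    """One eq. (10)-style momentum SGD update with momentum correction.

    w, v, grad : parameter, velocity and aggregated-gradient arrays.
    lr, prev_lr: learning rate for this step and for the previous step.
    When lr == prev_lr this reduces to the plain eq. (10) update.
    """
    # Momentum correction: rescale the stale velocity when the lr changes,
    # so the update stays equivalent to the eq. (9) formulation.
    v = momentum * (lr / prev_lr) * v + lr * grad
    w = w - v
    return w, v
```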
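For item 2, a small sketch (with made-up shapes) of why the normalization matters: if each of the k workers already divides its loss by its own minibatch size n, averaging the per-worker gradients gives the 1/(kn)-normalized gradient the linear scaling rule assumes, while summing them makes the effective step k times larger:

```python
import numpy as np

k, n, dim = 4, 32, 10                       # workers, per-worker batch, param size
per_worker_grads = [np.random.randn(dim) for _ in range(k)]   # each already divided by n

# Averaging across workers == normalizing the total loss by k*n, which is what
# the paper assumes when it scales the lr linearly with k.
avg_grad = sum(per_worker_grads) / k

# Summing across workers (the current "cdd" style) keeps only the 1/n normalization,
# i.e. it is equivalent to averaging with a k-times larger effective lr.
sum_grad = sum(per_worker_grads)
assert np.allclose(sum_grad, k * avg_grad)
```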
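For items 3 and 6, a sketch of the schedule, assuming a 5-epoch warm-up as in the paper; base_lr, k and iters_per_epoch are placeholders, not our configuration:

```python
def learning_rate(iteration, iters_per_epoch, base_lr=0.1, k=8, warmup_epochs=5):
    """Gradual warm-up + linear scaling + step decay (a sketch, not Theano-MPI code)."""
    target_lr = k * base_lr                        # linear scaling rule: lr grows with k
    epoch = float(iteration) / iters_per_epoch
    if epoch < warmup_epochs:
        # Increase the lr a little on *every iteration*, not once per epoch.
        warmup_iters = float(warmup_epochs * iters_per_epoch)
        return base_lr + (target_lr - base_lr) * iteration / warmup_iters
    # After warm-up: divide by 10 at epochs 30, 60 and 80 of a 90-epoch run.
    n_decays = sum(epoch >= boundary for boundary in (30, 60, 80))
    return target_lr / (10.0 ** n_decays)
```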
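For item 4, a sketch of keeping BN statistics local, assuming parameters can be filtered by name; the naming convention here is hypothetical, not Theano-MPI's:

```python
def exchangeable_params(params):
    """Drop per-worker BN statistics (running mean/var) from the exchange list.

    params is a list of (name, shared_variable) pairs. The learnable gamma/beta
    can still be exchanged; only the statistics must stay local to each worker.
    """
    excluded_suffixes = ('bn_mean', 'bn_var')      # hypothetical naming convention
    return [(name, p) for name, p in params if not name.endswith(excluded_suffixes)]
```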
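For item 5, a NumPy sketch of the initializers, assuming the paper's choices (He-normal with std sqrt(2/fan_in) for conv weights, a zero-mean Gaussian with std 0.01 for the final FC layer, gamma = 0 for the last BN of a residual block):

```python
import numpy as np

rng = np.random.RandomState(1234)

def he_normal_conv(n_in_channels, kh, kw, n_out_channels):
    """He et al. initialization for a conv filter bank: std = sqrt(2 / fan_in)."""
    fan_in = n_in_channels * kh * kw
    return rng.normal(0.0, np.sqrt(2.0 / fan_in),
                      size=(n_out_channels, n_in_channels, kh, kw)).astype('float32')

def normal_fc(n_in, n_out, std=0.01):
    """Zero-mean Gaussian for the last fully connected (classifier) layer."""
    return rng.normal(0.0, std, size=(n_in, n_out)).astype('float32')

def last_block_bn_params(n_channels):
    """Last BN of each residual block: gamma starts at 0 so the block initially
    behaves like an identity mapping; beta starts at 0 as usual."""
    gamma = np.zeros(n_channels, dtype='float32')
    beta = np.zeros(n_channels, dtype='float32')
    return gamma, beta
```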
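For the reporting part of item 6, a tiny sketch with made-up numbers:

```python
import numpy as np

# val_error[run, epoch]: validation error of 5 independent 90-epoch runs (made-up data).
val_error = np.random.rand(5, 90)

per_run = np.median(val_error[:, -5:], axis=1)   # median error of the final 5 epochs
print('error: %.3f +/- %.3f' % (per_run.mean(), per_run.std()))
```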
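For item 7, a sketch of the scale and aspect-ratio crop plus per-channel normalization; the area range, aspect-ratio range and the mean/std values below are the commonly used ImageNet ones, written here as assumptions rather than our settings:

```python
import numpy as np

def random_sized_crop_params(height, width, rng,
                             area_range=(0.08, 1.0), ratio_range=(3 / 4.0, 4 / 3.0)):
    """Sample a crop box with random area and aspect ratio; the caller would
    crop the image and then resize the patch to the network input size (224x224)."""
    for _ in range(10):                            # retry a few times, then fall back
        area = rng.uniform(*area_range) * height * width
        ratio = rng.uniform(*ratio_range)
        crop_w = int(round(np.sqrt(area * ratio)))
        crop_h = int(round(np.sqrt(area / ratio)))
        if crop_w <= width and crop_h <= height:
            top = rng.randint(0, height - crop_h + 1)
            left = rng.randint(0, width - crop_w + 1)
            return top, left, crop_h, crop_w
    side = min(height, width)
    return 0, 0, side, side                        # fallback: a square crop

def normalize_per_channel(img, mean, std):
    """img is HxWx3 float32 in [0, 1]; mean/std are per-color statistics."""
    return (img - mean) / std

# Example per-channel statistics (the usual ImageNet values, an assumption here):
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype='float32')
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype='float32')
```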
On the HPC side, the three-phase allreduce "NCCL(reduction) -> MPI_Allreduce -> NCCL(broadcast)" mentioned in the paper can possibly be replaced by a single NCCL2 operation. Or do we need to make a Python binding of Gloo?
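As an illustration of the single-collective replacement (using mpi4py's Allreduce as a stand-in; NCCL2's allreduce would play the same role directly between GPUs), the three phases collapse into one averaging allreduce. This is a sketch, not the exchanger code:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
k = comm.Get_size()

local_grad = np.random.randn(1000).astype('float32')    # flattened per-worker gradient
global_grad = np.empty_like(local_grad)

# One collective replaces NCCL(reduce) -> MPI_Allreduce -> NCCL(broadcast).
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= k                                         # average across the k workers
```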
The parallel communication idea mentioned in section 4 of the paper (overlapping gradient aggregation with backprop) needs support from Theano. Currently, computation and communication run serially in Theano-MPI.