Multi-GPU training #9092
Hi, for really big models in general, we recommend trying out fully sharded data parallel (FSDP) or DeepSpeed (we offer both as plugins, enabled by just switching arguments). Regarding your questions:

2.) That should be possible, but it isn't recommended. In general we recommend DDP over DP in almost all cases since it's faster and comes with fewer restrictions.

3.) We (and afaik apex as well) currently do not support true fp16 training, since some operations in torch are not safe to run in fp16. Apex and native PyTorch provide AMP (Automatic Mixed Precision) implementations instead. That means the layers stay in fp32 and only the underlying operations are carried out in fp16 where appropriate, so you won't be able to notice it by checking the layers' dtype.

4.) That should definitely be possible, but you probably don't need it. I recommend really having a look at DeepSpeed or sharded data parallel before deciding to use that; it would come with some severe restrictions and performance impacts.
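To illustrate the "switching arguments" point, here is a minimal sketch of how these plugins are typically selected on the `Trainer`. The exact argument names depend on your Lightning version (older releases used `gpus=`/`accelerator=`/`plugins=` instead of `devices=`/`strategy=`), so treat this as an assumption to check against the docs for your version:

```python
import pytorch_lightning as pl

# Plain DDP, recommended over DP in almost all cases
trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="ddp", precision=16)

# Fully sharded data parallel (FSDP)
trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="fsdp", precision=16)

# DeepSpeed, e.g. ZeRO stage 2
trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="deepspeed_stage_2", precision=16)
```

`precision=16` here enables AMP as described in point 3: parameters remain fp32 and individual ops are cast to fp16 where it is safe.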
Hi!
Thanks for your lib, it's quite impressive. I'm trying to train a big model using PyTorch Lightning and have a question. I want to train the model with a large batch size that is too big to fit on one GPU, but I also want to calculate the cross-entropy loss over the whole batch. As I've read in the docs, there is a way to do this with "training_step_end"; however, it's only used with dp training. Is there any way to calculate the cross-entropy loss over the whole batch using ddp training? And if not, is there a way to use apex with dp training? Because when I use the torch backend for fp16 training with FusedAdam from apex, there are no fp16 layers at all. So there are 4 main questions:
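For reference, the dp/"training_step_end" pattern mentioned above looks roughly like the sketch below (based on the Lightning docs; the module and dictionary keys are placeholders, not a fixed API):

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # Under dp, this runs independently on each GPU's slice of the batch
        x, y = batch
        logits = self(x)
        return {"logits": logits, "target": y}

    def training_step_end(self, step_outputs):
        # Under dp, the per-GPU outputs are gathered back onto the main device,
        # so this cross-entropy is computed over the full batch
        loss = F.cross_entropy(step_outputs["logits"], step_outputs["target"])
        return loss
```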