Multi-GPU training #9092
Hi, for really big models in general, we recommend trying out fully sharded data parallel (FSDP) or DeepSpeed (we offer both as plugins, enabled by just switching arguments). Regarding your questions:

2.) That should be possible, but it isn't recommended. In general we recommend DDP over DP in almost all cases since it's faster and comes with fewer restrictions.

3.) We (and afaik apex as well) currently do not support true fp16 training, since some operations in torch are not safe to run in fp16. Apex and native PyTorch provide AMP (Automatic Mixed Precision) implementations instead. That means the layers stay in fp32 and only the underlying operations are carried out in fp16 where appropriate, so you won't be able to notice it by checking the layers' dtype.

4.) That should definitely be possible, but you probably don't need it. I recommend really having a look at DeepSpeed or sharded data parallel before deciding to use that; it would come with some severe restrictions and performance impacts.
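To illustrate the "switching arguments" point, here is a minimal sketch of how these plugins are typically selected on the `Trainer`. The exact argument names depend on your Lightning version (older releases used `gpus=`/`accelerator=`/`plugins=` instead of `devices=`/`strategy=`), so treat this as an assumption to check against the docs for your version:

```python
import pytorch_lightning as pl

# Plain DDP, recommended over DP in almost all cases
trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="ddp", precision=16)

# Fully sharded data parallel (FSDP)
trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="fsdp", precision=16)

# DeepSpeed, e.g. ZeRO stage 2
trainer = pl.Trainer(devices=4, accelerator="gpu", strategy="deepspeed_stage_2", precision=16)
```

`precision=16` here enables AMP as described in point 3: parameters remain fp32 and individual ops are cast to fp16 where it is safe.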
Hi!
Thanks for your lib, it's quite impressive. I'm trying to train a big model using PyTorch Lightning and have a question. I want to train the model with a large batch size that is too big to fit on one GPU, but I also want to calculate the cross-entropy loss over the whole batch. As I've read in the docs, there is a way to do this with "training_step_end"; however, it's only used with dp training. Is there any way to calculate the cross-entropy loss over the whole batch using ddp training? And if not, is there a way to use apex with dp training? Because when I use the torch backend for fp16 training with FusedAdam from apex, there are no fp16 layers at all. So there are 4 main questions:
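For reference, the dp/"training_step_end" pattern mentioned above looks roughly like the sketch below (based on the Lightning docs; the module and dictionary keys are placeholders, not a fixed API):

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # Under dp, this runs independently on each GPU's slice of the batch
        x, y = batch
        logits = self(x)
        return {"logits": logits, "target": y}

    def training_step_end(self, step_outputs):
        # Under dp, the per-GPU outputs are gathered back onto the main device,
        # so this cross-entropy is computed over the full batch
        loss = F.cross_entropy(step_outputs["logits"], step_outputs["target"])
        return loss
```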