Runtime error when attempting to use data distributed parallel #19
Comments
@Phirefly9 yes, I believe I ran into a related RevTorch error yesterday as well. Could you report this to the RevTorch issues so the author can chew on it? |
@Phirefly9 oh, nevermind, I think they are unrelated |
I'll try to create a minimal example without lightning, just hitting revtorch, and then open one, assuming that is the problem area. |
Yea, it's related to how RevTorch manually handles the backward pass |
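For context, here is a minimal sketch of the reversible-residual trick RevTorch implements (following the general RevNet formulation; this is not RevTorch's actual code). The backward pass reconstructs each block's inputs from its outputs instead of storing activations, and re-runs the sub-modules under autograd by hand. That manual gradient handling outside the normal autograd graph is the kind of thing that can clash with DistributedDataParallel's gradient hooks.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Sketch of a reversible residual block: y1 = x1 + F(x2); y2 = x2 + G(y1)."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        with torch.no_grad():            # activations are not kept by autograd
            y1 = x1 + self.f(x2)
            y2 = x2 + self.g(y1)
        return y1, y2

    def backward_pass(self, y1, y2, dy1, dy2):
        # Re-run G with grad enabled to get its parameter grads and d(loss)/d(y1).
        y1 = y1.detach().requires_grad_()
        with torch.enable_grad():
            gy1 = self.g(y1)
        gy1.backward(dy2)
        with torch.no_grad():
            x2 = y2 - gy1                # reconstruct x2 from the outputs
            dx1 = dy1 + y1.grad
        # Re-run F with grad enabled to get its parameter grads and d(loss)/d(x2).
        x2 = x2.detach().requires_grad_()
        with torch.enable_grad():
            fx2 = self.f(x2)
        fx2.backward(dx1)
        with torch.no_grad():
            x1 = y1 - fx2                # reconstruct x1
            dx2 = dy2 + x2.grad
        return x1.detach(), x2.detach(), dx1, dx2
```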
@Phirefly9 I will try to integrate https://github.com/silvandeleemput/memcnn today and see if we can solve these issues that way. worst case, I become an expert with custom backprop and roll my own lol |
Well, I come back confused. My experiment with the example revtorch code actually didn't produce the same error, so I've written a different distributed training script for reformer without lightning that is able to train. This one uses Nvidia Apex so I could test half precision at the same time. Optimizer levels other than 'O0' produce the following error (here is the error for 'O1'), so half precision is currently not working.
I recommend using the nvidia pytorch container because it has apex installed already
|
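For readers unfamiliar with the setup being discussed: the opt_level argument to Apex selects the mixed-precision mode ('O0' = pure FP32, 'O1'/'O2' = mixed precision, 'O3' = pure FP16). A hedged sketch of the wiring (the ReformerLM kwargs below are illustrative, not the script from this thread):

```python
import torch
from apex import amp
from reformer_pytorch import ReformerLM   # model kwargs below are illustrative

model = ReformerLM(num_tokens=256, dim=512, depth=6, max_seq_len=1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Anything other than 'O0' was failing at this point in the thread.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

x = torch.randint(0, 256, (4, 1024)).cuda()
loss = model(x).sum()                      # placeholder loss, just to exercise backward

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```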
@Phirefly9 oh, that's a completely new error, and I understand why it doesn't work. I can put in a fix for that soon! (the rotation matrix for calculating LSH needs to be cast to half precision as well) |
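The fix being described is a dtype issue: under mixed precision the queries/keys arrive as float16, so the random rotation matrix used for LSH bucketing has to share their dtype or the projection fails. A minimal sketch of the idea (not the actual commit):

```python
import torch

def lsh_buckets(qk: torch.Tensor, n_buckets: int) -> torch.Tensor:
    """Random-rotation LSH bucketing sketch. qk: (batch, seq_len, dim).

    Creating the rotations in qk's dtype (and on its device) is what keeps
    this working when Apex O1/O2 hands us float16 inputs.
    """
    rotations = torch.randn(qk.shape[-1], n_buckets // 2,
                            device=qk.device, dtype=qk.dtype)
    rotated = qk @ rotations                        # (batch, seq_len, n_buckets // 2)
    rotated = torch.cat([rotated, -rotated], dim=-1)
    return rotated.argmax(dim=-1)                   # bucket id per position
```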
@Phirefly9 79974b4 can you try again? |
seems the same?
|
@Phirefly9 just made another commit, can you upgrade and try again? |
I was able to train using the O1 optimization level, which is the standard one APEX typically recommends. I upped it to O2 and got
|
Apex is not very seamless lol, ok I'll look into it |
Oh, my error lol, ok fixed. Can you try again? |
O2 and O3 both run through on that commit. Nice work! I would be fine closing this issue on that. I don't know what the deal with pytorch-lightning is, but I think we are both in agreement it is probably down in revtorch, so I can play with lightning some more and see if I can isolate the issue in revtorch using it. |
Robin from RevTorch got back to me saying he has little time. If you can figure out what's wrong with his implementation, let me know. I'm going to investigate memcnn in the meanwhile, although that repository will require a PR as well to split on the right dimension, even if it does work. |
To be honest, custom backprop scares me lol |
@Phirefly9 RobinBruegger/RevTorch#8 may fix the issue, but I'm not entirely sure. have you tried setting the |
I'm sure it will work if I add that flag. I'm thinking it's just a lightning bug; I've trained revtorch's example using lightning and it worked in distributed, and I've been looking all over the code and don't see anything. At this point I think it's a bug with lightning. If I can't find the issue after some more searching I'll open a ticket with them. |
@Phirefly9 Robin and I fixed an issue with RevTorch to allow for multiple backward passes. do you think you could try the above again and see if it incidentally fixed your issue? |
It did not unfortunately. I've opened up an issue on pytorch-lightning and hope to hear from them soon |
I would also like to train Reformer from this repo with DistributedDataParallel. Is the current workaround to use DistributedDataParallel from Apex as a drop-in replacement for the pytorch implementation, or is it sufficient to call amp.initialize(model, optim, opt_level='O0') and proceed with the pytorch implementation of DistributedDataParallel as shown in the code above? |
We may be able to use DistributedDataParallel, but I am currently trying to utilize Microsoft's new DeepSpeed library for distributed training |
@fcampagne I would recommend distributeddataparallel, you will find it's faster in most cases. Deepspeed is probably the new standard though. It integrates APEX and DistributedDataParallel as well as other improvements. The other benefit is that it is usually only a one-line change from a single-GPU pytorch script |
I answered my own question and found that it is necessary to use DDP from Apex (i.e., from apex.parallel import DistributedDataParallel as DDP) instead of the pytorch implementation. |
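A hedged sketch of the swap being described, assuming the usual torch.distributed.launch setup (the ReformerLM kwargs are again illustrative):

```python
import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as DDP   # Apex DDP, not torch.nn.parallel
from reformer_pytorch import ReformerLM                     # illustrative model

# assumes launch via `python -m torch.distributed.launch --nproc_per_node=<gpus> train.py`
dist.init_process_group(backend='nccl')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = ReformerLM(num_tokens=256, dim=512, depth=6, max_seq_len=1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model = DDP(model)   # Apex's wrapper handles the gradient all-reduce itself
```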
I looked at DeepSpeed as well; it looks good, but it is stuck on pytorch 1.2 as far as its supported dependencies go. We're on 1.4 already. |
@Phirefly9 @zbloss @justindujardin @fcampagne Guys! I got DeepSpeed working with Reformer after the latest Reversible Net changes! It's blazing fast! (using it in place of DataParallel locally) |
I'm not sure about distributed, but the parallelism Deepspeed provided even on my two GPUs at home is worlds faster. You can follow the example at https://github.com/lucidrains/reformer-pytorch/tree/master/examples/enwik8_deepspeed
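The linked enwik8_deepspeed example is the authoritative version; as a quick orientation, the DeepSpeed wiring generally looks like the hedged sketch below (config values and model kwargs are illustrative only):

```python
# run with: deepspeed train.py --deepspeed --deepspeed_config ds_config.json
import argparse
import torch
import deepspeed
from reformer_pytorch import ReformerLM

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)
parser = deepspeed.add_config_arguments(parser)   # adds --deepspeed, --deepspeed_config
args = parser.parse_args()

model = ReformerLM(num_tokens=256, dim=512, depth=6, max_seq_len=1024)

# deepspeed.initialize wraps the model, builds the optimizer from the JSON
# config, and sets up data parallelism — the "one line change" mentioned above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

x = torch.randint(0, 256, (4, 1024)).to(model_engine.local_rank)
loss = model_engine(x).sum()                      # placeholder loss
model_engine.backward(loss)
model_engine.step()
```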
Closing because of independent replication of Deepspeed training in another issue.
Thank you for putting in the time to do this. I have a bunch of ideas for it.
I crudely ported your example training script to the pytorch-lightning library, and when I attempted to use distributed data parallel I ran into a crash. The problem may be down in the revtorch library, but I want to hand the script off to you while I report it, so you can play with it, take a look, and decide where the issue is.
You can reproduce the crash by supplying the --distributed flag to the script with any number of GPUs.
script:
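The original script was not preserved in this thread; below is a minimal hedged sketch of what such a pytorch-lightning wrapper around reformer-pytorch might look like, using the Lightning API of that era (distributed_backend='ddp'). The model kwargs and the random token data are illustrative, not the reporter's actual setup.

```python
import argparse
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from reformer_pytorch import ReformerLM

class ReformerTrainer(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = ReformerLM(num_tokens=256, dim=512, depth=6, max_seq_len=1024)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        logits = self(x)
        # next-token prediction loss on the random stand-in data
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)

    def train_dataloader(self):
        data = torch.randint(0, 256, (64, 1024))   # random tokens stand in for real data
        return DataLoader(TensorDataset(data), batch_size=4)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--distributed', action='store_true')
    args = parser.parse_args()
    trainer = pl.Trainer(
        gpus=torch.cuda.device_count() if args.distributed else 1,
        distributed_backend='ddp' if args.distributed else None,
        max_epochs=1)
    trainer.fit(ReformerTrainer())
```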