[BUG] RuntimeError in minimalistic example #87

Closed
yuxiaooye opened this issue Jul 27, 2023 · 3 comments

@yuxiaooye

Thanks for the amazing repo! I encountered a bug when running the minimalistic example given in the README:

Describe the bug

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

(screenshot of the traceback)

To Reproduce

minimalistic example in README.md

Questions

I found that the error can be fixed by adding DDPStrategy(find_unused_parameters=True), as the error message suggests:

import lightning as L
from lightning.pytorch.strategies import DDPStrategy

trainer = L.Trainer(
    max_epochs=1,  # only a few epochs
    accelerator="gpu",  # use GPU if available, or e.g. "cpu" otherwise
    logger=None,  # can be replaced with WandbLogger, TensorBoardLogger, etc.
    precision="16-mixed",  # Lightning will handle faster training with mixed precision
    gradient_clip_val=1.0,  # clip gradients to avoid exploding gradients
    reload_dataloaders_every_n_epochs=1,  # necessary for sampling new data
    strategy=DDPStrategy(find_unused_parameters=True),  # suggested fix; can we add this to the example?
)

But I'm wondering whether training still works correctly with this change. Does DDPStrategy(find_unused_parameters=True) introduce any negative side effects?
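
For reference, the error message also mentions an equivalent string form of the same setting. A minimal sketch of that alternative (only the relevant arguments are shown, and I have not tested this variant):

import lightning as L

# Alternative from the error message: pass the strategy as a registered string
# instead of constructing a DDPStrategy object; the two should be equivalent.
trainer = L.Trainer(
    max_epochs=1,
    accelerator="gpu",
    strategy="ddp_find_unused_parameters_true",
)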

@yuxiaooye yuxiaooye added the bug Something isn't working label Jul 27, 2023
fedebotu added a commit that referenced this issue Aug 1, 2023
@fedebotu
Member

fedebotu commented Aug 1, 2023

Hi there!
To answer your question: there is no effect in terms of accuracy. There may be some minor overhead, as described here, and this may be solved in future TorchRL versions.
The minimalistic example was not tested with multiple GPUs (which I believe is your case), since we normally train with Hydra, which sets that up automatically. We have now included better automatic DDP handling (which sets that parameter automatically, as RL environments require it), and you may use the newer RL4COTrainer; see the new minimalistic example here. We will release it soon with v0.1.1; in the meantime, you can already use the nightly version with pip install git+https://github.com/kaist-silab/rl4co.
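
Roughly, the new usage would look like the sketch below; the import path and constructor arguments are assumptions rather than a copy of the README, since RL4COTrainer should accept the usual Lightning Trainer keyword arguments:

# Sketch only: the import path and arguments are assumptions, not copied from the README.
from rl4co.utils.trainer import RL4COTrainer

trainer = RL4COTrainer(
    max_epochs=1,
    accelerator="gpu",
    precision="16-mixed",
    gradient_clip_val=1.0,
)
trainer.fit(model)  # `model` is the RL4CO module from the minimalistic example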

Thanks for spotting the bug ;)

@yuxiaooye
Author

I see! Thanks for your reply!

@fedebotu
Member

fedebotu commented Aug 2, 2023

Marking as closed. Note that we released v0.1.1!

@fedebotu fedebotu closed this as completed Aug 2, 2023