Regression between Lightning 1.1.3 and 1.1.5 #5656
Comments
I ran the simple image classifier with fp32/fp16 using PyTorch Lightning 1.1.3 vs 1.1.5 and was unable to reproduce this bug. My hunch, given history, is that this is similar to the GradScaler bug we had with the LightningOptimizer a while back. There are only a handful of commits I can see causing this issue between 1.1.3/1.1.5:

Optimizer:

Logging:

I'll sync with the NVIDIA team to see if we can get a reproducible example on our end to debug per commit.
We're now able to get the accuracy back by forcing `find_unused_parameters=True`.
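For reference, a minimal sketch of how the flag can be forced via the plugin interface discussed later in this thread; the `gpus`/`accelerator` values are placeholders for whatever the original run used.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Explicitly re-enable unused-parameter detection, overriding the new default.
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=True)],
)
```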
Sorry about that. We changed the default, and we mentioned it in the changelog, but I did not know it had an influence on the model performance.
I think it was mentioned in the changelog, but we definitely should've made it clearer. Thanks for the digging @ericharper! We need some way to be able to turn this on again without having to change the code (currently we have to create a DDP plugin and pass this through). There are definitely cases where we'd need to enable this flag. Any suggestions @awaelchli?

EDIT: just had a look at the Slack channel, and the discussion is happening there, with the gist being that exposing all DDP parameters would be quite ugly to add to the Trainer (as it means we would need to add all arguments across all plugins).
So is this fixed in 1.1.6?
@awaelchli anything left TODO here?
@Tiiiger Fixed is maybe not the right term here. We changed the default value of `find_unused_parameters`, and it can be set explicitly through the plugin:

```python
from pytorch_lightning.plugins import DDPPlugin

trainer = Trainer(..., plugins=[DDPPlugin(find_unused_parameters=True/False)])
```

@edenlightning as far as the regression problem goes, this seems solved. We found the reason for it.
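For context, a minimal sketch of the underlying PyTorch setting this plugin argument corresponds to; the single-process gloo group is only there so the snippet runs standalone, and this is an illustration rather than Lightning internals verbatim.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Single-process CPU process group, only so this snippet can run on its own.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)
# The same flag exists on torch's DDP wrapper: with find_unused_parameters=False,
# every parameter is expected to receive a gradient in each backward pass.
ddp_model = DistributedDataParallel(model, find_unused_parameters=True)
```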
There still seems to be the issue of why this causes an accuracy difference...
@awaelchli any idea on the accuracy difference? What is your recommendation for exposing this flag? Is there a way we can automate it? I wouldn't recommend exposing DDP flags as trainer args...
@edenlightning unfortunately no :(

I recommend keeping it as a plugin argument for now (as shown in my example above).

Automation: that's tricky, because the only way to know is to observe the gradient computation on the loss. This would mean we would have to be already training, but the flag needs to be set before training. Sometimes it can happen that we see the DDP error about unused parameters when not every parameter participates in computing the loss.

However, as you can see, the error message itself mentions in case (2) that it could simply be a silly mistake by the user not computing the loss correctly. This to me is a very strong hint that we cannot automate this choice. So in conclusion, given the above observation, I recommend keeping it as is.
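To illustrate why this check cannot happen up front, here is a hedged sketch (not a Lightning API) showing that unused parameters only become visible after a forward/backward pass has already run, at which point DDP has long been constructed.

```python
import torch

# Toy module with a branch that never contributes to the loss.
model = torch.nn.ModuleDict({
    "used": torch.nn.Linear(8, 1),
    "unused": torch.nn.Linear(8, 1),
})

x = torch.randn(4, 8)
loss = model["used"](x).sum()  # the "unused" branch is skipped entirely
loss.backward()

# Parameters that never received a gradient are exactly the ones that would
# need find_unused_parameters=True, but we only learn this after training starts.
unused = [name for name, p in model.named_parameters() if p.requires_grad and p.grad is None]
print(unused)  # ['unused.weight', 'unused.bias']
```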
Closing this for now!
🐛 Bug
Posted originally by @okuchaiev:
Has anyone observed a model performance degradation when switching from 1.1.3 to 1.1.4 and 1.1.5? On the plot below you can see exactly the same model/hyperparams trained using 1.1.3 (runs named enes3) and 1.1.5 (runs named enes5). You can see that 1.1.3 outperforms 1.1.5 consistently.
Please reproduce using the BoringModel
Currently not reproduced.
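Since no BoringModel reproduction was posted, below is a hedged sketch of the kind of A/B harness one might start from, assuming 2 GPUs and the 1.1.x Trainer API; note that this plain linear model has no unused parameters, so an actual repro would need a model closer to the original NeMo setup.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin


class RandomDataset(Dataset):
    """Random features, just enough to drive a training loop."""

    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Flip this between the 1.1.5 default (None) and the forced old behaviour,
    # run the script twice, and compare the logged training curves.
    plugins = [DDPPlugin(find_unused_parameters=True)]  # or: plugins = None

    trainer = pl.Trainer(
        gpus=2,
        accelerator="ddp",
        max_epochs=1,
        plugins=plugins,
    )
    trainer.fit(BoringModel(), DataLoader(RandomDataset(), batch_size=32))
```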
cc @ericharper