Trainer(gradient_clip_algorithm='value') has no effect (from #6123) #6920
Comments
cc: @dhkim0225 as the original PR author
I've created a potential solution to this bug. One major change I made is re-raising the caught exception.
This is a bug. Already reported in #6807
Thanks for the info! Good to know that I have not misunderstood that part of the code.
@ceshine Thank you for reporting this! I apologize for the inconvenience caused.
Also add a temporary workaround to Lightning-AI#6807
@dhkim0225 Thanks for the reply. I'm afraid that the fix to the problem is more than changing that line. I'm also creating a PR with my changes (to use the CI pipeline). Hopefully, it'll get us to a proper solution faster.
Closing my PR since @ceshine's PR did all the things I wanted.
🐛 Bug
I couldn't find anywhere in the code where the `gradient_clip_algorithm` argument (implemented in #6123) got passed to the `Accelerator.clip_gradients` method, and suspected that the default algorithm (`GradClipAlgorithmType.NORM`) is always used no matter what. After a brief investigation, I believe I've confirmed that this is the case and that the original test case couldn't correctly detect it.

I'm not sure how to properly fix this bug yet, but would like to issue a warning to other users: only clipping by norm works at this moment.
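Until the argument is actually honoured, a possible stop-gap on the user side is to leave Lightning's own clipping disabled (`gradient_clip_val=0`) and clip by value manually from the `on_after_backward` hook. This is only a minimal sketch assuming full-precision training; with 16-bit/AMP the gradients visible in that hook are still scaled, so it would need adjusting:

```python
import torch
import pytorch_lightning as pl


class ValueClippedModule(pl.LightningModule):
    def __init__(self, clip_value: float = 1e-5):
        super().__init__()
        self.clip_value = clip_value
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def on_after_backward(self):
        # Clip every gradient entry to [-clip_value, clip_value] ourselves,
        # since gradient_clip_algorithm='value' is currently ignored.
        torch.nn.utils.clip_grad_value_(self.parameters(), self.clip_value)


# Keep the built-in clipping off so the norm-based default doesn't also run.
trainer = pl.Trainer(gradient_clip_val=0, max_epochs=1)
```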
To Reproduce
This commit firstly disabled the suppression of `AssertionError` in `Trainer.run_train`, and then tests whether the maximum gradient value is almost the same as the set 1e-5 threshold. I ran the command `pytest tests/trainer/test_trainer.py -k "test_gradient_clipping_by_value and not test_gradient_clipping_by_value_fp16"` and the test failed.

If we change the default algorithm in `PrecisionPlugin.clip_gradients` to `GradClipAlgorithmType.VALUE`, we will pass this test case. Alternatively, we can directly assert that the clip algorithm is by value inside `PrecisionPlugin.clip_gradients`; we'll then get an `AssertionError`.
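A similar check can be done from user code without editing the installed package. This is only a rough sketch: it assumes `PrecisionPlugin` is importable from `pytorch_lightning.plugins` and forwards `*args`/`**kwargs` blindly, because the exact `clip_gradients` signature may differ between versions:

```python
from pytorch_lightning.plugins import PrecisionPlugin

_original_clip_gradients = PrecisionPlugin.clip_gradients


def _spy_clip_gradients(self, *args, **kwargs):
    # Print whatever reaches the plugin; with the bug present, the by-value
    # algorithm never shows up here regardless of the Trainer flag.
    print("clip_gradients called with:", args, kwargs)
    return _original_clip_gradients(self, *args, **kwargs)


PrecisionPlugin.clip_gradients = _spy_clip_gradients
```

Running a short `trainer.fit(...)` with `gradient_clip_algorithm='value'` and watching what gets printed should show whether the setting ever reaches the plugin.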
By now we can clearly see that:
- `gradient_clip_algorithm` changes nothing in the training procedure.
- The `AssertionError` in the original test case will be ignored anyway because of the design of `Trainer.run_train`, as sketched below. (I'm not entirely sure of this one because I'm not familiar with the test environment setup. It appears so in my local environment for sure.)
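For context, this is roughly the pattern referred to above. It is a simplified schematic rather than the actual Lightning source, and the helper names inside it are placeholders:

```python
# Simplified schematic (not the actual Lightning source) of the behaviour
# described above: run_train wraps the training loop in a broad try/except,
# so an AssertionError raised inside a test's training_step can be swallowed
# instead of failing the test.
def run_train(self):
    try:
        self._run_training_loop()   # placeholder for the real loop call
    except BaseException:
        self._teardown()            # placeholder cleanup
        # The exception is handled here rather than propagated, which is why
        # the reproduction commit re-raises it so the assertion can surface.
```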
Environment
- CUDA:
    - GPU:
        - GeForce RTX 2070
    - available: True
    - version: 11.0
- Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.3.0rc0
    - tqdm: 4.49.0
- System:
    - OS: Linux
    - architecture:
        - 64bit
    - processor: x86_64
    - python: 3.7.9
    - version: #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021