[t5] Fix negative kl issue #262
Conversation
```python
if not self.is_encoder_decoder:
    output = generation[(1 - mask).sum() :]  # remove padding
else:
    output = generation
```
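For reference, here is a minimal standalone illustration of what this slicing does (not the trl implementation itself; the tensors are made up):

```python
import torch

# With left-padded prompts for a decoder-only model, the attention mask is 0 on
# the pad tokens, so (1 - mask).sum() counts the leading pads and slicing them
# off leaves only the real tokens.
generation = torch.tensor([0, 0, 0, 15, 42, 7, 99])  # hypothetical ids, 3 left pads
mask = torch.tensor([0, 0, 0, 1, 1, 1, 1])           # matching attention mask

output = generation[(1 - mask).sum() :]
print(output)  # tensor([15, 42,  7, 99])
```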
maybe we could remove the special token here?
I am not sure if this is correct as the token is removed here right after: https://github.com/lvwerra/trl/blob/ed87942a47f26d15e823ca7674737be02e48cc0a/trl/trainer/ppo_trainer.py#L832
Also made a run with `generation[1:]` and the KL becomes negative: https://wandb.ai/younesbelkada/trl/runs/vjbydeqv - so I think we shouldn't remove the special token here.
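One way to see why a one-token shift can push the estimate negative: the per-token KL penalty in PPO-style training is approximated as the difference of log-probabilities under the policy and the reference model, and that approximation is only meaningful when both models score exactly the same response tokens. A rough sketch (variable names are illustrative, not the exact trl internals):

```python
import torch

# Approximate per-token KL penalty: log p_policy(token) - log p_ref(token).
def kl_penalty(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    return logprobs - ref_logprobs

logprobs = torch.tensor([-1.2, -0.8, -2.0])      # policy log-probs of the response tokens
ref_logprobs = torch.tensor([-1.3, -0.9, -2.1])  # reference log-probs of the same tokens
print(kl_penalty(logprobs, ref_logprobs).sum())  # small positive value, as expected

# If the stored response is shifted by one token (e.g. a stripped decoder-start
# token), policy and reference end up scoring mismatched sequences and this
# estimate can turn strongly negative, even though a true KL is always >= 0.
```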
Isn't it suspicious that such a slight change breaks the training? If not, why is that expected? I'm asking because I'm having trouble with negative KL divergence with my own T5 model, even on the v0.4.1 release.
With other encoder-decoder models such as MarianMT (BART architecture) I'm still seeing negative KL; in fact it becomes increasingly negative: -5, -19, -24.
I'm seeing the negative KL warning with Alpaca-7B on the sentiment script.
Can you try with non-batched generation as suggested in #256 (comment) and let us know if this works?
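For context, a rough sketch of what non-batched generation looks like, loosely following the gpt2 sentiment example; `ppo_trainer`, `query_tensors`, `gen_len`, and `generation_kwargs` are assumed to be defined as in that script:

```python
# Generate responses one query at a time instead of batching them.
response_tensors = []
for query in query_tensors:
    response = ppo_trainer.generate(query, **generation_kwargs)
    response_tensors.append(response.squeeze()[-gen_len:])
```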
In gpt2_sentiment.py, I tried modifying the generation_kwargs a bit, e.g. setting top_p to 0.9, and got the warning that my KL is negative, possibly because the generation kwargs are not set correctly. Is there any specific restriction on the generation kwargs during PPO? Gentle ping: @younesbelkada
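For reference, the sampling settings the gpt2 sentiment example ships with look roughly like the following (an approximation of the script's config, not a guaranteed-safe recipe; check the script in your trl version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generation kwargs roughly as used by the gpt2 sentiment example:
generation_kwargs = {
    "min_length": -1,                        # do not force a minimum length
    "top_k": 0.0,                            # no top-k filtering
    "top_p": 1.0,                            # no nucleus filtering
    "do_sample": True,                       # sample instead of greedy decoding
    "pad_token_id": tokenizer.eos_token_id,  # gpt2 has no dedicated pad token
}
```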
Fixes #256
This PR fixes issues related to negative KL in the T5 sentiment example. The first fix is to the sentiment script, which had been incorrectly ported.
Before this PR, the padding side of tokenizers was always hardcoded to `left` in `_batched_generate`. For encoder-decoder models it should be set to the tokenizer's native `padding_side` (i.e. `right` for T5), as the padding is applied to the encoder tokens and these models were trained with that padding side. I think the culprit is the way the positional attention bias is computed, which does not take the starting position into account if `padding_side=left`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L436-L451

Thus forcing `padding_side=left` for encoder-decoder models should probably be avoided, as most encoder-decoder models pad the encoder input on the right (to verify).
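A hedged sketch of the idea behind the fix (not the exact trl diff; the helper name and signature are made up for illustration):

```python
from transformers import PreTrainedModel, PreTrainedTokenizerBase

def batched_generate(model: PreTrainedModel, tokenizer: PreTrainedTokenizerBase,
                     prompts, **generation_kwargs):
    """Sketch: only force left padding for decoder-only models."""
    is_encoder_decoder = model.config.is_encoder_decoder

    # Decoder-only models need left padding so generation continues right after
    # the prompt; encoder-decoder models (e.g. T5) keep their native padding
    # side, since the encoder was trained with right-padded inputs.
    original_side = tokenizer.padding_side
    if not is_encoder_decoder:
        tokenizer.padding_side = "left"

    batch = tokenizer(prompts, padding=True, return_tensors="pt")
    outputs = model.generate(**batch, **generation_kwargs)

    tokenizer.padding_side = original_side  # restore the tokenizer's setting
    return outputs
```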
To illustrate the fix, compare a run on `main` with the modification on the example script only against a run with this PR: you can see that using `padding_side=left` led to unstable KL, whereas the proposed fix seems to lead to a smoother KL.

cc @lvwerra