Change `non_eos_penalty` to be consistent across OnPolicy trainers #2033
Conversation
@qgallouedec I think this is ready for your review. Can you please have a look and get back to me on any additional changes you want made? Thank you!
Thanks a lot, this PR makes sense.
Before merging, I'd like to make sure that the results are still comparable. Not for all trainers, maybe just for RLOO? Do you have resources to run an experiment?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@qgallouedec I fixed the incorrect negative defaults. I will now run RLOO before and after the change. Does that sound like a sufficient comparison?
(Old) Command to Replace Scores:
W&B Run: https://wandb.ai/rylan/huggingface/runs/3vk55y9v

New Command to Subtract Scores:
W&B Run: https://wandb.ai/rylan/huggingface/runs/9l8fvykd

Results: @qgallouedec How long would you like me to let these two run for?
Very nice, thanks a lot @RylanSchaeffer
What does this PR do?
This PR is designed to address this issue: #2012
To quickly recap, `PPOv2Trainer` and `RLOOTrainer` replace non-EOS outputs' scores with a constant penalty, whereas `OnlineDPOTrainer` subtracts a constant penalty from non-EOS outputs' scores. After discussing with Quentin, I believe we want `PPOv2Trainer` and `RLOOTrainer` to be consistent with `OnlineDPOTrainer`.
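To make the difference concrete, here is a minimal sketch of the two behaviors; the names (`scores`, `contains_eos`, `missing_eos_penalty`) are illustrative and not TRL's exact internals:

```python
# Hedged sketch of the two penalty behaviors discussed in this PR.
# Assumption: each completion has a scalar reward score and a flag for
# whether it terminated with an EOS token.

missing_eos_penalty = 1.0
scores = [2.0, 0.5, 3.0]            # per-completion reward scores
contains_eos = [True, False, True]  # did the completion end with EOS?

# Old PPOv2Trainer / RLOOTrainer behavior: replace the score outright
# with a fixed (negative) constant when EOS is missing.
replaced = [s if eos else -missing_eos_penalty
            for s, eos in zip(scores, contains_eos)]

# OnlineDPOTrainer behavior (the one this PR standardizes on):
# subtract the penalty from the score instead of replacing it.
subtracted = [s if eos else s - missing_eos_penalty
              for s, eos in zip(scores, contains_eos)]

print(replaced)    # [2.0, -1.0, 3.0]
print(subtracted)  # [2.0, -0.5, 3.0]
```

The subtractive form preserves the relative ordering among non-EOS completions, whereas replacement collapses them all to the same constant.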
Before submitting
Who can review?
@qgallouedec