
Change non_eos_penalty to be consistent across OnPolicy trainers #2033

Merged
15 commits merged into huggingface:main on Sep 10, 2024

Conversation

@RylanSchaeffer (Contributor) commented Sep 7, 2024

What does this PR do?

This PR is designed to address this issue: #2012

To quickly recap, PPOv2Trainer and RLOOTrainer replace the scores of non-EOS outputs with a constant penalty value, whereas OnlineDPOTrainer subtracts a constant penalty from the scores of non-EOS outputs. After discussing with Quentin, I believe we want PPOv2Trainer and RLOOTrainer to be consistent with OnlineDPOTrainer.
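
For concreteness, here is a minimal sketch of the two conventions. The tensor values and the variable names (scores, contain_eos_token) are illustrative only, not the trainers' actual internals:

import torch

# Illustrative per-sample reward-model scores and a boolean mask marking
# which completions ended with the EOS token.
scores = torch.tensor([1.2, -0.3, 0.8, 2.1])
contain_eos_token = torch.tensor([True, False, True, False])

# Old behaviour (PPOv2Trainer / RLOOTrainer): replace the score of any
# completion that never produced EOS with a fixed penalty value.
penalty_reward_value = -1.0
replaced = torch.where(
    contain_eos_token, scores, torch.full_like(scores, penalty_reward_value)
)

# New behaviour (matching OnlineDPOTrainer): subtract a positive penalty
# from the score of any completion that never produced EOS.
missing_eos_penalty = 1.0
subtracted = scores.clone()
subtracted[~contain_eos_token] -= missing_eos_penalty

print(replaced)    # ~[1.2, -1.0, 0.8, -1.0]
print(subtracted)  # ~[1.2, -1.3, 0.8,  1.1]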

Before submitting

  • [x] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests?

No

Who can review?

@qgallouedec

@RylanSchaeffer (Contributor, Author)

@qgallouedec I think this is ready for your review. Can you please have a look and get back to me on any additional changes you want made? Thank you!

@qgallouedec (Member)

Thanks a lot, this PR makes sense.
One remark, though: the penalty should be positive, because it is subtracted.
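
(As an illustrative sanity check of the sign convention, with made-up numbers: a positive penalty lowers the score when subtracted, whereas a negative default would accidentally reward completions that never emit EOS.)

score = 0.8
missing_eos_penalty = 1.0           # positive: subtracting lowers the score
print(score - missing_eos_penalty)  # -0.2 -> penalised

wrong_penalty = -1.0                # negative: subtracting raises the score
print(score - wrong_penalty)        # 1.8 -> accidentally rewarded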

@qgallouedec (Member)

Before merging, I'd like to make sure that the results are still comparable. Not for all trainers, maybe just for RLOO? Do you have the resources to run an experiment?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@RylanSchaeffer (Contributor, Author) commented Sep 8, 2024

@qgallouedec I fixed the incorrect negative defaults. I will now run RLOO before and after the change. Does that sound like a sufficient comparison?

@RylanSchaeffer (Contributor, Author) commented Sep 8, 2024

(Old) Command to Replace Scores:

python -u examples/scripts/ppo/ppo_tldr.py \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--total_episodes 30000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--non_eos_penalty \
--stop_token eos \
--response_length 53

W&B Run: https://wandb.ai/rylan/huggingface/runs/3vk55y9v

New Command to Subtract Scores:

python -u examples/scripts/ppo/ppo_tldr.py \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--total_episodes 30000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--missing_eos_penalty 1.0 \
--stop_token eos \
--response_length 53

W&B Run: https://wandb.ai/rylan/huggingface/runs/9l8fvykd

Results

[image: W&B comparison of the two runs]

@qgallouedec How long would you like me to let these two run for?

@RylanSchaeffer (Contributor, Author)

[image: updated W&B comparison of the two runs]

Subtracting and replacing seem relatively consistent with one another.

@qgallouedec (Member)

Very nice, thanks a lot @RylanSchaeffer

@qgallouedec merged commit 2ee0b62 into huggingface:main on Sep 10, 2024
3 of 9 checks passed