
Change non_eos_penalty to be consistent across OnPolicy trainers #2033

Merged
15 commits merged into huggingface:main on Sep 10, 2024

Conversation

@RylanSchaeffer (Contributor) commented Sep 7, 2024

What does this PR do?

This PR is designed to address this issue: #2012

To quickly recap, PPOv2Trainer and RLOOTrainer replace the scores of non-EOS outputs with a constant penalty value, whereas OnlineDPOTrainer subtracts a constant penalty from the scores of non-EOS outputs. After discussing with Quentin, I believe we want PPOv2Trainer and RLOOTrainer to be consistent with OnlineDPOTrainer.
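
For concreteness, here is a minimal sketch of the two conventions. The tensor values and the variable names (scores, contain_eos_token) are illustrative only, not the trainers' actual internals:

import torch

# Illustrative per-sample reward-model scores and a boolean mask marking
# which completions ended with the EOS token.
scores = torch.tensor([1.2, -0.3, 0.8, 2.1])
contain_eos_token = torch.tensor([True, False, True, False])

# Old behaviour (PPOv2Trainer / RLOOTrainer): replace the score of any
# completion that never produced EOS with a fixed penalty value.
penalty_reward_value = -1.0
replaced = torch.where(
    contain_eos_token, scores, torch.full_like(scores, penalty_reward_value)
)

# New behaviour (matching OnlineDPOTrainer): subtract a positive penalty
# from the score of any completion that never produced EOS.
missing_eos_penalty = 1.0
subtracted = scores.clone()
subtracted[~contain_eos_token] -= missing_eos_penalty

print(replaced)    # ~[1.2, -1.0, 0.8, -1.0]
print(subtracted)  # ~[1.2, -1.3, 0.8,  1.1]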

Before submitting

  • [x] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests?

No

Who can review?

@qgallouedec

@RylanSchaeffer (Contributor, Author)

@qgallouedec I think this is ready for your review. Can you please have a look and get back to me on any additional changes you want made? Thank you!

@qgallouedec (Member)

Thanks a lot, this PR makes sense.
One remark, though: the penalty should be positive, because it is subtracted.
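
(As an illustrative sanity check of the sign convention, with made-up numbers: a positive penalty lowers the score when subtracted, whereas a negative default would accidentally reward completions that never emit EOS.)

score = 0.8
missing_eos_penalty = 1.0           # positive: subtracting lowers the score
print(score - missing_eos_penalty)  # -0.2 -> penalised

wrong_penalty = -1.0                # negative: subtracting raises the score
print(score - wrong_penalty)        # 1.8 -> accidentally rewarded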

@qgallouedec (Member)

Before merging, I'd like to make sure that the results are still comparable. Not for all trainers, maybe just for RLOO? Do you have the resources to run an experiment?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@RylanSchaeffer (Contributor, Author) commented Sep 8, 2024

@qgallouedec I fixed the incorrect negative defaults. I will now run RLOO before and after the change. Does that sound like a sufficient comparison?

@RylanSchaeffer (Contributor, Author) commented Sep 8, 2024

(Old) Command to Replace Scores:

python -u examples/scripts/ppo/ppo_tldr.py \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--total_episodes 30000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--non_eos_penalty \
--stop_token eos \
--response_length 53

W&B Run: https://wandb.ai/rylan/huggingface/runs/3vk55y9v

New Command to Subtract Scores:

python -u examples/scripts/ppo/ppo_tldr.py \
--learning_rate 3e-6 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--total_episodes 30000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--missing_eos_penalty 1.0 \
--stop_token eos \
--response_length 53

W&B Run: https://wandb.ai/rylan/huggingface/runs/9l8fvykd

Results

[image: W&B comparison of the two runs]

@qgallouedec How long would you like me to let these two run for?

@RylanSchaeffer (Contributor, Author)

[image: updated W&B comparison of the two runs]

Subtracting and replacing seem relatively consistent with one another.

@qgallouedec (Member)

Very nice, thanks a lot @RylanSchaeffer

@qgallouedec merged commit 2ee0b62 into huggingface:main on Sep 10, 2024
3 of 9 checks passed