
Optional Additional Loss to Center Reward Models' Outputs #1932

Merged: 13 commits merged into huggingface:main on Aug 17, 2024
Conversation

@RylanSchaeffer (Contributor) commented Aug 15, 2024

In issue #1931, I requested an optional additional loss to center the rewards output by a trained reward model.

This PR is a sketch of what this might look like.

Please let me know if this seems sensible and what changes, if any, are appropriate.
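For context, here is a minimal sketch of what such an auxiliary centering term could look like on top of the usual pairwise reward loss. This is illustrative only (the function and argument names are placeholders, not the exact diff in this PR); the idea is to penalize the batch mean of the rewards drifting away from zero.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen, rewards_rejected, center_rewards_coefficient=None):
    # Standard Bradley-Terry style loss on the chosen/rejected reward margin.
    loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    if center_rewards_coefficient is not None:
        # Optional auxiliary term: penalize the squared mean reward so the
        # trained reward model's outputs stay centered around zero.
        loss = loss + center_rewards_coefficient * torch.mean((rewards_chosen + rewards_rejected) ** 2)
    return loss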

@qgallouedec linked an issue on Aug 17, 2024 that may be closed by this pull request
@qgallouedec (Member)
I'm currently training a model with this new parameter to see what it looks like.

@RylanSchaeffer (Contributor, Author) commented Aug 17, 2024 via email

Co-authored-by: Quentin Gallouédec <[email protected]>
@qgallouedec (Member)
> I'm hesitant to set the default.

I wouldn't set it as default either.

@qgallouedec (Member)
This one #1932 (comment) is about suggesting a value in the doc, not setting it as default.

@RylanSchaeffer (Contributor, Author)
Right. I was more asking you for guidance about what default value we should choose. I think you and I are on the same page.

@RylanSchaeffer (Contributor, Author) commented Aug 17, 2024

@qgallouedec for my education and for posterity's sake, can you share your experimental results here (once completed)?

@qgallouedec (Member) commented Aug 17, 2024

Sure!

Training

Here are the wandb runs:

- center_rewards_coefficient = None: https://wandb.ai/huggingface/trl/runs/u6zob8ml (brown)
- center_rewards_coefficient = 0.01: https://wandb.ai/huggingface/trl/runs/d73qlevz (green)
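For reference, the new option amounts to a single config value. A hedged sketch of how it might be passed (the parameter name is taken from this thread; the other argument is a placeholder and the exact API may differ by TRL version):

from trl import RewardConfig

training_args = RewardConfig(
    output_dir="reward_modeling_anthropic_hh_crc",   # placeholder output directory
    center_rewards_coefficient=0.01,                  # None disables the extra centering loss
)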

(Screenshot, Aug 17 2024: wandb loss curves for the two runs.)

As expected, the loss is a bit larger with the centering term.

Playing with the trained reward model

- center_rewards_coefficient = None: https://huggingface.co/qgallouedec/reward_modeling_anthropic_hh
- center_rewards_coefficient = 0.01: https://huggingface.co/qgallouedec/reward_modeling_anthropic_hh_crc

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "qgallouedec/reward_modeling_anthropic_hh"  # without the coef
# model_id = "qgallouedec/reward_modeling_anthropic_hh_crc"  # with the coef

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The reward model is a sequence classifier with a single output logit (the scalar reward).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

dataset = load_dataset("Anthropic/hh-rlhf", split="test")

# Take a small batch of preference pairs from the test split.
examples = dataset[:8]

input_chosen = tokenizer(examples["chosen"], return_tensors="pt", padding=True)
input_rejected = tokenizer(examples["rejected"], return_tensors="pt", padding=True)

output_chosen = model(**input_chosen)
output_rejected = model(**input_rejected)

# Average reward over the batch for chosen and rejected completions;
# with the centering coefficient these should sit closer to zero.
mean_chosen = output_chosen.logits.mean().item()
mean_rejected = output_rejected.logits.mean().item()
print(mean_chosen, mean_rejected)
| center_rewards_coefficient | Rejected | Chosen  |
|----------------------------|----------|---------|
| None                       | -3.3083  | -2.1824 |
| 0.01                       | -0.6140  | 0.2871  |

So overall it's looking good!

@qgallouedec (Member)
I'll just add a little piece of documentation and we're good to merge!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec merged commit 42933fa into huggingface:main on Aug 17, 2024
9 checks passed
@qgallouedec (Member)
Thanks for your first contribution @RylanSchaeffer!
