Optional Additional Loss to Center Reward Models' Outputs #1932
Conversation
Added a reference. Co-authored-by: Quentin Gallouédec <[email protected]>
I'm currently training a model with this new parameter to see what it looks like.
I'm hesitant to set a default. The value you chose is correct, but I don't know whether the change should be applied to all users without them intentionally enabling it. What do you think?
Co-authored-by: Quentin Gallouédec <[email protected]>
I wouldn't set it as default either.
This one #1932 (comment) is about suggesting a value in the doc, not setting it as default.
Right. I was asking more for guidance about what default value we should choose. I think you and I are on the same page.
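For concreteness, opting in would look something like the sketch below, assuming the option ends up exposed as a `center_rewards_coefficient` field on `RewardConfig`; the field name and the 0.01 value here are illustrative, not a final recommendation.

```python
from trl import RewardConfig

# Sketch only: `center_rewards_coefficient` is the assumed name of the new
# option, and 0.01 is an illustrative value rather than an endorsed default.
config = RewardConfig(
    output_dir="reward_modeling_anthropic_hh",
    center_rewards_coefficient=0.01,  # leave unset (None) to keep today's behavior
)
```

Leaving the field unset by default keeps the current loss unchanged, so nobody is affected unless they opt in.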
@qgallouedec for my education and for posterity's sake, can you share your experimental results here (once completed)?
Sure!

Training: here are the wandb runs. As expected, the loss is a bit larger.

Playing with the trained reward model:
```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "qgallouedec/reward_modeling_anthropic_hh"  # without the coef
# model_id = "qgallouedec/reward_modeling_anthropic_hh_crc"  # with the coef

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Score a handful of chosen/rejected pairs from the Anthropic HH test split.
dataset = load_dataset("Anthropic/hh-rlhf", split="test")
examples = dataset[:8]

input_chosen = tokenizer(examples["chosen"], return_tensors="pt", padding=True)
input_rejected = tokenizer(examples["rejected"], return_tensors="pt", padding=True)
output_chosen = model(**input_chosen)
output_rejected = model(**input_rejected)

# Compare the mean reward assigned to chosen vs. rejected completions.
mean_chosen = output_chosen.logits.mean().item()
mean_rejected = output_rejected.logits.mean().item()
print(mean_chosen, mean_rejected)
```
So overall it's looking good!
I'll just add a little piece of documentation and we're good to merge!
Thanks for your first contribution @RylanSchaeffer!
In issue #1931, I requested an optional additional loss function to center the rewards output by a trained reward model.
This PR is a sketch of what this might look like.
Please let me know if this seems sensible and what changes, if any, are appropriate.
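To make the proposal concrete, the idea is to add a small penalty on top of the usual pairwise loss that nudges the mean reward toward zero. Below is a minimal sketch using an assumed `center_rewards_coefficient` knob and a squared-mean penalty; the exact form in the diff may differ.

```python
import torch
import torch.nn.functional as F


def reward_loss(rewards_chosen, rewards_rejected, center_rewards_coefficient=None):
    # Standard pairwise (Bradley-Terry style) reward-modeling loss.
    loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    if center_rewards_coefficient is not None:
        # Optional centering term: penalize rewards drifting away from zero.
        loss = loss + center_rewards_coefficient * torch.mean(
            (rewards_chosen + rewards_rejected) ** 2
        )
    return loss


# Toy example with dummy scalar rewards for a batch of three pairs.
chosen = torch.tensor([1.2, 0.8, 1.5])
rejected = torch.tensor([0.3, -0.1, 0.9])
print(reward_loss(chosen, rejected))                                   # pairwise loss only
print(reward_loss(chosen, rejected, center_rewards_coefficient=0.01))  # with centering term
```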