[core] Officially Support Reward Modeling #303
Conversation
The documentation is not available anymore as the PR was closed or merged.
Generally looks really good and clean to me. Left a few comments to try to make it a bit more user-friendly.
Maybe @lewtun would also be interested to have a look to see if there is feedback from the H4 team.
Co-authored-by: Leandro von Werra <[email protected]>
Awesome feature @younesbelkada 🔥
I left some tiny nits and a feature request to compute accuracy by default :)
## Using the `RewardTrainer`

After standardizing your dataset, you can use the `RewardTrainer` as a classic Hugging Face Trainer.
Maybe explain what format the raw dataset should have here? E.g. you could use samples of the StackExchange or Anthropic dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf) as a guide
I added a few lines in 4bcd96e, but I'm not sure if what I said is 100% correct, would love to have a second look here!
Will also add an example now. EDIT: added it.
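For illustration, here is a minimal preprocessing sketch along these lines, assuming the Anthropic/hh-rlhf `chosen`/`rejected` columns and the `input_ids_j`/`input_ids_k` naming used in this PR; the helper name, model choice, and dataset slice are placeholders, not part of the PR:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

# Each Anthropic/hh-rlhf sample has a preferred ("chosen") and a
# dispreferred ("rejected") completion for the same prompt.
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1%]")

def to_paired_features(examples):
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_j": chosen["input_ids"],
        "attention_mask_j": chosen["attention_mask"],
        "input_ids_k": rejected["input_ids"],
        "attention_mask_k": rejected["attention_mask"],
    }

dataset = dataset.map(to_paired_features, batched=True, remove_columns=dataset.column_names)
```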
from peft import get_peft_model


class RewardTrainer(Trainer):
Since `accuracy` is the most common metric for evaluating reward models, would it make sense to provide it as a default in `compute_metrics`? E.g. something like this should work:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, _ = eval_pred
    # Here, predictions is rewards_chosen and rewards_rejected.
    # We want to see how much of the time rewards_chosen > rewards_rejected.
    predictions = np.argmax(predictions, axis=0)
    labels = np.zeros(predictions.shape)
    return accuracy.compute(predictions=predictions, references=labels)
This would add `evaluate` as an additional dependency to the library; we could also have it as an optional dependency, similar to `peft`! For me it's totally fine to have it as a core dependency, but I want to hear @lvwerra's opinion to make sure we are aligned on this.
Ah true, I think having it as an optional dep would be the way to go (unless `evaluate` is already so light that its deps are covered by the `trl` core deps).
Accuracy is not a very hard metric, maybe we can just build it from scratch here :)
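For reference, a from-scratch version could look roughly like this; it is a sketch that mirrors the snippet above (assuming the predictions stack `rewards_chosen` and `rewards_rejected` along the first axis), not necessarily what the PR ends up shipping:

```python
import numpy as np

def compute_accuracy(eval_pred):
    # Assumption: predictions stacks (rewards_chosen, rewards_rejected) along
    # the first axis, so index 0 winning means the chosen response scored higher.
    predictions, _ = eval_pred
    predictions = np.argmax(predictions, axis=0)
    accuracy = float(np.mean(predictions == 0))
    return {"accuracy": accuracy}
```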
Co-authored-by: lewtun <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: lewtun <[email protected]>
Just a few small comments, otherwise this is good to go!
trl/trainer/reward_trainer.py (Outdated)
compute_metrics (`Callable[[transformers.EvalPrediction], Dict]`):
    The metrics to use for evaluation.
Default is accuracy.
eval_dataset,
tokenizer,
model_init,
compute_metrics,
I am not sure this works: if we overwrite the class method with our own metric, the compute_metrics in the parent class is never used, no?
What about defining a `compute_accuracy` function outside the class and passing it if `compute_metrics` from the init is `None`?
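Something like the sketch below, as an illustration of that suggestion (the argument handling is simplified and not the exact final implementation):

```python
from transformers import Trainer

def compute_accuracy(eval_pred):
    ...  # e.g. the from-scratch accuracy sketched above, defined outside the class

class RewardTrainer(Trainer):
    def __init__(self, *args, compute_metrics=None, **kwargs):
        # Fall back to the built-in accuracy metric only when the user
        # did not supply their own compute_metrics.
        if compute_metrics is None:
            compute_metrics = compute_accuracy
        super().__init__(*args, compute_metrics=compute_metrics, **kwargs)
```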
Sounds like a great plan!
What does this PR do?
With reward modeling being an important piece of the PPO algorithm, it would be cool to support an "official" `RewardTrainer` in `trl`.
The `RewardTrainer` simply inherits from `transformers.Trainer`, but with some constraints. Users are responsible for creating a paired dataset that contains `input_ids_j`, `input_ids_k`, `attention_mask_j`, `attention_mask_k` if they want to use the default `RewardDataCollatorWithPadding` data collator.
I also propose adding the possibility to create the PEFT model under the hood if a user passes a `PeftConfig` to the Trainer.
This PR adds a first version of it, together with tests and documentation.
TODO: update the README & the `reward_trainer.mdx` file.
cc @lvwerra
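For readers skimming this thread, here is a rough, self-contained usage sketch based on the description above; the exact argument names (e.g. `peft_config`, `tokenizer`) and defaults may differ from what finally lands in `trl`, and the model and toy data are purely illustrative:

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer  # the class added by this PR

model_name = "gpt2"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# A reward model is a single scalar head on top of the base model.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Toy paired dataset in the format described above; in practice this would
# come from tokenizing a preference dataset such as Anthropic/hh-rlhf.
chosen = tokenizer(["The capital of France is Paris."])
rejected = tokenizer(["The capital of France is Lyon."])
train_dataset = Dataset.from_dict(
    {
        "input_ids_j": chosen["input_ids"],
        "attention_mask_j": chosen["attention_mask"],
        "input_ids_k": rejected["input_ids"],
        "attention_mask_k": rejected["attention_mask"],
    }
)

trainer = RewardTrainer(
    model=model,
    args=TrainingArguments(output_dir="reward_model", per_device_train_batch_size=1),
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    peft_config=LoraConfig(task_type="SEQ_CLS"),  # optional: the PEFT model is created under the hood
)
trainer.train()
```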