[core] Officially Support Reward Modeling #303
Conversation
The documentation is not available anymore as the PR was closed or merged.
Generally looks really good and clean to me. Left a few comments to try to make it a bit more user-friendly.
Maybe @lewtun would also be interested to have a look to see if there is feedback from the H4 team.
Co-authored-by: Leandro von Werra <[email protected]>
Awesome feature @younesbelkada 🔥
I left some tiny nits and a feature request to compute accuracy by default :)
## Using the `RewardTrainer`

After standardizing your dataset, you can use the `RewardTrainer` as a classic Hugging Face Trainer.
Maybe explain what format the raw dataset should have here? E.g. you could use samples of the StackExchange or Anthropic dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf) as a guide
I added a few lines in 4bcd96e, but I'm not sure if what I said is 100% correct, would love to have a second look here!
Will also add an example now. EDIT: added it.
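For illustration, here is a minimal preprocessing sketch along these lines, assuming the Anthropic/hh-rlhf `chosen`/`rejected` columns and the `input_ids_j`/`input_ids_k` naming used in this PR; the helper name, model choice, and dataset slice are placeholders, not part of the PR:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

# Each Anthropic/hh-rlhf sample has a preferred ("chosen") and a
# dispreferred ("rejected") completion for the same prompt.
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1%]")

def to_paired_features(examples):
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_j": chosen["input_ids"],
        "attention_mask_j": chosen["attention_mask"],
        "input_ids_k": rejected["input_ids"],
        "attention_mask_k": rejected["attention_mask"],
    }

dataset = dataset.map(to_paired_features, batched=True, remove_columns=dataset.column_names)
```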
from peft import get_peft_model


class RewardTrainer(Trainer):
Since `accuracy` is the most common metric for evaluating reward models, would it make sense to provide it as a default in `compute_metrics`? E.g. something like this should work:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, _ = eval_pred
    # Here, predictions is rewards_chosen and rewards_rejected.
    # We want to see how much of the time rewards_chosen > rewards_rejected.
    predictions = np.argmax(predictions, axis=0)
    labels = np.zeros(predictions.shape)
    return accuracy.compute(predictions=predictions, references=labels)
This would add `evaluate` as an additional dependency to the library; we could also have it as an optional dependency, similar to `peft`! For me it's totally fine to have it as a core dependency, but I want to hear @lvwerra's opinion to make sure we are aligned on this.
Ah true, I think having it as an optional dep would be the way to go (unless `evaluate` is already so light that its deps are covered by the `trl` core deps).
Accuracy is not a very hard metric, maybe we can just build it from scratch here :)
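For reference, a from-scratch version could look roughly like this; it is a sketch that mirrors the snippet above (assuming the predictions stack `rewards_chosen` and `rewards_rejected` along the first axis), not necessarily what the PR ends up shipping:

```python
import numpy as np

def compute_accuracy(eval_pred):
    # Assumption: predictions stacks (rewards_chosen, rewards_rejected) along
    # the first axis, so index 0 winning means the chosen response scored higher.
    predictions, _ = eval_pred
    predictions = np.argmax(predictions, axis=0)
    accuracy = float(np.mean(predictions == 0))
    return {"accuracy": accuracy}
```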
Co-authored-by: lewtun <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: lewtun <[email protected]>
Just a few small comments, otherwise this is good to go!
trl/trainer/reward_trainer.py (Outdated)
compute_metrics (`Callable[[transformers.EvalPrediction], Dict]`):
    The metrics to use for evaluation.
Default is accuracy.
eval_dataset,
tokenizer,
model_init,
compute_metrics,
I am not sure this works: if we overwrite the class method with our own metric, the compute_metrics in the parent class is never used, no?
What about defining a `compute_accuracy` function outside the class and passing it if `compute_metrics` from the init is `None`?
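Something like the sketch below, as an illustration of that suggestion (the argument handling is simplified and not the exact final implementation):

```python
from transformers import Trainer

def compute_accuracy(eval_pred):
    ...  # e.g. the from-scratch accuracy sketched above, defined outside the class

class RewardTrainer(Trainer):
    def __init__(self, *args, compute_metrics=None, **kwargs):
        # Fall back to the built-in accuracy metric only when the user
        # did not supply their own compute_metrics.
        if compute_metrics is None:
            compute_metrics = compute_accuracy
        super().__init__(*args, compute_metrics=compute_metrics, **kwargs)
```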
Sounds like a great plan!
What does this PR do?
With reward modeling being an important piece of the PPO algorithm, it would be cool to support an "official" `RewardTrainer` in `trl`.
The `RewardTrainer` simply inherits from `transformers.Trainer`, but with some constraints. Users are responsible for creating a paired dataset that contains `input_ids_j`, `input_ids_k`, `attention_mask_j`, `attention_mask_k` if they want to use the default `RewardDataCollatorWithPadding` data collator.
I also propose adding the possibility to create the PEFT model under the hood if a user passes a `PeftConfig` to the Trainer.
This PR adds a first version of it, together with tests and documentation.
TODO: update the README & the `reward_trainer.mdx` file.
cc @lvwerra
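For readers skimming this thread, here is a rough, self-contained usage sketch based on the description above; the exact argument names (e.g. `peft_config`, `tokenizer`) and defaults may differ from what finally lands in `trl`, and the model and toy data are purely illustrative:

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer  # the class added by this PR

model_name = "gpt2"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# A reward model is a single scalar head on top of the base model.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Toy paired dataset in the format described above; in practice this would
# come from tokenizing a preference dataset such as Anthropic/hh-rlhf.
chosen = tokenizer(["The capital of France is Paris."])
rejected = tokenizer(["The capital of France is Lyon."])
train_dataset = Dataset.from_dict(
    {
        "input_ids_j": chosen["input_ids"],
        "attention_mask_j": chosen["attention_mask"],
        "input_ids_k": rejected["input_ids"],
        "attention_mask_k": rejected["attention_mask"],
    }
)

trainer = RewardTrainer(
    model=model,
    args=TrainingArguments(output_dir="reward_model", per_device_train_batch_size=1),
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    peft_config=LoraConfig(task_type="SEQ_CLS"),  # optional: the PEFT model is created under the hood
)
trainer.train()
```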