🗺️ Implementation DiscoPOP Loss #2323

Merged
merged 20 commits into from
Nov 18, 2024
Changes from 2 commits
5 changes: 5 additions & 0 deletions docs/source/dpo_trainer.mdx
@@ -150,6 +150,7 @@ The DPO algorithm supports several loss functions. The loss function can be set
| `"sppo_hard"` | The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. |
| `"aot"` or `loss_type="aot_pair"` | The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size. |
| `"apo_zero"` or `loss_type="apo_down"` | The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. |
| `"discopop"` | The [DiscoPOP](https://huggingface.co/papers/2406.08414) paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0). To use this discovered loss, set `loss_type="discopop"` in the [`DPOConfig`]. |

### Label smoothing

@@ -167,6 +168,10 @@ The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterativ

The [WPO](https://huggingface.co/papers/2406.11827) paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the `use_weighting` flag to `True` in the [`DPOConfig`].

### DiscoPOP loss

The [DiscoPOP](https://huggingface.co/papers/2406.08414) paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0). To use this discovered loss, set `loss_type="discopop"` in the [`DPOConfig`]. Additionally, you can tune the `discopop_tau` value to adjust the shape of the DiscoPOP loss; however, the authors recommend the default value `discopop_tau=0.05`.
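
As an illustrative sketch (the model and dataset below are placeholder choices, not prescribed by the paper or this change; only `loss_type` and `discopop_tau` select the DiscoPOP behavior), a DiscoPOP training run could look like:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Illustrative model and preference dataset; substitute your own.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="Qwen2-0.5B-DiscoPOP",
    loss_type="discopop",  # select the DiscoPOP (log-ratio modulated) loss
    discopop_tau=0.05,     # temperature controlling the loss shape; paper-recommended default
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```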

Member

Suggested change
### DiscoPOP loss
The [DiscoPOP](https://huggingface.co/papers/2406.08414) paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0). To use this discovered loss, set `loss_type="discopop"` in the [`DPOConfig`]. Additionally, you can tune the `discopop_tau` value to adjust the shape of the DiscoPOP loss; however, the authors recommend the default value `discopop_tau=0.05`.

Can you make the final remark about discopop_tau in the table instead?

### For Mixture of Experts Models: Enabling the auxiliary loss

MOEs are the most efficient if the load is about equally distributed between experts.
1 change: 1 addition & 0 deletions tests/test_dpo_trainer.py
@@ -196,6 +196,7 @@ def setUp(self):
["t5", "exo_pair", True],
["gpt2", "apo_zero", True],
["t5", "apo_down", False],
["gpt2", "discopop", False],
]
)
def test_dpo_trainer(self, name, loss_type, pre_compute):
2 changes: 2 additions & 0 deletions tests/test_trainers_args.py
@@ -163,6 +163,7 @@ def test_dpo(self):
ref_model_mixup_alpha=0.5,
ref_model_sync_steps=32,
rpo_alpha=0.5,
discopop_tau=0.1,
)
trainer = DPOTrainer(
model="gpt2", ref_model="gpt2", args=training_args, train_dataset=dataset, processing_class=tokenizer
@@ -193,6 +194,7 @@ def test_dpo(self):
self.assertEqual(trainer.args.ref_model_mixup_alpha, 0.5)
self.assertEqual(trainer.args.ref_model_sync_steps, 32)
self.assertEqual(trainer.args.rpo_alpha, 0.5)
self.assertEqual(trainer.args.discopop_tau, 0.1)

def test_kto(self):
tokenizer = AutoTokenizer.from_pretrained("gpt2")
1 change: 1 addition & 0 deletions trl/commands/scripts
6 changes: 6 additions & 0 deletions trl/trainer/dpo_config.py
@@ -65,6 +65,7 @@ class DPOConfig(TrainingArguments):
- `"aot_pair"`: AOT loss for unpaired datasets from the [AOT](https://huggingface.co/papers/2406.05882) paper.
- `"apo_zero"`: APO-zero loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
- `"apo_down"`: APO-down loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
- `"discopop"`: DiscoPOP (a.k.a Log-Ratio Modulated Loss, LRML) loss from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper.
Member

It's sorted by date

Suggested change
- `"apo_zero"`: APO-zero loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
- `"apo_down"`: APO-down loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
- `"discopop"`: DiscoPOP (a.k.a Log-Ratio Modulated Loss, LRML) loss from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper.
- `"discopop"`: DiscoPOP (a.k.a Log-Ratio Modulated Loss, LRML) loss from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper.
- `"apo_zero"`: APO-zero loss from the [APO](https://huggingface.co/papers/2408.06266) paper.
- `"apo_down"`: APO-down loss from the [APO](https://huggingface.co/papers/2408.06266) paper.

Member

Why not "lrm" instead by the way?

Contributor Author

I don't really have a clear answer to this. While lrml was the name proposed by the LLM during discovery, the authors agreed to name the best-performing one DiscoPOP, which seemed to us like a catchy abbreviation for Discovered Preference Optimization.

Member

I don't have a strong opinion on this, but LRML might be more informative than DiscoPOP. I mean, we could have a lot of different Discovered Preference Optimization losses.

use_weighting (`bool`, *optional*, defaults to `False`):
Whether or not to weight the loss as done in the [WPO](https://huggingface.co/papers/2406.11827) paper.
label_pad_token_id (`int`, *optional*, defaults to `-100`):
@@ -132,6 +133,9 @@ class DPOConfig(TrainingArguments):
α parameter from the [RPO](https://huggingface.co/papers/2404.19733) paper (v3), which controls the
weighting of the NLL term in the loss. If `None`, no weighting is applied and the loss is the same as the
DPO loss. The paper recommends `rpo_alpha=1.0`.
discopop_tau (`float`, *optional*, defaults to `0.05`):
tau/temperature parameter from the [DiscoPOP](https://huggingface.co/papers/2406.08414) paper, which controls
the shape of the log-ratio modulated loss. The paper recommends the default value `discopop_tau=0.05`.
"""

learning_rate: float = 1e-6
@@ -150,6 +154,7 @@ class DPOConfig(TrainingArguments):
"aot_pair",
"apo_zero",
"apo_down",
"discopop",
] = "sigmoid"
use_weighting: bool = False
label_pad_token_id: int = -100
@@ -176,6 +181,7 @@ class DPOConfig(TrainingArguments):
ref_model_mixup_alpha: float = 0.9
ref_model_sync_steps: int = 64
rpo_alpha: Optional[float] = None
discopop_tau: Optional[float] = 0.05

def __post_init__(self):
if self.max_target_length is not None:
17 changes: 16 additions & 1 deletion trl/trainer/dpo_trainer.py
@@ -1019,11 +1019,26 @@ def dpo_loss(
losses_chosen = F.sigmoid(self.beta * chosen_logratios)
losses_rejected = 1 - F.sigmoid(self.beta * (chosen_logratios - rejected_logratios))
losses = losses_chosen + losses_rejected

elif self.loss_type == "discopop":
# Eqn (5) of the DiscoPOP paper (https://huggingface.co/papers/2406.08414)
# This loss was discovered through LLM-driven objective discovery
pi_logratios = chosen_logps - rejected_logps
ref_logratios = ref_chosen_logps - ref_rejected_logps
logits = pi_logratios - ref_logratios
logits = logits * self.beta
# Modulate the mixing coefficient based on the log ratio magnitudes
log_ratio_modulation = torch.sigmoid(logits / self.args.discopop_tau)
logistic_component = -F.logsigmoid(logits)
exp_component = torch.exp(-logits)
# Blend between logistic and exponential component based on log ratio modulation
losses = logistic_component * (1 - log_ratio_modulation) + exp_component * log_ratio_modulation
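# In closed form, with rho = (chosen_logps - rejected_logps) - (ref_chosen_logps - ref_rejected_logps), this is
#   loss = (1 - sigmoid(beta * rho / tau)) * (-logsigmoid(beta * rho)) + sigmoid(beta * rho / tau) * exp(-beta * rho)
# i.e. a sigmoid-gated blend of the DPO-style logistic loss and an exponential loss, with tau = discopop_tau.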

else:
raise ValueError(
f"Unknown loss type: {self.loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'exo_pair', "
"'nca_pair', 'robust', 'bco_pair', 'sppo_hard', 'aot', 'aot_pair', 'apo_zero', 'apo_down']"
"'nca_pair', 'robust', 'bco_pair', 'sppo_hard', 'aot', 'aot_pair', 'apo_zero', 'apo_down', 'discopop']"
)

chosen_rewards = self.beta * (chosen_logps.to(device) - ref_chosen_logps.to(device)).detach()