In the end, what this means is that one can formulate a simple maximum likelihood objective over human preference data w.r.t. a parametrized policy $\pi_\theta$ and a reference policy $\pi^{\text{SFT}}$, completely removing the need for an explicit reward model and RL! Given a sample $(x, y_l, y_w)$, the DPO update rule increases the likelihood of the preferred completion $y_w$ and decreases the likelihood of the dispreferred completion $y_l$. Importantly, the examples are weighted by how much higher the implicit reward model $\beta \log \frac{\pi_\theta(y\mid x)}{\pi^{\text{SFT}}(y\mid x)}$ rates the dispreferred completion, i.e. by how incorrectly the implicit reward model orders the completions.
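
To make the update rule concrete, here is a minimal sketch of the DPO loss in PyTorch. The function name and the convention of passing per-sequence log-probabilities (summed over tokens) for the chosen and rejected completions are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities
    log pi(y | x), summed over tokens, for the preferred (chosen)
    and dispreferred (rejected) completions under the trained
    policy pi_theta and the frozen SFT reference pi^SFT.
    """
    # Implicit rewards: beta * log(pi_theta(y|x) / pi^SFT(y|x))
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid of the reward margin: the gradient pushes the chosen
    # completion up and the rejected one down, weighted by how badly
    # the implicit reward model currently orders the pair.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Note that the reference log-probabilities come from the frozen SFT model and carry no gradient; only $\pi_\theta$ is updated.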

## DPO vs IPO vs KTO
DPO's success has prompted the exploration of new loss functions, focusing on two shortcomings of DPO:

- **Robustness**. One shortcoming of DPO is that it is prone to overfitting on the preference dataset unless you perform early stopping. In response, DeepMind published [Identity Preference Optimization](https://arxiv.org/pdf/2310.12036.pdf), which adds a regularizing term to the DPO loss (see the sketch after this list).
- **Paired preference data**. Alignment methods typically require paired preference data $(x, y_w, y_l)$, and DPO is no different. Collecting this kind of data is, as we've repeated consistently, expensive and time-consuming. [Kahneman-Tversky Optimization](https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf) reformulates the loss function such that it depends entirely on individual examples that have been labeled as *good* or *bad*, which are much easier to acquire in practice.
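
For contrast, here is a sketch of the IPO objective on the same paired inputs, under the same illustrative conventions as the DPO snippet above (KTO's unpaired objective is omitted, since it additionally needs a reference point estimated from the batch):

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative IPO loss over a batch of preference pairs."""
    # Same log-ratio margin that DPO feeds through -logsigmoid(beta * margin).
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))

    # IPO instead regresses the margin toward 1 / (2 * beta): a squared
    # loss that keeps the margin bounded and acts as the regularizer,
    # rather than letting it grow without limit and overfit.
    return torch.mean((margin - 1.0 / (2.0 * beta)) ** 2)
```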

A team at Hugging Face recently published a comparison of these three alignment methods, evaluating their performance across a range of $\beta$ values. I found the post super interesting, so I'd like to share the results with you. The team aligned two SFT models, OpenHermes-2.5-Mistral-7B and Zephyr-7B-Beta; results below.

![Zephyr-7B-Beta: DPO vs IPO vs KTO across beta values](/images/zephyr-ktodpoipo.png)

![OpenHermes-2.5-Mistral-7B: DPO vs IPO vs KTO across beta values](/images/mistral-ktodpoipo.png)

Zephyr clearly benefits from a small $\beta$, where DPO is the strongest performer; across the full range of $\beta$ values, however, it's actually KTO that wins. On OpenHermes-2.5-Mistral-7B the results are far less conclusive, but overall it seems that DPO > KTO > IPO, with the $\beta$ sweet spot varying for each algorithm.

## Final thoughts

It's fairly rare to see such an innovative analytical derivation that eliminates entire steps of a learning pipeline. RLHF is inherently a brittle process that most of the open-source community has failed to adopt seamlessly. DPO appears to make this process a lot more robust, and it has taken over as the preferred fine-tuning step following SFT. It still requires human preference data, which can be expensive to obtain, but synthetic data is becoming more and more prominent in that regard. To finish off, I reiterate the beautiful title statement: **Your Language Model is Secretly a Reward Model**.