In the end, what this means is that one can formulate a simple maximum likelihood objective over human preference data w.r.t. a parametrized policy $\pi_\theta$ and a reference policy $\pi^{\text{SFT}}$, completely removing the need for an explicit reward model and RL! Given a sample $(x, y_l, y_w)$, the DPO update rule increases the likelihood of the preferred completion $y_w$ and decreases the likelihood of the dispreferred completion $y_l$. Importantly, the examples are weighted by how much higher the implicit reward model $\beta \log \frac{\pi_\theta(y\mid x)}{\pi^{\text{SFT}}(y\mid x)}$ rates the dispreferred completion, i.e. by how incorrectly the implicit reward model orders the completions.
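
To make the update rule concrete, here is a minimal sketch of the DPO loss in PyTorch. The function name and the convention of passing per-sequence log-probabilities (summed over tokens) for the chosen and rejected completions are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities
    log pi(y | x), summed over tokens, for the preferred (chosen)
    and dispreferred (rejected) completions under the trained
    policy pi_theta and the frozen SFT reference pi^SFT.
    """
    # Implicit rewards: beta * log(pi_theta(y|x) / pi^SFT(y|x))
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid of the reward margin: the gradient pushes the chosen
    # completion up and the rejected one down, weighted by how badly
    # the implicit reward model currently orders the pair.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Note that the reference log-probabilities come from the frozen SFT model and carry no gradient; only $\pi_\theta$ is updated.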

## DPO vs IPO vs KTO
DPO's success has prompted the exploration of new loss functions, focusing on two shortcomings of DPO:

- **Robustness**. One shortcoming of DPO is that it is prone to overfitting on the preference dataset unless you perform early stopping. In response, DeepMind published [Identity Preference Optimization](https://arxiv.org/pdf/2310.12036.pdf), which adds a regularizing term to the DPO loss (see the sketch after this list).
- **Paired preference data**. Alignment methods typically require paired preference data $(x, y_w, y_l)$, and DPO is no different. Collecting this kind of data is, as we've repeated consistently, expensive and time-consuming. [Kahneman-Tversky Optimization](https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf) reformulates the loss function such that it depends entirely on individual examples that have been labeled as *good* or *bad*, which are much easier to acquire in practice.
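
For contrast, here is a sketch of the IPO objective on the same paired inputs, under the same illustrative conventions as the DPO snippet above (KTO's unpaired objective is omitted, since it additionally needs a reference point estimated from the batch):

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative IPO loss over a batch of preference pairs."""
    # Same log-ratio margin that DPO feeds through -logsigmoid(beta * margin).
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))

    # IPO instead regresses the margin toward 1 / (2 * beta): a squared
    # loss that keeps the margin bounded and acts as the regularizer,
    # rather than letting it grow without limit and overfit.
    return torch.mean((margin - 1.0 / (2.0 * beta)) ** 2)
```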

A team at Hugging Face recently published a comparison of these three alignment methods, evaluating their performance across a range of $\beta$ values. I found the post super interesting, so I'd like to share the results with you. The team aligned two SFT models, OpenHermes-2.5-Mistral-7B and Zephyr-7B-Beta; results below.

![Zephyr-7B-Beta: DPO vs IPO vs KTO across beta values](/images/zephyr-ktodpoipo.png)

![OpenHermes-2.5-Mistral-7B: DPO vs IPO vs KTO across beta values](/images/mistral-ktodpoipo.png)

Zephyr clearly benefits from a small $\beta$, where DPO is the strongest performer; across the full range of $\beta$ values, however, it's actually KTO that wins. On OpenHermes-2.5-Mistral-7B the results are far less conclusive, but overall it seems that DPO > KTO > IPO, with the $\beta$ sweet spot varying for each algorithm.

## Final thoughts

It's fairly rare to see such an innovative analytical derivation that eliminates entire steps of a learning pipeline. RLHF is inherently a brittle process that most of the open-source community has failed to adopt seamlessly. DPO appears to make this process a lot more robust, and it has taken over as the preferred fine-tuning step following SFT. It still requires human preference data, which can be expensive to obtain, but synthetic data is becoming more and more prominent in that regard. To finish off, I reiterate the beautiful title statement: **Your Language Model is Secretly a Reward Model**.