
Releases: huggingface/trl

v0.11.1

24 Sep 16:13

Bug fix

  • allow parse-args as list of floats for Online DPO, XPO and Nash-MD configs by @kashif in #2108

Full Changelog: v0.11.0...v0.11.1

v0.11.0

19 Sep 08:46

We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:

New post-training methods

Generalized Knowledge Distillation


Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.

To train models with this method, check out the GKDTrainer.
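
A minimal sketch of what a GKD run might look like; the model and dataset identifiers are placeholders, and the lmbda/beta hyperparameters and the tokenizer keyword are assumptions about the GKDConfig/GKDTrainer interface rather than details from this note:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

# Student and teacher models (placeholder identifiers)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Conversational ("messages") dataset; replace with your own
train_dataset = load_dataset("your_dataset_name", split="train")

training_args = GKDConfig(
    output_dir="gkd-model",
    lmbda=0.5,  # fraction of on-policy (student-generated) batches
    beta=0.5,   # interpolation coefficient of the generalized JSD loss
)
trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()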

Exploratory Preference Optimization


Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.

To train models with this method, check out the XPOTrainer.
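
A minimal sketch of wiring up the trainer with a reward model; the constructor keywords (model/ref_model/reward_model/args/tokenizer) are assumptions based on TRL's online-trainer interface, and the model/dataset identifiers are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import XPOConfig, XPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")      # policy to train
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # frozen reference
reward_model = AutoModelForSequenceClassification.from_pretrained("your-reward-model", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

train_dataset = load_dataset("your_prompt_dataset", split="train")  # prompt-only dataset

trainer = XPOTrainer(
    model=model,
    ref_model=ref_model,
    reward_model=reward_model,
    args=XPOConfig(output_dir="xpo-model"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()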

Nash Learning with Human Feedback


Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models, which are conditioned on two inputs instead of the single one used in reward models. These preference models are then used to train a policy that consistently produces responses that are preferred over those from competing policies, thus approximating a Nash equilibrium of a two-player game in which actions are responses and payoffs are given by the preference model.

To train models with this method, check out the NashMDTrainer.
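
A minimal sketch along the same lines as the XPO example above; the keyword names are assumptions about the shared online-trainer interface, and the identifiers are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import NashMDConfig, NashMDTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")      # policy to train
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # frozen reference
reward_model = AutoModelForSequenceClassification.from_pretrained("your-reward-model", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

train_dataset = load_dataset("your_prompt_dataset", split="train")  # prompt-only dataset

trainer = NashMDTrainer(
    model=model,
    ref_model=ref_model,
    reward_model=reward_model,  # scores completions pairwise, as described above
    args=NashMDConfig(output_dir="nash-md-model"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()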

New trainer features

  • Online DPO now supports training LoRA adapters with PEFT, which means you can dramatically reduce the amount of VRAM needed to train models with this method (see the sketch after this list). By @qgallouedec in #2041
  • The ORPOTrainer has better integration with PyTorch XLA for faster step time on TPUs ⚡. By @wenxindongwork in #2001
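
A minimal sketch of the LoRA setup for Online DPO; the peft_config keyword and the rest of the constructor are assumptions based on TRL's usual PEFT integration, and the identifiers are placeholders:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("your-reward-model", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("your_prompt_dataset", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=OnlineDPOConfig(output_dir="online-dpo-lora"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,  # only the LoRA adapter weights are trained, cutting VRAM usage
)
trainer.train()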

Deprecations 🚨

  • The PPOTrainer is marked as deprecated in favour of PPOv2Trainer to provide a consistent API across TRL's trainers. It will be removed in v0.12.0. By @qgallouedec in #2016
  • The RichProgressCallback has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks (see the sketch after this list). By @lewtun in #2053
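
If you still want the rich progress bars, a minimal sketch of adding the callback manually, assuming RichProgressCallback is importable from the top-level trl package (model and dataset identifiers are placeholders):

from datasets import load_dataset
from trl import RichProgressCallback, SFTConfig, SFTTrainer

train_dataset = load_dataset("your_dataset_name", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # SFTTrainer also accepts a model id string
    args=SFTConfig(output_dir="sft-model"),
    train_dataset=train_dataset,
)
trainer.add_callback(RichProgressCallback())  # re-attach the callback manually
trainer.train()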

Bugfixes and improvements

New Contributors


v0.10.1

29 Aug 14:34

We are excited to introduce the new v0.10.1 release, with many new exciting features and post-training algorithms. The highlights are as follows:

Online DPO


Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:

  • Eliminates the need for a pre-collected preference dataset (it's generated online)
  • Enables continuous model improvement
  • Yields better results than traditional DPO

To train models with this method, use the OnlineDPOTrainer.
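
A minimal sketch of an Online DPO run; the constructor keywords shown here follow the interface of later TRL releases and are assumptions (the exact v0.10.1 signature may differ), and the model/dataset identifiers are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # policy to improve
reward_model = AutoModelForSequenceClassification.from_pretrained("your-reward-model", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

train_dataset = load_dataset("your_prompt_dataset", split="train")  # prompts only; completions are generated online

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,  # selects the preferred completion out of the two generated per prompt
    args=OnlineDPOConfig(output_dir="online-dpo-model"),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()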

Liger Triton kernels for supercharged SFT


  • We've integrated LinkedIn's Liger Triton kernels into the SFTTrainer for higher throughput and lower memory usage. To use them, set use_liger_kernel=True in SFTConfig (see the sketch below).
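
A minimal sketch, assuming use_liger_kernel is a boolean flag and using placeholder model/dataset identifiers (Liger patches a specific set of supported architectures, so check its documentation for coverage):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("your_dataset_name", split="train")  # placeholder dataset id

training_args = SFTConfig(
    output_dir="sft-liger",
    use_liger_kernel=True,  # swap in Liger's fused Triton kernels
)
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # placeholder; must be an architecture Liger can patch
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()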

DPO for VLMs

  • We've added support for aligning vision-language models with DPO, covering the LLaVA-1.5, PaliGemma, and Idefics2 architectures. To train VLMs with DPO, use the dpo_visual.py script as follows:
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16

WinRate callback for LLM as a judge

  • We've added support for computing win rates over the reference model for methods like DPO. To do so, configure the callback to point to an LLM-as-a-judge API (OpenAI or the Hugging Face Inference API) and then add:
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
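
The judge side of the wiring is not shown above; a hedged sketch of what it might look like, assuming TRL's HfPairwiseJudge (or OpenAIPairwiseJudge) classes and a judge keyword on WinRateCallback:

from trl import HfPairwiseJudge, WinRateCallback

# HfPairwiseJudge calls the Hugging Face Inference API; OpenAIPairwiseJudge targets OpenAI.
judge = HfPairwiseJudge()

# `trainer` is the DPOTrainer configured above; the callback generates completions from the
# policy and the reference model at evaluation time and asks the judge which one wins.
win_rate_callback = WinRateCallback(judge=judge, trainer=trainer)
trainer.add_callback(win_rate_callback)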

Anchored Preference Optimisation (APO) for fine-grained human/AI feedback

  • Added the APO method, an "anchored" version of the alignment objective with two variants: apo_zero and apo_down. The apo_zero loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. In contrast, apo_down decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs, which is more effective when the model is better than the winning outputs. To use these losses, set loss_type="apo_zero" or loss_type="apo_down" in the DPOConfig (see the sketch below).
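
A minimal sketch of selecting the APO loss; only the loss_type values come from this note, while the model and dataset identifiers are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("your_preference_dataset", split="train")    # chosen/rejected pairs

training_args = DPOConfig(
    output_dir="dpo-apo",
    loss_type="apo_zero",  # or "apo_down", depending on model quality vs. the winning outputs
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()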

What's Changed


v0.9.6 release

08 Jul 13:51

We are excited to introduce the new v0.9.6 release, with many exciting new features and algorithms. The highlights are as follows:

  • Support for SimPO by @fe1ixxu, a reference-free method that also regularizes output length. To use this loss, set loss_type="simpo" and cpo_alpha=0 in the CPOConfig and use it with the CPOTrainer (see the sketch below).
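
A minimal sketch of the SimPO setup; the loss_type and cpo_alpha values come from this note, while the model and dataset identifiers are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("your_preference_dataset", split="train")    # chosen/rejected pairs

training_args = CPOConfig(
    output_dir="simpo-model",
    loss_type="simpo",
    cpo_alpha=0.0,  # disables the CPO behaviour-cloning term so the loss reduces to SimPO
)
trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()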

We also included many important fixes and improvements, such as a fix for prints in the CLI on GCP containers by @alvarobartt. Enjoy the release!

What's Changed

New Contributors

Full Changelog: v0.9.4...v0.9.6

v0.9.4

06 Jun 14:17

Mainly backward-compatibility fixes for the SFTTrainer.

What's Changed

New Contributors

Full Changelog: v0.9.3...v0.9.4

v0.9.3 RLOO / PPOv2 Trainer, RM Visualization

05 Jun 16:08

We are excited to introduce the new v0.9.3 release, with many exciting new features and algorithms. The highlights are as follows:

  1. RLOO Trainer: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al. from Cohere. Check out our docs here to get started (a minimal sketch also follows this list).
  2. PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer, which is more closely aligned with OpenAI's PPO implementation, based on https://arxiv.org/abs/2403.17031. Check out our docs here to get started.
  3. Reward model visualization: reward model training now includes visualization of predictions on the eval dataset.
  4. New losses in the DPO Trainer: DPOTrainer now includes losses / support for Self-play Preference Optimization, Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment.
  5. New losses in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO).
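
A minimal sketch of an RLOO run; the constructor keywords (config/tokenizer/policy/ref_policy/reward_model) are assumptions about the RLOOTrainer interface, the model and dataset identifiers are placeholders, and prompt preprocessing is left out:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b-deduped")  # placeholder ids
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
ref_policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
reward_model = AutoModelForSequenceClassification.from_pretrained("your-reward-model", num_labels=1)

train_dataset = load_dataset("your_prompt_dataset", split="train")  # prompts, tokenized beforehand

trainer = RLOOTrainer(
    config=RLOOConfig(output_dir="rloo-model"),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()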

What's Changed

New Contributors

Full Changelog: v0.8.6...v0.9.2

v0.8.6: Fixes for CLI

22 Apr 08:59

What's Changed

Full Changelog: v0.8.5...v0.8.6

v0.8.5: Important fixes for CLIs

18 Apr 11:58

What's Changed

Full Changelog: v0.8.4...v0.8.5

v0.8.4: CLI / CPO / KTO important fixes

17 Apr 15:22

This patch release includes important fixes for the CLI and the KTO & CPO trainers.

What's Changed

New Contributors

Full Changelog: v0.8.3...v0.8.4

v0.8.3: Patch release for CLI

12 Apr 10:25

What's Changed

This is a patch release that includes an import fix for the CLIs.

Full Changelog: v0.8.2...v0.8.3