Releases: huggingface/trl
v0.11.1
Bug fix
Full Changelog: v0.11.0...v0.11.1
v0.11.0
We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:
New post-training methods
Generalized Knowledge Distillation

Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.
To train models with this method, check out the `GKDTrainer`.
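A minimal sketch of a GKD run (the checkpoints, dataset, and hyperparameter values below are placeholders, and keyword argument names may vary slightly between TRL versions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

# Placeholder checkpoints: any instruct-tuned student/teacher pair should work.
student_name = "Qwen/Qwen2-0.5B-Instruct"
teacher_name = "Qwen/Qwen2-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

# Placeholder dataset in conversational ("messages") format.
train_dataset = load_dataset("trl-lib/chatbot_arena_completions", split="train")

training_args = GKDConfig(
    output_dir="gkd-student",
    lmbda=0.5,           # fraction of on-policy (student-generated) batches
    beta=0.5,            # interpolation coefficient of the generalized JSD loss
    max_new_tokens=128,  # length of the student generations scored by the teacher
    per_device_train_batch_size=1,
)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # may be named `processing_class` in newer TRL versions
)
trainer.train()
```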
Exploratory Preference Optimization

Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.
To train models with this method, check out the `XPOTrainer`.
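A minimal sketch of an XPO run with a reward model providing the online feedback (placeholder checkpoints and dataset; argument names may differ slightly between TRL versions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import XPOConfig, XPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"    # placeholder policy
reward_name = "trl-lib/Qwen2-0.5B-Reward"  # placeholder reward model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)

# Prompt-only dataset (placeholder); the trainer samples completions online.
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

trainer = XPOTrainer(
    model=model,
    reward_model=reward_model,
    args=XPOConfig(output_dir="xpo-model", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```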
Nash Learning with Human Feedback

Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models, which are conditioned on two inputs instead of the single input used in reward models. These preference models are then used to train a policy that consistently produces responses preferred over those from competing policies, thus approximating a Nash equilibrium (i.e., of a two-player game where actions are responses and payoffs are given by the preference model).
To train models with this method, check out the `NashMDTrainer`.
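A minimal Nash-MD sketch along the same lines (again, checkpoints and dataset are placeholders and keyword names may vary slightly by version):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import NashMDConfig, NashMDTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")      # placeholder policy
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-lib/Qwen2-0.5B-Reward", num_labels=1                               # placeholder preference/reward model
)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train") # placeholder prompt-only dataset

trainer = NashMDTrainer(
    model=model,
    reward_model=reward_model,
    args=NashMDConfig(output_dir="nash-md-model", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```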
New trainer features
- Online DPO now supports training LoRA adapters with PEFT, which means you can dramatically reduce the amount of VRAM needed to train models with this method (see the sketch after this list). By @qgallouedec in #2041
- The `ORPOTrainer` has better integration with PyTorch XLA for faster step time on TPUs ⚡. By @wenxindongwork in #2001
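As a hedged illustration of the new PEFT support mentioned above, passing a `peft_config` to `OnlineDPOTrainer` trains LoRA adapters instead of the full model (checkpoints and dataset below are placeholders; keyword names may differ slightly between versions):

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")       # placeholder policy
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-lib/Qwen2-0.5B-Reward", num_labels=1                                # placeholder reward model
)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")  # placeholder prompt dataset

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=OnlineDPOConfig(output_dir="online-dpo-lora", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # train LoRA adapters only
)
trainer.train()
```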
Deprecations 🚨
- The `PPOTrainer` is marked for deprecation in favour of `PPOv2Trainer` to provide a consistent API across TRL's trainers. It will be removed in `v0.12.0`. By @qgallouedec in #2016
- The `RichProgressCallback` has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks. By @lewtun in #2053
Bugfixes and improvements
- Adds experimental Liger support to SFT script by @edbeeching in #1992
- move slow-tests CI to new cluster by @glegendre01 in #1996
- [Online-DPO] fixes to the training scripts and setup.py by @kashif in #1997
- [pre-commit] update pre-commit yaml by @kashif in #2002
- [Docs] Add Liger-Kernel usage to SFTTrainer page by @ryankert01 in #2007
- [ci] pin numpy to < 2 on windows by @kashif in #2009
- Remove `prompts` arg from `WinrateCallback` by @qgallouedec in #2010
- Allow `WinRateCallback` to be used without reference model by @qgallouedec in #2013
- Feat: Add support for APO-zero in KTOTrainer by @KarelDO in #1952
- Clean configs documentation by @qgallouedec in #1944
- Refactor reward modelling script to work with chat models by @lewtun in #2026
- correct formatting of star sign in kto_trainer.mdx by @mattany in #2031
- Remove unused functions in `core.py` by @northern-64bit in #2017
- Improves formatting of docstring + newlines by @northern-64bit in #2006
- Fix `packing` doc in `SFTConfig` and fix error when neither `dataset_text_field` nor `formatting_func` is provided. by @qgallouedec in #2035
- fix: unpackaging error in Custom Mixture of Experts model when `aux_loss_enabled` is set to True. by @Jonathanjordan21 in #2039
- Drop canonical namespaces by @qgallouedec in #2048
- Change `non_eos_penalty` to be consistent across `OnPolicy` trainers by @RylanSchaeffer in #2033
- Temporary pin the transformers hash in the CI by @qgallouedec in #2049
- [XPO] xpo trainer by @kashif in #1943
- Fix logits computation in KTO trainer prediction step by @issamemari in #2050
- [Draft, don't merge] Fix failing windows by @LysandreJik in #2051
- Clean up DPO example by @lewtun in #2043
- Remove `debug` and `sanity_check` args by @qgallouedec in #2055
- Gkd trainer by @kashif in #1814
- Documentation dataset format by @qgallouedec in #2020
- Add missing autodocs by @qgallouedec in #2056
- Mask loss in gkd when generating from the student by @gaetanlop in #2058
- ©️ Copyrights by @qgallouedec in #2063
- Support for `SFTTrainer.evaluate()` and `SFTTrainer.predict()` with null train_dataset by @Sohaib9920 in #2004
- make cuda-only tests device-agnostic by @faaany in #2044
- Make `ConstantLengthDataset` (or `packing=True`) shuffle examples before they are packed by @muupan in #2037
- Standardise API for `WinRateCallback` and `LogCompletionsCallback` by @lewtun in #2061
- Fix dataset in GKD script by @lewtun in #2067
- [online models] remove min_new_tokens=args.max_new_tokens by @kashif in #2069
- Standardising datasets for testing by @qgallouedec in #2065
- [KTO] learning rate recommendations for kto by @kashif in #2070
- Nash md by @kashif in #1853
- Use `transformers` utilities when possible by @qgallouedec in #2064
- Minor doc fixes and comments by @qgallouedec in #2073
- Added error check to RLOO, PPOv2, OnlineDPO that `ref_policy` and `policy` have different identities by @RylanSchaeffer in #2057
- `processor(prompt, images=image)` to `processor(images=image, text=prompt)` by @qgallouedec in #2076
- Use wrapped model for reference completions in `WinRateCallback` and set default `freq` to `eval_steps` in `LogCompletionsCallback` by @lewtun in #2074
- Conversational dataset support for Online DPO by @qgallouedec in #2075
- [WIP] Fix `logits/chosen` and `logits/rejected` metrics in `kto_trainer`. by @PhilipMay in #2077
- Standardize dataset naming by @qgallouedec in #2081
- Fix deepspeed for `PPOv2Trainer` by @qgallouedec in #2080
New Contributors
- @AdnaneKhan made their first contribution in #1822
- @mkopecki made their first contribution in #1825
- @DZ9 made their first contribution in #1836
- @MAOJIASONG made their first contribution in #1840
- @davanstrien made their first contribution in #1845
- @eliebak made their first contribution in #1863
- @Rishav-hub made their first contribution in #1862
- @cemiu made their first contribution in #1738
- @SunMarc made their first contribution in #1919
- @karel-contextual made their first contribution in #1928
- @RylanSchaeffer made their first contribution in #1932
- @mina-parham made their first contribution in https://github.com/huggingface/trl/pull...
v0.10.1
We are excited to introduce the new v0.10.1 release, with many new exciting features and post-training algorithms. The highlights are as follows:
Online DPO

Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:
- Eliminates the need for a pre-collected preference dataset (it's generated online)
- Enables continuous model improvement
- Yields better results than traditional DPO
To train models with this method, use the `OnlineDPOTrainer` (see the sketch below).
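A rough sketch of an Online DPO run (placeholder checkpoints and prompt dataset; the keyword names below follow later TRL versions and may differ in v0.10.1):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")       # placeholder policy
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-lib/Qwen2-0.5B-Reward", num_labels=1                                # placeholder reward model
)

# Prompt-only dataset (placeholder): completions are generated on the fly and
# the reward model picks the preferred one for the DPO update.
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=OnlineDPOConfig(output_dir="online-dpo", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```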
Liger Triton kernels for supercharged SFT
- We've integrated LinkedIn's Liger Triton kernels into the `SFTTrainer` for faster throughput and lower memory usage. To use them, set `use_liger_kernel` in `SFTConfig`, as in the sketch below.
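A hedged sketch of enabling the kernels (placeholder model and dataset; in some TRL/transformers versions the flag may be spelled `use_liger` instead of `use_liger_kernel`):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("stanfordnlp/imdb", split="train")  # placeholder text dataset

training_args = SFTConfig(
    output_dir="sft-liger",
    dataset_text_field="text",
    max_seq_length=512,
    use_liger_kernel=True,  # flag name as quoted above; requires `pip install liger-kernel`
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # placeholder; Liger patches a fixed set of architectures
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```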
DPO for VLMs
- We've added support to align vision-language models with DPO, now covering the LLaVA-1.5, PaliGemma, and Idefics2 architectures. To train VLMs with DPO, use the `dpo_visual.py` script as follows:
accelerate launch examples/scripts/dpo_visual.py \
--dataset_name HuggingFaceH4/rlaif-v_formatted \
--model_name_or_path google/paligemma-3b-pt-224 \
--trust_remote_code \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--output_dir dpo_paligemma_rlaif-v \
--bf16 \
--torch_dtype bfloat16
WinRate callback for LLM as a judge
- We've added support to compute win rates over the reference model for methods like DPO. To do so, configure the callback to point to the LLM-as-judge API (OpenAI or Hugging Face Inference API) and then add:
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
Anchored Preference Optimisation (APO) for fine-grained human/AI feedback
- Added the APO method, which is an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set `loss_type="apo_zero"` or `loss_type="apo_down"` in the `DPOConfig` (see the sketch below).
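A minimal sketch of selecting an APO variant (model, tokenizer, and dataset are placeholders):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder preference dataset with prompt / chosen / rejected columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="dpo-apo-zero",
    loss_type="apo_zero",  # or "apo_down" when the model is already stronger than the winning outputs
)

trainer = DPOTrainer(
    model=model,            # a reference model is created internally when none is passed
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```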
What's Changed
- Set dev version by @vwxyzjn in #1817
- Upgrade GitHub actions by @qgallouedec in #1818
- DPO Llava 1.5 and PaliGemma support by @qgallouedec in #1797
- Delete unused benchmark.yml workflow by @AdnaneKhan in #1822
- Consistent use of trust_remote_code by @qgallouedec in #1806
- Fix: authentication token kwarg not passed when loading PEFT adapters by @mkopecki in #1825
- refactor trainer callbacks by @kashif in #1826
- Uniform `model_ref` naming by @qgallouedec in #1835
- fix ppov2_trainer tensorboard logging bug by @DZ9 in #1836
- Fix issues of KTOTrainer by @MAOJIASONG in #1840
- add link to DPO datasets collection by @davanstrien in #1845
- fix arg parsing in chat.py by @lvwerra in #1846
- DPO for VLM blog post in doc by @qgallouedec in #1844
- Add WinRateCallback and Judges by @lewtun in #1598
- Remove `CI_HUB_USER_TOKEN` by @qgallouedec in #1852
- Online DPO and Online trainer refactor by @vwxyzjn in #1809
- [online-DPO] online dpo cleanups by @kashif in #1864
- arXiv to HF Papers by @qgallouedec in #1870
- fix fsdp & qlora support by @eliebak in #1863
- Import missing `setup_chat_format` by @Rishav-hub in #1862
- Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM by @Rishav-hub in #1861
- Small fixes to online dpo example by @edbeeching in #1879
- Skip BigBird save and load test until next transformers version by @qgallouedec in #1874
- Llama in modelling value head tests by @qgallouedec in #1878
- Improve judges by @qgallouedec in #1856
- [Do not merge] Re-add BigBird Pegasus save/load test by @qgallouedec in #1876
- Re-add BigBird Pegasus save/load test by @qgallouedec in #1882
- Move BCO to separate BCOTrainer with fixes by @claralp in #1869
- Update example overview documentation section by @qgallouedec in #1883
- fix dpo_trainer bug for LLMs without bos_token in config by @DZ9 in #1885
- Fix SFT for VLM example by @qgallouedec in #1865
- `evaluation_strategy` -> `eval_strategy` by @qgallouedec in #1894
- fix serialization of RunningMoments on multiple GPUs by @claralp in #1892
- [WIP] Fix CI by @qgallouedec in #1897
- Drop `setUpClass` in reward tester by @qgallouedec in #1895
- Support `IterableDataset` for `SFTTrainer` by @qgallouedec in #1899
- Fix data processing in ORPO example script by @qgallouedec in #1903
- [RPO] use loss from v3 of paper by @kashif in #1904
- Support Rank Stabilized LoRA in the ModelConfig/LoraConfig by @JohnGiorgi in #1877
- [Online-DPO] num_generation_per_prompt is fixed by @kashif in #1898
- Fix GPT2 sentiment notebook reward by @cemiu in #1738
- Fix `AlignPropTrainer` import by @qgallouedec in #1908
- Various args and test fix by @qgallouedec in #1909
- `lr_scheduler.step()` after `optimizer.step()` by @qgallouedec in #1918
- `torch.cuda.amp.autocast()` -> `torch.amp.autocast("cuda")` by @qgallouedec in #1921
- Fix orpo trainer loss device by @SunMarc in #1919
- Add transformers library name for TRL repos by @lewtun in #1922
- Standardize `dataset_num_proc` usage by @qgallouedec in #1925
- `PartialState().local_main_process_first()` when map in examples by @qgallouedec in #1926
- minor BCO fixes by @claralp in #1923
- Improve DPO/loss doc by @qgallouedec in #1929
- feat: anchored pref optimization by @karel-contextual in #1928
- Add tests for DPO for VLM by @qgallouedec in #1935
- fix model to save in ppov2 by @mnoukhov in #1776
- Optional Additional Loss to Center Reward Models' Outputs by @RylanSchaeffer in #1932
- Properly label all models when pushed to the hub by @qgallouedec in #1940
- Skip token in `push_to_hub` by @qgallouedec in #1945
- Fix model wrapping for online DPO by @lewtun in #1946
- Don't mark issues as stale if nobody answered by @qgallouedec in #1949
- Add a simple-to-understand example for online DPO by @vwxyzjn in #1947
- Log WandB tables on main process by @lewtun in #1951
- [ODPO] Fix global step for consistent checkpointing with global updates by @lewtun in #1950
- "help wanted" in label to exempt from stale by @qgallouedec in #1956
- Fix response truncation in examples/notebooks/gpt2-sentiment.ipynb by @qgallouedec in #1957
- [ODPO] Refactor training script to use messages API by @lewtun in #1958
- Support LLaVA-NeXT in Vision SFT by @qgallouedec in #1959
- Add i...
v0.9.6 release
We are excited to introduce the new v0.9.6 release, with many exciting new features and algorithms. The highlights are as follows:
- Support for SimPO by @fe1ixxu, a reference-free method that also regularizes output length. To use this loss, set `loss_type="simpo"` and `cpo_alpha=0` in the `CPOConfig` and use it with the `CPOTrainer` (see the sketch after this list).

- Added AlignProp by @mihirp1998, a method for finetuning Stable Diffusion models using reward gradients.
- Added Efficient Exact Optimization (EXO) by @haozheji
We also included many important fixes and improvements such as fixing prints in the CLI with GCP containers by @alvarobartt. Enjoy the release!
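A minimal sketch of the SimPO setting described above (placeholder model and dataset; argument names may differ slightly between TRL versions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder preference dataset with prompt / chosen / rejected columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = CPOConfig(
    output_dir="cpo-simpo",
    loss_type="simpo",
    cpo_alpha=0.0,  # setting the CPO regularizer to zero recovers plain SimPO
)

trainer = CPOTrainer(
    model=model,  # SimPO is reference-free, so no ref_model is needed
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```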
What's Changed
- set dev version by @younesbelkada in #1710
- Add a variant of CPO, SimPO by @fe1ixxu in #1703
- [RPO] fix nll loss by @kashif in #1705
- fix yaml parser for derived config classes by @mnoukhov in #1713
- Fix default padding_value in dpo_config.py by @mnoukhov in #1692
- feat(ci): add trufflehog secrets detection by @McPatate in #1721
- ktotrainer: Refuse datasets which contain only one class of labels by @jetlime in #1724
- adds AOT by @imelnyk in #1701
- Workflow: Notify tests results on slack channel by @younesbelkada in #1744
- better trl parser with yaml config by @mnoukhov in #1739
- CI / core: Pin `numpy` to `!=2.0.0` for CI and to users by @younesbelkada in #1747
- `TrlParser`: Add ignore extra args option by @younesbelkada in #1748
- small KTO fixes by @kawine in #1734
- CPO / DPO: Fix red CI by @younesbelkada in #1749
- prepare deepspeed accommodate fp16 and bf16 by @mnoukhov in #1728
- CI / `KTOTrainer`: Remove old tests by @younesbelkada in #1750
- change the `process` function in the example of DPO by @AIR-hl in #1753
- Integrate f-divergence to DPO (Follow up) by @1485840691 in #1610
- Support for returning past_key_values from the model by @idanshen in #1742
- Fix masking of response tokens by @mertsayar8 in #1718
- Support num_train_epochs by @vwxyzjn in #1743
- Fix: Add dataset_text_field in examples/scripts/sft.py by @scottsuk0306 in #1758
- New sentiment and descriptiveness dataset by @vwxyzjn in #1757
- Add CPO-SimPO method by @fe1ixxu in #1760
- Added Reward Backpropagation Support by @mihirp1998 in #1585
- MoE Models: option to add load balancing loss by @claralp in #1765
- `evaluation_strategy` to `eval_strategy` by @qgallouedec in #1771
- add Efficient Exact Optimization (EXO) by @haozheji in #1735
- Remove the leading space in the tldr preference dataset by @vwxyzjn in #1773
- Fix Documentation Overflow Issues for Long URLs in SFTConfig by @Mubin17 in #1774
- Visual DPO by @qgallouedec in #1647
- [DOCS] fix docs and cli example script by @kashif in #1780
- Fixed typo in SFT trainer docs by @detsutut in #1788
- [SFT] add model_init_kwargs to training_args by @kashif in #1787
- Bugfix: Preserve token fields when converting TrainingArguments to SFTConfig by @noahlt in #1794
- Clean examples by @qgallouedec in #1791
- Remove extra print in reward_trainer.py by @mnoukhov in #1799
- Fix `torch_dtype` handling in `{DPO,SFT}Trainer` when provided via CLI by @alvarobartt in #1807
- Fix `TRL_USE_RICH` environment variable handling by @alvarobartt in #1808
- 0.9.6 release by @vwxyzjn in #1816
New Contributors
- @McPatate made their first contribution in #1721
- @jetlime made their first contribution in #1724
- @imelnyk made their first contribution in #1701
- @AIR-hl made their first contribution in #1753
- @1485840691 made their first contribution in #1610
- @idanshen made their first contribution in #1742
- @mertsayar8 made their first contribution in #1718
- @scottsuk0306 made their first contribution in #1758
- @mihirp1998 made their first contribution in #1585
- @haozheji made their first contribution in #1735
- @Mubin17 made their first contribution in #1774
- @detsutut made their first contribution in #1788
- @noahlt made their first contribution in #1794
Full Changelog: v0.9.4...v0.9.6
v0.9.4
Mainly backward compatibility fixes with SFTTrainer.
What's Changed
- Fixed doc string and related docs for the SFTConfig update by @GuilhermeFreire in #1706
- SFTTrainer: Fix backward compatibility issue with `TrainingArguments` by @younesbelkada in #1707
- 0.9.4 release by @vwxyzjn in #1708
New Contributors
- @GuilhermeFreire made their first contribution in #1706
Full Changelog: v0.9.3...v0.9.4
v0.9.3 RLOO / PPOv2 Trainer, RM Visualization
We are excited to introduce the new v0.9.3 release, with many exciting new features and algorithms. The highlights are as follows:
- RLOO Trainer: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al. from Cohere. Check out our docs here to get started; a rough sketch is also shown after this list.
- PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer which is more aligned with OpenAI's PPO implementation based on https://arxiv.org/abs/2403.17031. Check out our docs here to get started
- Reward model visualization: reward model training now includes visualization on the eval dataset.
- New losses in the DPO Trainer: DPOTrainer now includes losses / support for Self-play Preference Optimization, Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment
- New losses in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO)
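A rough sketch of an RLOO run (placeholder checkpoints and a toy prompt dataset; the `policy`/`ref_policy`/`config` keyword names follow this era's docs and may differ in later releases):

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RLOOConfig, RLOOTrainer

base = "EleutherAI/pythia-1b-deduped"  # placeholder checkpoints
tokenizer = AutoTokenizer.from_pretrained(base, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(base)
ref_policy = AutoModelForCausalLM.from_pretrained(base)
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# Tiny toy prompt dataset tokenized to `input_ids`, mirroring the documented rloo.py example.
prompts = ["The movie was", "I really enjoyed the meal because"]
train_dataset = Dataset.from_dict({"input_ids": [tokenizer(p)["input_ids"] for p in prompts]})

trainer = RLOOTrainer(
    config=RLOOConfig(output_dir="rloo-model", per_device_train_batch_size=1),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # toy reuse of the same prompts for evaluation
)
trainer.train()
```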
What's Changed
- set dev version by @younesbelkada in #1568
- fix add_special_tokens issue for data with template by @edixiong in #1509
- [DPO] add 'bco_pair' loss_type by @seanexp in #1524
- [DPO] DPOConfig class by @kashif in #1554
- [SFT] add SFT Trainer Config dataclass by @kashif in #1530
- FIX: Fix CI on transformers main by @younesbelkada in #1576
- [`SFTTrainer`] Add warning in SFTTrainer when dataset already processed by @younesbelkada in #1577
- Fix typo detoxifying doc by @qgallouedec in #1594
- Core: removed unexisting `SftArgumentParser` by @younesbelkada in #1602
- [`KTOTrainer`] add BCO (reward shift and underlying distribution matching) by @seanexp in #1599
- [CLI] Use auto device map for model load by @lewtun in #1596
- Removing `tests/` from package data by @jamesbraza in #1607
- Docs: Fix build main documentation by @younesbelkada in #1604
- support loss function for Self-play Preference Optimization by @winglian in #1612
- Update HH dataset on helpful only subset by @vwxyzjn in #1613
- corrects loss function for Self-play Preference Optimization hard label version by @angelahzyuan in #1615
- Fix ZeRO-3 generation context manager by @lewtun in #1617
- fixed adding bos and eos token unconditionally by @jasonyux in #1591
- visualize rm prediction by @vwxyzjn in #1636
- [ORPO] Correct label mask for pad tokens by @IlyaGusev in #1625
- Update sft_llama2.py to work with the latest API by @xianbaoqian in #1637
- Fixed wrong logs prefixes in KTOTrainer by @bartoszzuk in #1641
- Pairwise Noise Contrastive Alignment by @winglian in #1632
- don't cast the trainable lora layers to half precision by @pacman100 in #1644
- PPO / Reinforce Trainers by @vwxyzjn in #1540
- Apply deprecated `evaluation_strategy` by @muellerzr in #1559
- FEAT: Add support for training collator in PPOTrainer by @younesbelkada in #1658
- Correct Documentation for cDPO Usage by @AliBakly in #1655
- Fix inheritance order in PPOv2Config by @Nicolinho in #1659
- [DPO] Add 'robust' loss_type by @Abilityguy in #1653
- 🤫 TR-DPO implementation by @syrn1k in #1593
- Do not upcast adapters when using FSDP+QLoRA by @pacman100 in #1654
- [Tests] update eval_strategy API by @kashif in #1662
- Fix ppov2 test case by @vwxyzjn in #1661
- FIX / PPO: Fix `enable_input_require_grads` issues with PPO models by @younesbelkada in #1664
- fix dataset load error by @sywangyi in #1670
- FIX / SFTTrainer: Fix SFTTrainer with `args=None` by @younesbelkada in #1678
- Fix max_completion_length for encoder_decoder models in KTO Trainer by @samuki in #1588
- initial RPO loss by @kashif in #1686
- Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig by @alexisrozhkov in #1690
- Skip packing validation by @alex-jw-brooks in #1673
- Fix typo in DPOTrainer's warnings by @qgallouedec in #1688
- Quick fix on GPT4-eval by @vwxyzjn in #1696
- Release 0.9.2 by @vwxyzjn in #1697
New Contributors
- @edixiong made their first contribution in #1509
- @seanexp made their first contribution in #1524
- @jamesbraza made their first contribution in #1607
- @winglian made their first contribution in #1612
- @angelahzyuan made their first contribution in #1615
- @jasonyux made their first contribution in #1591
- @IlyaGusev made their first contribution in #1625
- @xianbaoqian made their first contribution in #1637
- @bartoszzuk made their first contribution in #1641
- @muellerzr made their first contribution in #1559
- @AliBakly made their first contribution in #1655
- @Nicolinho made their first contribution in #1659
- @Abilityguy made their first contribution in #1653
- @syrn1k made their first contribution in #1593
- @alexisrozhkov made their first contribution in #1690
- @alex-jw-brooks made their first contribution in #1673
Full Changelog: v0.8.6...v0.9.2
v0.8.6: Fixes for CLI
What's Changed
- set dev version by @younesbelkada in #1556
- [CLI] Update `__init__.py` imports by @kashif in #1557
- CLI: Add warning when ignored params are passed + parse config file if config is passed by @younesbelkada in #1565
- Release: v0.8.6 by @younesbelkada in #1567
Full Changelog: v0.8.5...v0.8.6
v0.8.5: Important fixes for CLIs
What's Changed
- set dev version by @younesbelkada in #1548
- FIX: make the train / test fields modulable by @younesbelkada in #1551
- enable multiple eos tokens by @lvwerra in #1553
- Release: v0.8.5 by @younesbelkada in #1555
Full Changelog: v0.8.4...v0.8.5
v0.8.4: CLI / CPO / KTO important fixes
This patch release includes important fixes for the CLI and KTO & CPO trainers
What's Changed
- set dev version by @younesbelkada in #1529
- [CPO] fix memory leak due to retained value by @kashif in #1531
- VSFT hotfix - adds gen prompt to template and processor to hub by @edbeeching in #1532
- save_model -> save_pretrained in ppo_trainer.mdx by @ejmejm in #1537
- [KTO] support to load the adapter twice by @claralp in #1542
- CLI: Set `dataset_text_field` to `None` to allow ChatML automatic template by @younesbelkada in #1545
- FIX: Fix slow test by @younesbelkada in #1546
- Fixed ref model not used in PPO generation by @ejmejm in #1534
- Release: v0.8.4 by @younesbelkada in #1547
New Contributors
Full Changelog: v0.8.3...v0.8.4
v0.8.3: Patch release for CLI
What's Changed
This is a patch release that includes an import fix for CLIs
- set dev version by @younesbelkada in #1523
- [CLI] fix imports by @kashif in #1527
- Release: v0.8.3 by @younesbelkada in #1528
Full Changelog: v0.8.2...v0.8.3