diff --git a/docs/source/dpo_trainer.mdx b/docs/source/dpo_trainer.mdx
index 724992c6c7..b86e498da1 100644
--- a/docs/source/dpo_trainer.mdx
+++ b/docs/source/dpo_trainer.mdx
@@ -28,7 +28,7 @@ The DPO trainer expects a very specific format for the dataset. Since the model
 
-Therefore the final dataset object should contain these 3 entries if you use the default `DPODataCollatorWithPadding` data collator. The entries should be named:
+Therefore the final dataset object should contain these 3 entries if you use the default [`DPODataCollatorWithPadding`] data collator. The entries should be named:
 
 - `prompt`
 - `chosen`
@@ -70,7 +70,7 @@ dpo_dataset_dict = {
 
 where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. As can be seen a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays.
 
-`DPOTrainer` can be used to fine-tune visual language models (VLMs). In this case, the dataset must also contain the key `images`, and the trainer's `tokenizer` is the VLM's `processor`. For example, for Idefics2, the processor expects the dataset to have the following format:
+[`DPOTrainer`] can be used to fine-tune visual language models (VLMs). In this case, the dataset must also contain the key `images`, and the trainer's `tokenizer` is the VLM's `processor`. For example, for Idefics2, the processor expects the dataset to have the following format:
 
 Note: Currently, VLM support is exclusive to Idefics2 and does not extend to other VLMs.
 
@@ -101,7 +101,7 @@ The DPO trainer expects a model of `AutoModelForCausalLM` or `AutoModelForVision
 
 ## Using the `DPOTrainer`
 
-For a detailed example have a look at the `examples/scripts/dpo.py` script. At a high level we need to initialize the `DPOTrainer` with a `model` we wish to train, a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response, the `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (ie decoder only or encoder-decoder).
+For a detailed example, have a look at the `examples/scripts/dpo.py` script. At a high level, we need to initialize the [`DPOTrainer`] with the `model` we wish to train and a reference `ref_model`, which is used to calculate the implicit rewards of the preferred and rejected responses; the `beta` hyperparameter scales the implicit reward, and the dataset must contain the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e. decoder-only or encoder-decoder).
 
 ```py
 training_args = DPOConfig(
@@ -126,27 +126,27 @@ Note that the `beta` is the temperature parameter for the DPO loss, typically so
 
 ## Loss functions
 
-Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the DPO authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression.
+Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. To use this loss, set the `loss_type="sigmoid"` (default) in the [`DPOConfig`].
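+
+For illustration, here is a minimal, self-contained sketch of this default `"sigmoid"` loss (not the trainer's internal implementation); the `*_logps` arguments are stand-ins for per-sequence log probabilities under the policy and reference models:
+
+```py
+import torch.nn.functional as F
+
+def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
+                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
+    # Implicit reward margin: difference of the policy/reference log-ratios
+    chosen_logratios = policy_chosen_logps - ref_chosen_logps
+    rejected_logratios = policy_rejected_logps - ref_rejected_logps
+    logits = chosen_logratios - rejected_logratios
+    # Logistic regression on the margin: -log sigmoid(beta * logits)
+    return -F.logsigmoid(beta * logits).mean()
+```
+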
-The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. The `DPOTrainer` can be switched to this loss via the `loss_type="hinge"` argument and the `beta` in this case is the reciprocal of the margin.
+The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. To use this loss, set the `loss_type="hinge"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the margin.
 
-The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss which can be used via the `loss_type="ipo"` argument to the trainer. Note that the `beta` parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair and thus the smaller the `beta` the larger this gaps is. As per the paper the loss is averaged over log-likelihoods of the completion (unlike DPO which is summed only).
+The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms, identify an issue with overfitting, and propose an alternative loss. To use this loss, set the `loss_type="ipo"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta` the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike DPO, which only sums them).
 
-The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability that can be passed to the `DPOTrainer` via `label_smoothing` argument (between 0 and 0.5) and then a conservative DPO loss is used. Pass the `label_smoothing` argument to the trainer to use it (default is 0).
+The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0).
 
-The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, assume that the preference labels are noisy with some probability that can be passed to the `DPOTrainer` via `label_smoothing` argument (between 0 and 0.5). Use `loss_type="robust"` to the trainer to use it.
+The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO, which corresponds to forward KL. To use this loss, set the `loss_type="exo_pair"` in the [`DPOConfig`]. Setting a non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large.
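+
+For example, all of the loss variants above are selected purely through [`DPOConfig`]; the snippet below is an illustrative sketch (the output directory is a placeholder and the values are not tuned recommendations):
+
+```py
+from trl import DPOConfig
+
+training_args = DPOConfig(
+    output_dir="dpo-output",  # placeholder
+    beta=0.1,
+    loss_type="exo_pair",     # or e.g. "hinge", "ipo", "robust"
+    label_smoothing=1e-3,     # used by the cDPO / Robust DPO and pairwise EXO variants
+)
+```
+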
-The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO which corresponds to forward KL. Setting `loss_type`=`exo_pair` and a non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large.
+The [NCA](https://huggingface.co/papers/2402.05369) authors show that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. To use this loss, set the `loss_type="nca_pair"` in the [`DPOConfig`].
 
-The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. The `DPOTrainer` can be switched to this loss via the `loss_type="bco_pair"` argument.
+The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0) and set the `loss_type="robust"` in the [`DPOConfig`].
 
-The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation using loss_type="sppo_hard" approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser.
+The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. To use this loss, set the `loss_type="bco_pair"` in the [`DPOConfig`].
 
-The [NCA](https://huggingface.co/papers/2402.05369) authors shows that NCA optimizes the absolute likelihood for each response rather than the relative likelihood.
+The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback, set `sync_ref_model=True` in the [`DPOConfig`].
 
-The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback use the `sync_ref_model` flag in the `DPOConfig`.
+The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this weight to 1.0.
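+
+As an illustrative sketch, these options are likewise toggled through [`DPOConfig`] (the output directory is a placeholder, the synchronization values mirror the defaults, and `rpo_alpha=1.0` follows the paper's suggestion):
+
+```py
+from trl import DPOConfig
+
+training_args = DPOConfig(
+    output_dir="dpo-output",   # placeholder
+    sync_ref_model=True,       # TR-DPO: periodically sync the reference model weights
+    ref_model_mixup_alpha=0.9,
+    ref_model_sync_steps=64,
+    rpo_alpha=1.0,             # RPO: weight of the SFT term on the chosen completions
+)
+```
+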
 
-The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss set the `rpo_alpha` in the `DPOConfig` to an appropriate value. The paper suggests setting this weight to 1.0.
+The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. To use this loss, set the `loss_type="sppo_hard"` in the [`DPOConfig`].
 
 The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size.
 
@@ -240,7 +240,7 @@ However, after using this approach, you will have an unquantized base model. The
 
 ### Using option 3 - load the adapter twice
 
-To avoid the downsides with option 2, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in DPOTrainer.
+To avoid the downsides with option 2, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in [`DPOTrainer`].
 
 For example:
diff --git a/trl/trainer/dpo_config.py b/trl/trainer/dpo_config.py
index 8167d913a1..b5097bf4df 100644
--- a/trl/trainer/dpo_config.py
+++ b/trl/trainer/dpo_config.py
@@ -35,67 +35,77 @@ class DPOConfig(TrainingArguments):
     Initialize DPOConfig.
 
     Args:
-        beta (`float`, defaults to 0.1):
+        beta (`float`, *optional*, defaults to `0.1`):
             The beta factor in DPO loss. Higher beta means less divergence from the initial policy. For the IPO loss, beta is the regularization parameter denoted by tau in the paper.
-        label_smoothing (`float`, defaults to 0):
+        label_smoothing (`float`, *optional*, defaults to `0.0`):
             The robust DPO label smoothing parameter from the [cDPO](https://ericmitchell.ai/cdpo.pdf) report and [Robust DPO](https://huggingface.co/papers/2403.00409) paper that should be between 0 and 0.5.
-        loss_type (`str`, defaults to `"sigmoid"`):
-            The type of DPO loss to use. Either `"sigmoid"` the default DPO loss,`"hinge"` loss from [SLiC](https://huggingface.co/papers/2305.10425) paper, `"ipo"` from [IPO](https://huggingface.co/papers/2310.12036) paper,
Either `"sigmoid"` the default DPO loss,`"hinge"` loss from [SLiC](https://huggingface.co/papers/2305.10425) paper, `"ipo"` from [IPO](https://huggingface.co/papers/2310.12036) paper, - `"bco_pair"` from [BCO](https://huggingface.co/papers/2404.04656) paper or `"robust"` from [Robust DPO](https://huggingface.co/papers/2403.00409) paper, - "aot" and "aot_pair" from alignment via optimal transport - label_pad_token_id (`int`, defaults to `-100`): + loss_type (`str`, *optional*, defaults to `"sigmoid"`): + The type of DPO loss to use. Possible values are: + + - `"sigmoid"`: sigmoid loss from the original [DPO](https://huggingface.co/papers/2305.18290) paper. + - `"hinge"`: hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. + - `"ipo"`: IPO loss from the [IPO](https://huggingface.co/papers/2310.12036) paper. + - `"exo_pair"`: pairwise EXO loss from the [EXO](https://huggingface.co/papers/2402.00856) paper. + - `"nca_pair"`: pairwise NCA loss from the [NCA](https://huggingface.co/papers/2402.05369) paper. + - `"robust"`: unbiased estimate of the DPO loss that is robust to preference noise from the [Robust DPO](https://huggingface.co/papers/2403.00409) paper. + - `"bco_pair"`: pairwise BCO loss from the [BCO](https://huggingface.co/papers/2404.04656) paper. + - `"sppo_hard"`: SPPO loss with hard label from the [SPPO](https://huggingface.co/papers/2405.00675) paper. + - `"aot"`: AOT loss for paired datasets from the [AOT](https://huggingface.co/papers/2406.05882) paper. + - `"aot_pair"`: AOT loss for unpaired datasets from the [AOT](https://huggingface.co/papers/2406.05882) paper. + + label_pad_token_id (`int`, *optional*, defaults to `-100`): The label pad token id. This argument is required if you want to use the default data collator. - padding_value (`Optional[int]`, *optional*): + padding_value (`Optional[int]`, *optional*, defaults to `None`): The padding value if it is different to the tokenizer's pad_token_id. - truncation_mode (`str`, defaults to `keep_end`): + truncation_mode (`str`, *optional*, defaults to `"keep_end"`): The truncation mode to use, either `keep_end` or `keep_start`. This argument is required if you want to use the default data collator. - max_length (`int`, defaults to `None`): + max_length (`Optional[int]`, *optional*, defaults to `None`): The maximum length of the sequences in the batch. This argument is required if you want to use the default data collator. - max_prompt_length (`int`, defaults to `None`): + max_prompt_length (`Optional[int]`, *optional*, defaults to `None`): The maximum length of the prompt. This argument is required if you want to use the default data collator. - max_target_length (`int`, defaults to `None`): + max_target_length (`Optional[int]`, *optional*, defaults to `None`): The maximum length of the target. This argument is required if you want to use the default data collator and your model is an encoder-decoder. - is_encoder_decoder (`Optional[bool]`, `optional`, defaults to `None`): + is_encoder_decoder(`Optional[int]`, *optional*, defaults to `None`): If no model is provided, we need to know if the model_init returns an encoder-decoder. - disable_dropout (`bool`, defaults to `True`): + disable_dropout (`bool`, *optional*, defaults to `True`): Whether or not to disable dropouts in `model` and `ref_model`. - generate_during_eval (`bool`, defaults to `False`): + generate_during_eval (`bool`, *optional*, defaults to `False`): Whether to sample and log generations during evaluation step. 
-        precompute_ref_log_probs (`bool`, defaults to `False`):
+        precompute_ref_log_probs (`bool`, *optional*, defaults to `False`):
             Flag to precompute reference model log probabilities for training and evaluation datasets. This is useful if you want to train without the reference model and reduce the total GPU memory needed.
-        dataset_num_proc (`Optional[int]`, *optional*):
+        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
             The number of workers to use to tokenize the data. Defaults to None.
-        model_init_kwargs (`Optional[Dict]`, *optional*):
+        model_init_kwargs (`Optional[Dict]`, *optional*, defaults to `None`):
             Dict of Optional kwargs to pass when instantiating the model from a string
-        ref_model_init_kwargs (`Optional[Dict]`, *optional*):
+        ref_model_init_kwargs (`Optional[Dict]`, *optional*, defaults to `None`):
             Dict of Optional kwargs to pass when instantiating the ref model from a string
-        model_adapter_name (`str`, defaults to `None`):
+        model_adapter_name (`Optional[str]`, *optional*, defaults to `None`):
             Name of the train target PEFT adapter, when using LoRA with multiple adapters.
-        ref_adapter_name (`str`, defaults to `None`):
+        ref_adapter_name (`Optional[str]`, *optional*, defaults to `None`):
             Name of the reference PEFT adapter, when using LoRA with multiple adapters.
-        reference_free (`bool` defaults to `False`):
+        reference_free (`bool`, *optional*, defaults to `False`):
             If True, we ignore the _provided_ reference model and implicitly use a reference model that assigns equal probability to all responses.
-        force_use_ref_model (`bool`, defaults to `False`):
+        force_use_ref_model (`bool`, *optional*, defaults to `False`):
             In case one passes a PEFT model for the active model and you want to use a different model for the ref_model, set this flag to `True`.
         f_divergence_type (`FDivergenceType`, *optional*, defaults to `FDivergenceType.REVERSE_KL`):
             The type of f-divergence regularization function to compute divergence between policy and reference model. This argument is optional, defaults to `FDivergenceType.REVERSE_KL`.
         f_alpha_divergence_coef (`float`, *optional*, defaults to `1.0`):
             The alpha coef in alpha-divergence(u^-alpha) regularization function for DPO loss.
-        sync_ref_model ('bool', defaults to `False`):
+        sync_ref_model (`bool`, *optional*, defaults to `False`):
             The flag for syncing reference model during training from the [TR-DPO](https://huggingface.co/papers/2404.09656) paper.
-        ref_model_mixup_alpha ('float', defaults to 1.0):
+        ref_model_mixup_alpha (`float`, *optional*, defaults to `1.0`):
             The alpha parameter from the [TR-DPO](https://huggingface.co/papers/2404.09656) paper.
-        ref_model_sync_steps ('int', defaults to 2):
+        ref_model_sync_steps (`int`, *optional*, defaults to `2`):
             The tau parameter from the [TR-DPO](https://huggingface.co/papers/2404.09656) paper.
-        rpo_alpha ('float', defaults to `None`):
+        rpo_alpha (`float`, *optional*, defaults to `None`):
             The alpha parameter from the [RPO](https://huggingface.co/papers/2404.19733) paper V3. If None, no weighting is applied and the loss is the same as the DPO loss. The paper recommends `rpo_alpha=1.0`.
""" beta: float = 0.1 label_smoothing: float = 0 loss_type: Literal[ - "sigmoid", "hinge", "ipo", "bco_pair", "sppo_hard", "nca_pair", "robust", "aot", "aot_pair", "exo_pair" + "sigmoid", "hinge", "ipo", "exo_pair", "nca_pair", "robust", "bco_pair", "sppo_hard", "aot", "aot_pair" ] = "sigmoid" label_pad_token_id: int = -100 padding_value: Optional[int] = None @@ -114,8 +124,8 @@ class DPOConfig(TrainingArguments): ref_adapter_name: Optional[str] = None reference_free: bool = False force_use_ref_model: bool = False - f_divergence_type: Optional[FDivergenceType] = FDivergenceType.REVERSE_KL - f_alpha_divergence_coef: Optional[float] = 1.0 + f_divergence_type: FDivergenceType = FDivergenceType.REVERSE_KL + f_alpha_divergence_coef: float = 1.0 sync_ref_model: bool = False ref_model_mixup_alpha: float = 0.9 ref_model_sync_steps: int = 64