support loss function for Self-play Preference Optimization #1612

winglian · 2024-05-02T12:35:42Z

younesbelkada

Thanks a lot for this great addition!
Can you add a section to the docs that mentions this method, here: https://github.com/huggingface/trl/blob/main/docs/source/dpo_trainer.mdx#loss-functions 🙏

HuggingFaceDocBuilderDev · 2024-05-02T12:41:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

winglian · 2024-05-02T12:44:45Z

@younesbelkada docs updated. thanks!

younesbelkada

Thanks a lot!

younesbelkada · 2024-05-02T12:49:53Z

cc @kashif wdyt? 🙏

kashif · 2024-05-02T13:10:57Z

Thanks @winglian, can you kindly update the Config's doc and arguments?

trl/trl/trainer/dpo_config.py, line 72 in e2189c9:

    loss_type: Literal["sigmoid", "hinge", "ipo", "kto_pair", "bco_pair"] = "sigmoid"
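The change requested here amounts to extending the `Literal` on `loss_type`. The following is only a minimal sketch, assuming the new value is spelled `"sppo"` (the spelling used in the test suggestion later in this thread). `DPOConfigSketch`, the `LossType` alias, and the `beta` default are hypothetical stand-ins for illustration, not TRL's actual `DPOConfig`.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Hypothetical alias. The first five values are copied from the quoted line;
# "sppo" is the assumed new entry.
LossType = Literal["sigmoid", "hinge", "ipo", "kto_pair", "bco_pair", "sppo"]


@dataclass
class DPOConfigSketch:
    """Illustrative stand-in for the config class, not TRL's DPOConfig."""

    beta: float = 0.1  # assumed default, for illustration only
    loss_type: LossType = "sigmoid"

    def __post_init__(self):
        # Literal is not enforced at runtime, so validate explicitly.
        if self.loss_type not in get_args(LossType):
            raise ValueError(f"Unknown loss type: {self.loss_type!r}")
```

With this sketch, `DPOConfigSketch(loss_type="sppo")` is accepted, while an unlisted string raises `ValueError`.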

docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <[email protected]>

kashif · 2024-05-02T13:25:07Z

@winglian do you want to add an option to the test, e.g. ["gpt2", "sppo", True]?

angelahzyuan · 2024-05-03T04:11:24Z

@winglian Thanks for adding our work! @younesbelkada @kashif I just submitted a new pull request, #1615. It updates the loss function according to Equation (4.8), with $P(y_w > y_l) = 1$ and $P(y_l > y_w) = 0$, and justifies it in the docs as the hard label version of the algorithm.

It should work well now for the first iteration. Our reported 3-iteration results were based on the soft label version.
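To make the hard label version concrete, here is a minimal per-example sketch. It assumes the squared-error form in which each policy/reference log-ratio is regressed toward $(P - 1/2)/\beta$, so that $P = 1$ gives a target of $+0.5/\beta$ for the chosen response and $P = 0$ gives $-0.5/\beta$ for the rejected one. The function name, the argument names, and the use of `1/beta` as the scale are assumptions for illustration, not the code merged in #1615.

```python
def sppo_hard_loss(policy_chosen_logp, policy_rejected_logp,
                   ref_chosen_logp, ref_rejected_logp, beta):
    """Hard-label SPPO-style loss for one preference pair (illustrative)."""
    # Log-ratios of the policy against the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the chosen log-ratio toward +0.5/beta (P = 1) and the
    # rejected log-ratio toward -0.5/beta (P = 0).
    return ((chosen_logratio - 0.5 / beta) ** 2
            + (rejected_logratio + 0.5 / beta) ** 2)


# Policy equal to the reference: both log-ratios are 0, so with beta = 0.5
# the loss is (0 - 1)^2 + (0 + 1)^2 = 2.
print(sppo_hard_loss(-2.0, -2.0, -2.0, -2.0, beta=0.5))
```

Under this form a smaller `beta` moves both targets further from zero, which is one reason the choice of `beta` matters for this loss.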

flozi00 · 2024-05-05T20:51:49Z

I just gave it a try and it's working better than ORPO for me now.
I installed the main branch this evening, with the follow-up patch for the hard loss type.

RL4LLM · 2024-05-09T00:45:48Z

Hi @winglian @flozi00, do you know what value of beta I should set for SPPO?

support loss function for Self-play Preference Optimization

567029c

younesbelkada reviewed May 2, 2024

update docs

699d6d7

younesbelkada approved these changes May 2, 2024

younesbelkada requested a review from kashif May 2, 2024 12:49

update value error msg

e2189c9

winglian mentioned this pull request May 2, 2024

add support for SPPO axolotl-ai-cloud/axolotl#1585

Open

kashif reviewed May 2, 2024

docs/source/dpo_trainer.mdx (outdated, resolved)

winglian and others added 2 commits May 2, 2024 09:21

update typehint

36aa65e

Update docs/source/dpo_trainer.mdx

78d921e

Co-authored-by: Kashif Rasul <[email protected]>

kashif approved these changes May 2, 2024

include sppo in tests

0d13ac4

kashif approved these changes May 2, 2024

kashif merged commit adf17a5 into huggingface:main May 2, 2024
9 checks passed

angelahzyuan mentioned this pull request May 3, 2024

corrects loss function for Self-play Preference Optimization hard label version #1615

Merged
