Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy kl [old | new] #168

Merged
merged 1 commit into from
Feb 22, 2023
Merged

policy kl [old | new] #168

merged 1 commit into from
Feb 22, 2023

Conversation

kashif
Copy link
Collaborator

@kashif kashif commented Feb 21, 2023

I believe the kl we want to record is KL[policy_old || policy_new]

Perhaps @natolambert can confirm?

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Feb 21, 2023

The documentation is not available anymore as the PR was closed or merged.

@natolambert
Copy link
Contributor

For things like this, I think it's best to run a couple examples with and without the change. In theory, I think you can be right (@lvwerra probably knows).

I went back to the original repo and saw that TRL matches it.

https://github.com/openai/lm-human-preferences/blob/bd3775f200676e7c9ed438c50727e7452b1a52c1/lm_human_preferences/train_policy.py#L360

@kashif
Copy link
Collaborator Author

kashif commented Feb 21, 2023

thanks @natolambert yes the approxkl is fine since it's the square of the difference so the order is not important. It's the policykl which is not correct i believe.

@lvwerra
Copy link
Member

lvwerra commented Feb 22, 2023

I wonder where that actually came from since it was not part of the original implementation 😂. True, if policykl should measure the KL-div of the current to the old policy then it's the wrong way round. Since it's only used for monitoring I am up to changing it unless there is a strong incentive to keep it that way.

@kashif
Copy link
Collaborator Author

kashif commented Feb 22, 2023

yes in any case it's just for logging, it's not used to back-prop etc. I check other implementations e.g. ray, spinningup etc. and they have it as in this fix.

@lvwerra lvwerra merged commit a757ac4 into huggingface:main Feb 22, 2023
@kashif kashif deleted the fix-policy-kl branch February 22, 2023 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants