Improvements 1a #70
Conversation
Thanks @edbeeching, this looks great! I left a few minor comments; some of them can also be addressed in a follow-up PR.
# define a reward for response
# (this could be any reward such as human feedback or output from another model)
- reward = [torch.tensor(1.0)]
+ reward = [torch.tensor(1.0).to(device)]
We should let accelerate handle device placement internally. (Can address it in a follow-up PR if you want.)
I think the reward would normally be from a preference model, so it would probably be on the correct device. But in this example that is not the case. I'll leave it to another PR.
Yes, I discussed with @younesbelkada that we should add some device placement steps to PPOTrainer.step.
Yes, this will be addressed in a PR right after merging this one.
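For context, here is a minimal sketch of what such a device-placement step could look like. The helper name and the idea of reading the target device from the trainer are assumptions for illustration, not the API added by the follow-up PR.

```python
import torch

def move_rewards_to_device(rewards, device):
    """Hypothetical helper (not TRL's actual API): move scalar rewards onto
    the device the trainer runs on before they enter the PPO step."""
    return [
        r.to(device) if isinstance(r, torch.Tensor) else torch.tensor(float(r), device=device)
        for r in rewards
    ]

# Example: a CPU reward (e.g. from human feedback) is placed on the training
# device, so callers would no longer need to write `.to(device)` themselves.
device = "cuda" if torch.cuda.is_available() else "cpu"
rewards = move_rewards_to_device([torch.tensor(1.0)], device)
```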
# isort: off
from .utils import AdaptiveKLController, FixedKLController

# isort: on
Why are the utils imports separate from the others?
Sorting these imports with isort causes a circular import in PPOTrainer, so I added a comment. Alternatively, we could exclude them from the __init__.py file and/or use a relative import in PPOTrainer.
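As a rough sketch of the relative-import alternative mentioned above (the module path and package layout are assumed for illustration, not taken from this PR):

```python
# trl/trainer/ppo_trainer.py -- hypothetical layout, for illustration only.
#
# A relative import from the sibling utils module never re-enters
# trl/__init__.py, so isort could reorder the package-level imports without
# triggering the circular import described above.
from .utils import AdaptiveKLController, FixedKLController
```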
Looks good - thanks @edbeeching!
Thanks for taking care of the improvements, @edbeeching!
This PR: