Add LR Scheduler to full finetune distributed #2017
Conversation
Thanks for the PR! I glanced over it and it looks great! I will review it more carefully tomorrow and merge it if I don't find any issues :)
Consider refactoring (extracting this into a separate file), because the same setup function is used in other recipes as well; eventually they'll fall out of sync. cc: @felipemello1 (I've hit this same issue and was about to submit a PR, but noticed this one :))
Also, it might be worthwhile adding something like the block sketched below to the configs, e.g. for Llama 3.1 (8B/70B): an explicit entry is better than not being sure which scheduler is being used.
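Purely as an illustration, such a block might look like the following; the component path, warmup value, and config file names are assumptions, not taken from this PR:

```yaml
# Illustrative only: make the scheduler explicit in the full-finetune configs,
# e.g. llama3_1/8B_full.yaml and llama3_1/70B_full.yaml (assumed file names).
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
```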
LGTM! Thanks for doing all of these tests! As a follow-up, we would have to update the configs. I have a script I used for another PR to bulk-update them. If you want to do it, let me know; otherwise I can. In the script you would have to filter all configs that have "full.yaml" in their name and find the best spot to add these lines, which is probably right after the tokenizer.
@gordicaleksa, great point! We are currently having some internal discussions about what should be exposed in the recipe and what should be a utility. In general, we are OK with repeating code so that it is easy for people to hack on and make their changes. But there are use cases like this one that seem pretty standard and really don't add much value by being exposed. We will work on making our recipes a bit leaner soon.
Context
What is the purpose of this PR? It adds a new feature: optional learning rate scheduler support for the distributed full-finetune recipe.
Please link to any issues this PR addresses: #1308
Purpose of this PR:
This PR adds support for an optional learning rate scheduler to the FullFinetuneRecipeDistributed class, allowing users to configure and use a learning rate scheduler during training. You can enable it by adding the following to your config file:
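A minimal sketch of such a config block, assuming torchtune's cosine-with-warmup scheduler; the component path and the warmup value here are illustrative rather than the exact snippet from this PR:

```yaml
# The scheduler is optional: without an lr_scheduler entry, training proceeds
# with the optimizer's fixed learning rate as before.
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
```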
Changelog
What are the changes made in this PR?
- Added a _setup_lr_scheduler method to initialize the scheduler based on the configuration (a rough sketch of how these pieces fit together is shown after this list).
- Updated the setup method to call _setup_lr_scheduler after computing self._steps_per_epoch and self.global_step.
- Updated the train method to step the scheduler after each optimizer step.
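This is not the exact diff, but a rough sketch of how the changelog items could fit together, modeled on the pattern the existing torchtune LoRA recipes use; the attribute names self._optimizer, self._lr_scheduler, self.total_epochs and the cfg.get call are assumptions:

```python
from typing import Optional

from omegaconf import DictConfig
from torch.optim.lr_scheduler import LRScheduler

from torchtune import config


# Sketch of a FullFinetuneRecipeDistributed method, shown standalone for brevity.
def _setup_lr_scheduler(
    self,
    cfg_lr_scheduler: Optional[DictConfig],
    num_training_steps: int,
    last_epoch: int,
) -> Optional[LRScheduler]:
    # The scheduler is optional: if the config has no lr_scheduler section,
    # training keeps the optimizer's fixed learning rate.
    if cfg_lr_scheduler is None:
        return None
    # Build the scheduler from the config, wrapping the already-built optimizer.
    return config.instantiate(
        cfg_lr_scheduler,
        self._optimizer,
        num_training_steps=num_training_steps,
        last_epoch=last_epoch,
    )


# In setup(), after self._steps_per_epoch and self.global_step are computed:
#     self._lr_scheduler = self._setup_lr_scheduler(
#         cfg_lr_scheduler=cfg.get("lr_scheduler", None),
#         num_training_steps=self.total_epochs * self._steps_per_epoch,
#         last_epoch=self.global_step - 1,
#     )
#
# In train(), right after the optimizer step:
#     if self._lr_scheduler is not None:
#         self._lr_scheduler.step()
```

Keeping the scheduler optional means existing full-finetune configs without an lr_scheduler entry behave exactly as before.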
Test plan
Tested on 4 GPUs with various configurations (https://wandb.ai/psarthi/torchtune_lr_scheduler_tests):
https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/km2jw6rs
https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/lfacg1b8
https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/ymktfbam
https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/ckia4yzi