
Add LR Scheduler to full finetune distributed #2017

Merged
merged 1 commit into pytorch:main on Nov 20, 2024

Conversation


@parthsarthi03 (Contributor) commented Nov 17, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature

Please link to any issues this PR addresses: #1308

Purpose of this PR:

This PR adds support for an optional learning rate scheduler in the FullFinetuneRecipeDistributed class, allowing users to configure one in their config file and apply it during training.

You can enable it by adding the following to your config file:

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 50

Changelog

What are the changes made in this PR?

  • Implemented an optional learning rate scheduler (see the sketch after this list):
    • Added a _setup_lr_scheduler method that initializes the scheduler from the configuration.
    • Modified the setup method to call _setup_lr_scheduler after computing self._steps_per_epoch and self.global_step.
    • Updated the training loop in the train method to step the scheduler after each optimizer step.
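
A minimal sketch of how these pieces fit together; _setup_lr_scheduler, self._steps_per_epoch, and self.global_step are quoted from the changelog, while self.total_epochs and self._optimizer are assumed recipe attributes, and the exact code in the PR may differ:

# Hedged sketch based on the changelog above; names follow the PR description,
# but the actual torchtune implementation may differ in details.
from typing import Optional

from omegaconf import DictConfig
from torch.optim.lr_scheduler import LRScheduler
from torchtune import config


class FullFinetuneRecipeDistributed:
    def _setup_lr_scheduler(
        self,
        cfg_lr_scheduler: Optional[DictConfig],
        num_training_steps: int,
        last_epoch: int,
    ) -> Optional[LRScheduler]:
        # The scheduler is optional: if `lr_scheduler` is absent from the config,
        # training keeps a constant learning rate as before.
        if cfg_lr_scheduler is None:
            return None
        # Instantiate the configured component (e.g. get_cosine_schedule_with_warmup)
        # with the optimizer and the total number of scheduler steps.
        return config.instantiate(
            cfg_lr_scheduler,
            self._optimizer,
            num_training_steps=num_training_steps,
            last_epoch=last_epoch,
        )

    def setup(self, cfg: DictConfig) -> None:
        ...  # existing setup of model, optimizer, dataloader, etc.
        # Called only after self._steps_per_epoch and self.global_step are known,
        # so the schedule length and resume position are correct.
        self._lr_scheduler = self._setup_lr_scheduler(
            cfg_lr_scheduler=cfg.get("lr_scheduler", None),
            num_training_steps=self.total_epochs * self._steps_per_epoch,
            last_epoch=self.global_step - 1,
        )

    def train(self) -> None:
        ...  # per-batch forward/backward and optimizer step
        # Step the scheduler once per optimizer step (i.e., per global step).
        if self._lr_scheduler is not None:
            self._lr_scheduler.step()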

Test plan

Tested on 4 GPUs with various configurations (W&B project: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests):

  1. No Learning Rate Scheduler, No Optimizer-in-Backward: https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/e1ddni13
  2. No Learning Rate Scheduler, With Optimizer-in-Backward:
    https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/km2jw6rs
  3. Cosine Learning Rate Scheduler with 50 Warmup Steps, With Optimizer-in-Backward:
    https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/lfacg1b8
  4. Cosine Learning Rate Scheduler with 50 Warmup Steps, Without Optimizer-in-Backward:
    https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/ymktfbam
  5. Resuming Training with Learning Rate Scheduler, Without Optimizer-in-Backward:
    https://wandb.ai/psarthi/torchtune_lr_scheduler_tests/runs/ckia4yzi


pytorch-bot bot commented Nov 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2017

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below.

✅ No Failures

As of commit cfd2eb4 with merge base 0c31907:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Nov 17, 2024
@parthsarthi03 changed the title from "Add LR Scheduler to FullFinetuneRecipeDistributed" to "Add LR Scheduler to full finetune distributed" on Nov 17, 2024

@felipemello1 (Contributor) commented Nov 19, 2024

Thanks for the PR! I glanced over it and it looks great! I will review it more carefully tomorrow and merge it if I don't find any issues :)


@gordicaleksa commented Nov 20, 2024

Consider refactoring this (extracting it into a separate file), because the same setup function is used in full_finetune_single_device.py (https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_single_device.py#L496).

Eventually the two copies will fall out of sync.

cc: @felipemello1

(I've hit this same issue and was about to submit a PR, but noticed this one :))
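
For illustration, a minimal sketch of what such an extraction might look like; the function name, signature, and shared location are hypothetical, not an existing torchtune API:

# Hypothetical shared helper that both full_finetune_distributed and
# full_finetune_single_device could import instead of each defining their own
# _setup_lr_scheduler; the name and location are illustrative only.
from typing import Optional

from omegaconf import DictConfig
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LRScheduler
from torchtune import config


def setup_lr_scheduler(
    cfg_lr_scheduler: Optional[DictConfig],
    optimizer: Optimizer,
    num_training_steps: int,
    last_epoch: int = -1,
) -> Optional[LRScheduler]:
    """Instantiate an optional LR scheduler from a recipe config section."""
    if cfg_lr_scheduler is None:
        return None
    return config.instantiate(
        cfg_lr_scheduler,
        optimizer,
        num_training_steps=num_training_steps,
        last_epoch=last_epoch,
    )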

@gordicaleksa

It might also be worthwhile adding something like:

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 10

to the configs, e.g. for Llama 3.1 (8B/70B); having it explicit is better than not being sure which scheduler is being used.


@felipemello1 (Contributor) left a comment

LGTM! Thanks for doing all of these tests! As a follow-up, we would have to update the configs. I have a script I used for another PR to bulk-update them. If you want to do it, let me know; otherwise I can. In the script you would have to filter all configs that have "full.yaml" in their name and find the best spot to add these lines, which is probably right after the tokenizer.

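The bulk-update script mentioned above is not included in this PR; purely for illustration, a hedged sketch of what it might look like (the config directory, insertion heuristic, and warmup value are assumptions):

# Hypothetical sketch: find recipe configs whose filename contains "full.yaml"
# and insert an lr_scheduler block right after the tokenizer section. The actual
# script used by the maintainers is not part of this PR.
from pathlib import Path

LR_SCHEDULER_BLOCK = (
    "lr_scheduler:\n"
    "  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup\n"
    "  num_warmup_steps: 10  # illustrative value\n"
)


def add_lr_scheduler(cfg_path: Path) -> None:
    text = cfg_path.read_text()
    if "lr_scheduler:" in text:
        return  # already configured, leave untouched
    out, in_tokenizer, inserted = [], False, False
    for line in text.splitlines(keepends=True):
        if line.startswith("tokenizer:"):
            in_tokenizer = True
        elif in_tokenizer and not inserted and line and not line.startswith((" ", "\t", "#", "\n")):
            # First top-level key after the tokenizer block: insert just before it.
            out.append(LR_SCHEDULER_BLOCK + "\n")
            inserted = True
        out.append(line)
    if not inserted:
        out.append("\n" + LR_SCHEDULER_BLOCK)
    cfg_path.write_text("".join(out))


if __name__ == "__main__":
    for cfg in Path("recipes/configs").rglob("*.yaml"):
        if "full.yaml" in cfg.name:
            add_lr_scheduler(cfg)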

@felipemello1 (Contributor)

@gordicaleksa, great point! We are currently having some internal discussions about what should be exposed in the recipe and what should be a utility. In general, we are OK with repeating code so that it is easy for people to hack on it and make their changes. But there are use cases like this one that seem to be pretty standard and really don't add much value by being exposed. We will work on making our recipes a bit leaner soon.

@felipemello1 merged commit fcd400f into pytorch:main on Nov 20, 2024
17 checks passed
@ebsmothers mentioned this pull request on Nov 26, 2024
@parthsarthi03 deleted the add_lr_scheduler branch on November 29, 2024
Labels: CLA Signed · 4 participants