
Add support for Maximal Update Parametrization (muP) #372

Open
gkielian opened this issue Jan 23, 2025 · 0 comments
gkielian commented Jan 23, 2025

muP (Maximal Update Parametrization)

This is an advance showing that if one does learning-rate optimization on a smaller model, one can transfer the tuned learning rate to larger models under this parametrization.
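
Below is a minimal sketch of the width-scaling rules as described in the Cerebras practitioner's guide linked at the bottom, assuming Adam. `BASE_WIDTH`, the toy `MLP`, and `mup_param_groups` are illustrative names, not code from this repo or the reference repo:

```python
# Sketch of muP width scaling for Adam (per the Cerebras guide; all names
# here are assumptions for illustration, not the repo's actual code).
import torch
import torch.nn as nn

BASE_WIDTH = 256  # proxy width at which hyperparameters are tuned (assumption)


class MLP(nn.Module):
    def __init__(self, vocab_size: int, width: int):
        super().__init__()
        self.width_mult = width / BASE_WIDTH
        self.embed = nn.Embedding(vocab_size, width)             # "input" weights
        self.hidden = nn.Linear(width, width, bias=False)        # "hidden" weights
        self.readout = nn.Linear(width, vocab_size, bias=False)  # "output" weights

        # muP init: hidden variance shrinks with the width multiplier;
        # the readout can simply start at zero.
        nn.init.normal_(self.embed.weight, std=0.02)
        nn.init.normal_(self.hidden.weight, std=0.02 / self.width_mult ** 0.5)
        nn.init.zeros_(self.readout.weight)

    def forward(self, idx):
        h = torch.relu(self.hidden(self.embed(idx)))
        # muP output multiplier: logits are divided by the width multiplier.
        return self.readout(h) / self.width_mult


def mup_param_groups(model: MLP, lr: float):
    """Per-layer Adam learning rates under muP: matrix-like (hidden/readout)
    weights get lr / width_mult; the embedding keeps the base lr."""
    m = model.width_mult
    return [
        {"params": model.embed.parameters(), "lr": lr},
        {"params": model.hidden.parameters(), "lr": lr / m},
        {"params": model.readout.parameters(), "lr": lr / m},
    ]
```

The key design point is that only the width-dependent pieces change with model size: hidden init variance and matrix learning rates shrink with the width multiplier, and the logits pick up a 1/width-multiplier output scale.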

From what I can see, the vocab size and number of layers need to remain the same, but this gives a method for finding the optimal learning rate for larger models (any larger-width model) from a single sweep of a smaller model.
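
Continuing the sketch above, the transfer workflow would look roughly like this. `train` is a toy loop on random tokens (targets are just the inputs, purely to make the sketch runnable); in practice it would be validation loss on real data:

```python
# Hypothetical muTransfer workflow: one lr sweep at the proxy width, then
# reuse the best value at a larger width. Builds on the sketch above.
import torch
import torch.nn as nn


def train(model, opt, steps=200, vocab_size=50304):
    """Toy stand-in for real pretraining: cross-entropy on random tokens."""
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        idx = torch.randint(0, vocab_size, (8, 32))  # random (batch, seq) ids
        logits = model(idx)
        loss = loss_fn(logits.view(-1, vocab_size), idx.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


def sweep_lr(width, lrs):
    """Single learning-rate sweep at the small proxy width."""
    best_lr, best_loss = None, float("inf")
    for lr in lrs:
        model = MLP(vocab_size=50304, width=width)
        opt = torch.optim.Adam(mup_param_groups(model, lr=lr))
        loss = train(model, opt)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr


best_lr = sweep_lr(width=256, lrs=[1e-4, 3e-4, 1e-3, 3e-3, 1e-2])

# Under muTransfer, the same value should be (near-)optimal at larger widths,
# so the wide model is trained directly at best_lr with no further sweep.
big = MLP(vocab_size=50304, width=2048)
opt = torch.optim.Adam(mup_param_groups(big, lr=best_lr))
```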

The claim is that this also holds true across datasets. I think it also assumes that higher and lower learning rates don't cross the smaller model's validation-loss vs. iteration curve (something I'm not sure is true, since schedulers -- and dropping the learning rate -- have such a large impact on the final loss).

Notes: changing the model architecture requires retuning; this scales only the embedding (width) dimension.

Reference: https://github.com/EleutherAI/nanoGPT-mup

Image from paper: [figure omitted]

Image source: https://cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization

gkielian self-assigned this Jan 23, 2025