muP (Maximal Update Parametrization)
This is an advance showing that if you do learning rate optimization on a smaller model, you can use this framework to transfer the result to larger models.
From what I can see, the vocab size and number of layers need to stay the same, but it gives a method for finding the optimal learning rate for larger (i.e., wider) models from a single sweep on a smaller model.
The claim is that this also holds across datasets. I think it further assumes that higher and lower learning rates don't cross the smaller model's validation loss vs. iteration curve, which I'm not sure is true, since schedulers -- and dropping the learning rate -- have such a large impact on the final loss.
Notes: changing the model architecture requires retuning; this scales only the embedding (width) dimension.
Reference: https://github.com/EleutherAI/nanoGPT-mup
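As a rough illustration of the idea above, here is a minimal PyTorch sketch of muP-style learning-rate transfer for Adam: sweep the base learning rate on a narrow proxy model, then reuse it for a wider model with hidden-matrix learning rates divided by the width multiplier. The helper name `make_param_groups`, the example widths, and the rule of keeping embedding/bias learning rates at the base value are assumptions for illustration, not code taken from the nanoGPT-mup repo.

```python
# Minimal sketch of muP-style learning-rate transfer for Adam (assumptions:
# hidden-matrix LRs scale as 1/width_multiplier; embedding and bias LRs stay
# at the base value; width is the only dimension that changes).
import torch
import torch.nn as nn

def make_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Build Adam param groups so a LR tuned at `base_width` transfers to `width`."""
    mult = width / base_width  # width multiplier m
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Treat 2-D weight matrices that are not the token embedding as "hidden".
        if p.ndim == 2 and "embed" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr / mult},  # muP: scale hidden LR by 1/m
        {"params": other, "lr": base_lr},          # embeddings/biases keep the base LR
    ]

# Usage (hypothetical numbers): sweep base_lr on a width-256 proxy model, then
# reuse the best value when building the optimizer for a width-2048 model.
# optimizer = torch.optim.AdamW(make_param_groups(big_model, best_lr, 256, 2048))
```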
Image from paper; source: https://cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization