muP (Maximal Update Parametrization)
This is an advance showing that if you do learning rate optimization on a smaller model, you can use this framework to transfer the result to larger models.
From what I can see, the vocab size and number of layers need to stay the same, but it gives a method for finding the optimal learning rate for larger (i.e., wider) models from a single sweep on a smaller model.
The claim is that this also holds across datasets. I think it further assumes that higher and lower learning rates don't cross the smaller model's validation loss vs. iteration curve, which I'm not sure is true, since schedulers -- and dropping the learning rate -- have such a large impact on the final loss.
Notes: changing the model architecture requires retuning; this scales only the embedding (width) dimension.
Reference: https://github.com/EleutherAI/nanoGPT-mup
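As a rough illustration of the idea above, here is a minimal PyTorch sketch of muP-style learning-rate transfer for Adam: sweep the base learning rate on a narrow proxy model, then reuse it for a wider model with hidden-matrix learning rates divided by the width multiplier. The helper name `make_param_groups`, the example widths, and the rule of keeping embedding/bias learning rates at the base value are assumptions for illustration, not code taken from the nanoGPT-mup repo.

```python
# Minimal sketch of muP-style learning-rate transfer for Adam (assumptions:
# hidden-matrix LRs scale as 1/width_multiplier; embedding and bias LRs stay
# at the base value; width is the only dimension that changes).
import torch
import torch.nn as nn

def make_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Build Adam param groups so a LR tuned at `base_width` transfers to `width`."""
    mult = width / base_width  # width multiplier m
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Treat 2-D weight matrices that are not the token embedding as "hidden".
        if p.ndim == 2 and "embed" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr / mult},  # muP: scale hidden LR by 1/m
        {"params": other, "lr": base_lr},          # embeddings/biases keep the base LR
    ]

# Usage (hypothetical numbers): sweep base_lr on a width-256 proxy model, then
# reuse the best value when building the optimizer for a width-2048 model.
# optimizer = torch.optim.AdamW(make_param_groups(big_model, best_lr, 256, 2048))
```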
Image from paper; source: https://cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization