
Add LBFGS optimizer as an option #792

Open
wants to merge 53 commits into main

Conversation

@vue1999 (Collaborator) commented Jan 20, 2025

Adds LBFGS optimizer support.

Key details

  • Supports distributed training across multiple GPUs when using the --distributed flag
  • Handles datasets larger than available memory through batch processing
  • Uses strong Wolfe conditions for the line search
  • Modifies the dataloader to avoid dropping samples when using LBFGS
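A minimal, self-contained sketch of the pattern above (not the PR's actual implementation; the toy model and synthetic data stand in for MACE and its dataloader): `torch.optim.LBFGS` with strong Wolfe line search, where each closure call rebuilds the full-dataset loss and gradient by accumulating over memory-sized batches.

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(512, 16), torch.randn(512, 1)
model = torch.nn.Linear(16, 1)
# Batching here is purely for memory: the optimizer still sees full-dataset gradients.
batches = list(zip(X.split(128), y.split(128)))

optimizer = torch.optim.LBFGS(
    model.parameters(),
    history_size=10,                # illustrative value, not taken from the PR
    max_iter=20,                    # illustrative value, not taken from the PR
    line_search_fn="strong_wolfe",  # strong Wolfe conditions for the line search
)

def closure():
    # LBFGS evaluates this closure at several points per step (line search),
    # so the full-dataset loss and gradient are recomputed from batches each time.
    optimizer.zero_grad()
    total_loss = 0.0
    for xb, yb in batches:
        # Scale by 1/N so the accumulated gradient equals that of the mean loss.
        loss = torch.nn.functional.mse_loss(model(xb), yb, reduction="sum") / len(X)
        loss.backward()
        total_loss += loss.item()
    # With --distributed, the accumulated gradients would additionally be
    # all-reduced across GPUs at this point.
    return total_loss

for _ in range(5):
    optimizer.step(closure)
```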

How to use

  • Add the --lbfgs flag
  • Use the largest batch_size that fits into memory (for best performance)
  • Disable EMA
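An illustrative invocation (an assumption, not taken from the PR: apart from --lbfgs and --distributed, which this PR adds, the remaining flags are standard mace_run_train options and may differ in your setup; EMA is assumed to stay off when --ema is not passed):

```bash
mace_run_train \
    --name "my_model" \
    --train_file "train.xyz" \
    --restart_latest \
    --lbfgs \
    --batch_size 256 \
    --distributed
```

Here --restart_latest reflects the workflow described in the notes below, continuing from an Adam-trained checkpoint; drop it to use LBFGS from the start of training.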

Advantages

  • More deterministic training than first-order optimizers
  • Less sensitive to random seeds
  • Batch-size-independent convergence path
  • Fewer hyperparameters to tune
  • Possibly faster and better convergence

Notes

  • First-order optimizer hyperparameters (weight decay, learning rate, etc.) are ignored when using LBFGS
  • EMA should be disabled as it's redundant with line search
  • Unlike first-order optimizers, where batch size affects the optimization dynamics, in LBFGS the batch_size parameter is purely computational (see the sketch after this list):
    • Each optimization step requires gradients computed over the entire training set (at several points in parameter space)
    • The full-dataset gradient computation is split into batches for memory efficiency and performance (by parallelising across GPUs)
    • The optimizer only sees the accumulated full-dataset gradients, regardless of how they were computed
  • LBFGS can be used from the very start of training, but Adam is likely to be significantly faster early on, when the extra precision of LBFGS brings little benefit.
  • In the examples we looked at, we first trained the models using Adam and then switched to LBFGS. (This can be done by restarting training from a checkpoint with the --lbfgs flag.)
  • It doesn’t work with multi-head training
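As a sanity check on the point about batch_size being purely computational, here is a small, self-contained numerical sketch (assumption: a toy least-squares model stands in for MACE) showing that the gradient accumulated over batches matches the full-dataset gradient regardless of batch size, including a ragged final batch:

```python
import torch

torch.manual_seed(0)
X = torch.randn(1000, 8, dtype=torch.float64)
y = torch.randn(1000, 1, dtype=torch.float64)
w = torch.zeros(8, 1, dtype=torch.float64, requires_grad=True)

def full_gradient():
    w.grad = None
    (((X @ w - y) ** 2).sum() / len(X)).backward()
    return w.grad.clone()

def accumulated_gradient(batch_size):
    w.grad = None
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        # Scale each batch by 1/N so the summed batch gradients equal the
        # gradient of the full-dataset mean loss.
        (((xb @ w - yb) ** 2).sum() / len(X)).backward()
    return w.grad.clone()

g_full = full_gradient()
for bs in (1000, 256, 64, 7):  # 7 leaves a ragged last batch; no samples are dropped
    assert torch.allclose(g_full, accumulated_gradient(bs))
print("accumulated batch gradients match the full-dataset gradient")
```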

Examples

3BPA
[Figures: 3BPA_size_comp, error_spread_256]

SPICE (H, C atoms only subset)
[Figures: relative_tables_MACE_OFF, loss_vs_time_continue]

@vue1999 marked this pull request as ready for review February 17, 2025
@ilyes319 (Contributor) commented Mar 5, 2025

Hey @vue1999, that looks great! Is it ready to merge?

@vue1999 (Collaborator, Author) commented Mar 12, 2025

Yes. We didn’t do exhaustive testing, but it works with the multi-head options too.

@ilyes319 (Contributor) commented

It would be worth adding some tests for it, inspired by the run-train one (test_run_train.py). Just a simple training run would be fine.
