
nemo-automodel checkpoint-io refactor #12070

Merged
merged 143 commits into main from akoumparouli/nemo_automodel_checkpoint_io_refactor
Feb 18, 2025

Conversation

akoumpa
Member

@akoumpa akoumpa commented Feb 5, 2025

Changes:

  1. Centralize all checkpoint-related functionality in HFCheckpointIO, which is responsible for saving the model, optimizer, and trainer state-dicts. By centralizing the checkpoint IO, it can be shared across different strategies (e.g. SingleDeviceStrategy, DDPStrategy, FSDP2Strategy); a minimal sketch of the shared interface follows below.
  2. Non-Megatron strategies do not define a consumed_samples attribute; the step attribute is used instead.
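
A minimal sketch of the kind of shared plugin item 1 describes, written against the standard PyTorch Lightning CheckpointIO contract; the class name and the on-disk layout below are illustrative and are not the actual HFCheckpointIO implementation.

```python
import os
from typing import Any, Dict, Optional

import torch
from lightning.pytorch.plugins.io import CheckpointIO


class SharedCheckpointIO(CheckpointIO):
    """Toy checkpoint-io plugin: one directory per checkpoint holding the full trainer state."""

    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str, storage_options: Optional[Any] = None) -> None:
        # `checkpoint` already bundles the model state_dict plus optimizer and trainer state,
        # which is what lets a single plugin serve single-device, DDP, and FSDP2 runs alike.
        os.makedirs(path, exist_ok=True)
        torch.save(checkpoint, os.path.join(path, "trainer_state.pt"))

    def load_checkpoint(self, path: str, map_location: Optional[Any] = None) -> Dict[str, Any]:
        return torch.load(os.path.join(path, "trainer_state.pt"), map_location=map_location)

    def remove_checkpoint(self, path: str) -> None:
        target = os.path.join(path, "trainer_state.pt")
        if os.path.isfile(target):
            os.remove(target)
```

Because the plugin only sees plain state-dicts, the same instance can be handed to different strategies, e.g. `Trainer(strategy="ddp", plugins=[SharedCheckpointIO()])`.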

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
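
Since the snippet above is left as a placeholder, here is a hypothetical usage sketch. Every import path, class location, and constructor argument below (the HFCheckpointIO module path, the FSDP2Strategy and HFCheckpointIO arguments, the omitted data module) is an assumption and may differ from the merged code.

```python
# Hypothetical usage sketch -- names and signatures below are assumptions, not the merged API.
from nemo import lightning as nl
from nemo.collections import llm
from nemo.lightning.io.hf import HFCheckpointIO  # assumed module path for the refactored plugin

if __name__ == "__main__":
    model = llm.HFAutoModelForCausalLM(model_name="meta-llama/Llama-3.2-1B")

    # The same checkpoint-io plugin is meant to serve SingleDeviceStrategy, DDPStrategy, and FSDP2Strategy.
    trainer = nl.Trainer(
        devices=1,
        max_steps=10,
        strategy=nl.FSDP2Strategy(),      # assumed default constructor
        plugins=[HFCheckpointIO(model)],  # assumed constructor signature
    )
    trainer.fit(model)  # data module omitted for brevity
```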

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@akoumpa akoumpa changed the title init commit nemo-automodel checkpoint-io refactor Feb 5, 2025
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 5 times, most recently from 2ebc0ed to 62c9b22 Compare February 6, 2025 06:20
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 2 times, most recently from 162eaa2 to 103326b Compare February 6, 2025 06:34
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 6 times, most recently from 5a687db to e4327a3 Compare February 7, 2025 18:24
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 13 times, most recently from ef2c811 to cf72230 Compare February 9, 2025 20:50
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 82a1343 to efdf62e Compare February 16, 2025 23:00
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 16, 2025
Contributor

[🤖]: Hi @akoumpa 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

I'm just a bot, so I'll leave it to you to decide what to do next.

//cc @pablo-garay @ko3n1g

@akoumpa akoumpa marked this pull request as ready for review February 17, 2025 20:49
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 842333d to 78c19da Compare February 17, 2025 22:05
pablo-garay previously approved these changes Feb 17, 2025
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 17, 2025
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 0687c89 to e877f2a Compare February 17, 2025 22:49
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 7c456f2 to 159015e Compare February 17, 2025 22:53
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 17, 2025
Contributor

[🤖]: Hi @akoumpa 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it you what to do next.

//cc @pablo-garay @ko3n1g

@akoumpa akoumpa merged commit 2f25569 into main Feb 18, 2025
238 of 239 checks passed
@akoumpa akoumpa deleted the akoumparouli/nemo_automodel_checkpoint_io_refactor branch February 18, 2025 05:44
ko3n1g pushed a commit that referenced this pull request Feb 18, 2025
* init commit

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add checkpoint_io param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale code

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move HFCheckpointIO to separate file

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move rank logic to strat

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add make_strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* minor fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* minor fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add sync_dist option

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* wip

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update kw

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* run _sync_from_last_pipeline_stage only with MegatronStrategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* switch ckpt template for automodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use logger

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update HFCheckpointIO call & add load_pretrained

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update to use logger

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update to use logger

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* for non-mcore strats track step instread of global_step

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* moved reduced_train_loss log to automodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* log reduced_train_loss

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update docs

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* f

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused option

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add is_rank_0 guard

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update assert message

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* load checkpoint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* load checkpoint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update load_checkpoint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused args

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* optim state restore

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* optim state restore

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* optim state restore

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* optim state restore; docu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* optim state restore; docu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* override lightning_module_state_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* load_state_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add autoresume

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add automodel & switch to HFdatamodule

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* uncomment test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* enable test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* at most one change

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add module name materializer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* skip test for now

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* minor fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Introduce hf_adapter and hf_weights directories

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix comment

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* docu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylitn

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* docu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylitn

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add HFAdapterKeyRenamer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add HFAdapterKeyRenamer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add HFAdapterKeyRenamer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint;

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use HF_WEIGHTS_PATH

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add logger

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update sft.py

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add auto-resume tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* comment fsdp2 test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use mixtral_2l instead of hf_gemma_2b

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* nemo auto-model peft restoration

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* nemo auto-model peft restoration

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update peft test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update peft test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* skip params without grad

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* disable optim state restore

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add verify peft checkpoint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add auto-restore test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix arg typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix device_mesh init and load_model_state_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* comments

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update ValidateCheckpointRestoreCallback

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add missing imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* docu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move parallelize_fn to FSDP2Strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move parallelize_fn to FSDP2Strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move parallelize_fn to FSDP2Strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move parallelize_fn to FSDP2Strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move parallelize_fn to FSDP2Strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move parallelize_fn to FSDP2Strategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove fully_shard from lora

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* trigger parallelize from peft

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove parallelize_fn

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add mp_policy param

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* handle torch's migration in import

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* minor change

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add connector between model and optimizer

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update for automodel

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* make HF_WEIGHTS_PATH

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* workaround fiddle

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add tests for PytorchOptimizerModule

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add io/hf.py test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused import

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update verify_sft_checkpoint_structure

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* copyright

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* load adapter weights to cpu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update automodels

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update automodels

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use getattr to handle children

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* drop connect_optim_builder change

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use .format

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* switch to .format

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
akoumpa added a commit that referenced this pull request Feb 18, 2025
ko3n1g added a commit that referenced this pull request Feb 18, 2025