nemo-automodel checkpoint-io refactor #12070
[🤖]: Hi @akoumpa 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully, so it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it up to you what to do next. //cc @pablo-garay @ko3n1g
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
* init commit * fix * add checkpoint_io param * remove stale code * move HFCheckpointIO to separate file * move rank logic to strat * add make_strategy * minor fix * fix * fix * fix * minor fix * add sync_dist option * wip * update kw * run _sync_from_last_pipeline_stage only with MegatronStrategy * switch ckpt template for automodel * fix * use logger * update HFCheckpointIO call & add load_pretrained * update to use logger * update to use logger * for non-mcore strats track step instread of global_step * moved reduced_train_loss log to automodel * log reduced_train_loss * update docs * fix * f * remove unused option * pylint * pylint * pylint * pylint * fix * pylint * add is_rank_0 guard * update assert message * load checkpoint * load checkpoint * update load_checkpoint * remove unused args * fix * remove unused imports * optim state restore * optim state restore * optim state restore * optim state restore; docu * optim state restore; docu * override lightning_module_state_dict * fix * load_state_dict * add autoresume * add automodel & switch to HFdatamodule * uncomment test * enable test * Apply isort and black reformatting * at most one change * add module name materializer * skip test for now * minor fix * fix * Introduce hf_adapter and hf_weights directories * fix comment * docu * pylitn * docu * pylitn * fix * add HFAdapterKeyRenamer * add HFAdapterKeyRenamer * add HFAdapterKeyRenamer * pylint; * typo * fix * use HF_WEIGHTS_PATH * add logger * update sft.py * update tests * add auto-resume tests * comment fsdp2 test * remove unused imports * use mixtral_2l instead of hf_gemma_2b * nemo auto-model peft restoration * nemo auto-model peft restoration * update peft test * update peft test * skip params without grad * disable optim state restore * add verify peft checkpoint * add auto-restore test * fix arg typo * fix * fix * fix device_mesh init and load_model_state_dict * comments * update ValidateCheckpointRestoreCallback * add missing imports * remove unused 
imports * docu * pylint * move parallelize_fn to FSDP2Strategy * move parallelize_fn to FSDP2Strategy * move parallelize_fn to FSDP2Strategy * move parallelize_fn to FSDP2Strategy * move parallelize_fn to FSDP2Strategy * move parallelize_fn to FSDP2Strategy * fix * fix * remove fully_shard from lora * trigger parallelize from peft * remove parallelize_fn * pylint * add mp_policy param * handle torch's migration in import * refix * fix * fix * fix * fix * minor change * add connector between model and optimizer * update for automodel * make HF_WEIGHTS_PATH * workaround fiddle * fix * add tests for PytorchOptimizerModule * add io/hf.py test * remove unused import * fix typo * update verify_sft_checkpoint_structure * copyright * load adapter weights to cpu * update automodels * update automodels * use getattr to handle children * fix test * fix * remove unused * drop connect_optim_builder change * use .format * switch to .format * fix * Apply isort and black reformatting --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: akoumpa <[email protected]>
Changes:
Strategies other than MegatronStrategy do not define a consumed_samples attribute; the step attribute is used instead.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
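A minimal sketch of the behavior described under Changes, assuming the counter is read from the strategy object. The two strategy classes and `progress_counter` below are hypothetical stand-ins written for illustration; they are not NeMo classes or APIs, and the actual attribute lookup in the PR may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeMegatronStrategy:
    """Hypothetical stand-in for a strategy that tracks consumed samples."""

    consumed_samples: int = 0


@dataclass
class FakeNonMegatronStrategy:
    """Hypothetical stand-in for a strategy that only tracks the step."""

    step: int = 0


def progress_counter(strategy) -> tuple[str, int]:
    """Return (name, value) of the progress counter the strategy exposes.

    Assumption: a strategy without `consumed_samples` exposes `step` instead.
    """
    if hasattr(strategy, "consumed_samples"):
        return "consumed_samples", strategy.consumed_samples
    return "step", strategy.step


if __name__ == "__main__":
    print(progress_counter(FakeMegatronStrategy(consumed_samples=128)))
    print(progress_counter(FakeNonMegatronStrategy(step=7)))
```

Checking for the attribute rather than the strategy type keeps the sketch independent of any particular class hierarchy.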
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information