Update checkpointing directory -> using vLLM and from_pretrained #2074
Merged
Commits
39 commits
4bbc330
comments
f8c40c6
save ckpt compatible with hf
2ee50f0
checking differences
1c907df
add better ckpt paths
623daf7
add base_model_name_or_path
5e98008
better support for adapter / recipe state defaults
87b89c1
copy files from ckpt dir to output dir
a55b6ca
mkdir + dont copy cache
fc6cfbe
minor updates
01bc4be
modularization + better logic
e544031
comment
623a955
fix ckpt file
4438da0
remove contants from init
f4ecce5
update hardcoded dirname
6f828ce
update docs
a8cc992
fix tests
6ebc9ad
checkpointer tests pass
e99fbd5
dowloads tests pass
5d48860
update recipe tests
d6d1f84
and another one
9a10f93
update more tests
5dd2203
and another one
a803638
add suffix variable based on ckpter type
8fd9237
and another one
60548bd
ooops
87fc8cd
YOU SHALL PASS
b97515b
is this it?
6a41db1
modularize + tests + back to .pt
afd6623
docstrings
03ce473
hardcod to look for recipe_state.pt if its not provided
c797e4f
update input_dir
788986c
input dir != output dir
8b2199c
replace todo
fba9090
add todo
300a4f2
Merge branch 'main' into checkpointer
36e79e8
merge conflict
15498c7
fix paths
b34006f
...
34c12ff
typo
@@ -29,6 +29,12 @@
     TOKENIZER_PATHS,
 )
 
+from torchtune.training.checkpointing._utils import (
+    get_largest_iter_folder,
+    RECIPE_STATE_DIRNAME,
+    SHARD_FNAME,
+)
+
 
 class TestFullFinetuneSingleDeviceRecipe:
     def _get_test_config_overrides(self):
@@ -173,15 +179,21 @@ def test_training_state_on_resume(self, tmpdir, monkeypatch):
             runpy.run_path(TUNE_PATH, run_name="__main__")
 
         # Resume training
+        epoch_folder = get_largest_iter_folder(tmpdir)
+        epoch_folder_minus_one = f"epoch_{int(epoch_folder.split('_')[-1]) - 1}"
+        suffix = ".safetensors"
+        model_ckpt_fname = (
+            SHARD_FNAME.format(cpt_idx="1".zfill(5), num_shards="1".zfill(5)) + suffix
+        )
         cmd_2 = f"""
         tune run full_finetune_single_device \
             --config llama2/7B_full_low_memory \
             batch_size=8 \
             output_dir={tmpdir} \
             checkpointer._component_=torchtune.training.FullModelHFCheckpointer \
-            checkpointer.checkpoint_dir={tmpdir} \
-            checkpointer.checkpoint_files=[{os.path.join(tmpdir, "hf_model_0001_0.pt")}]\
-            checkpointer.recipe_checkpoint={os.path.join(tmpdir, "recipe_state.pt")}\
+            checkpointer.checkpoint_dir={ckpt_dir} \
+            checkpointer.checkpoint_files=[{os.path.join(epoch_folder_minus_one, model_ckpt_fname)}]\
+            checkpointer.recipe_checkpoint={os.path.join(RECIPE_STATE_DIRNAME, "recipe_state.pt")}\
             checkpointer.output_dir={tmpdir} \
             checkpointer.model_type=LLAMA2 \
             tokenizer.path=/tmp/test-artifacts/tokenizer.model \
@@ -196,6 +208,7 @@ def test_training_state_on_resume(self, tmpdir, monkeypatch):
         with pytest.raises(SystemExit, match=""):
             runpy.run_path(TUNE_PATH, run_name="__main__")
 
+        raise NotImplementedError("")
         expected_loss_values = self._fetch_expected_loss_values("llama2")[2:]
 
         loss_values = get_loss_values_from_metric_logger(log_file)

Review comment on the added line raise NotImplementedError(""): remove
The test changes are about locating the checkpoint files, which were previously hardcoded.
Now we retrieve the epoch folder, get the suffix based on ckpt_type, and build the checkpoint file name from the defined standard.
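
For readers without the torchtune source at hand, the lookup described above can be sketched in isolation. This is only an illustration, not the PR's implementation: `largest_epoch_folder` is a hypothetical stand-in for `get_largest_iter_folder`, and the values assigned to `SHARD_FNAME` and `RECIPE_STATE_DIRNAME` are assumptions (the real constants live in `torchtune.training.checkpointing._utils` and may differ). Only the `epoch_<N>` folder naming, the zero-padded indices, and the `recipe_state.pt` file name are taken from the diff.

```python
import os
import re
import tempfile

# Assumed values; the real constants are defined in torchtune and may differ.
SHARD_FNAME = "model-{cpt_idx}-of-{num_shards}"
RECIPE_STATE_DIRNAME = "recipe_state"


def largest_epoch_folder(root):
    """Hypothetical stand-in for get_largest_iter_folder.

    Returns the name of the epoch_<N> subfolder of root with the largest N.
    """
    pattern = re.compile(r"epoch_(\d+)")
    best_name, best_idx = None, -1
    for name in os.listdir(root):
        match = pattern.fullmatch(name)
        if match is None or not os.path.isdir(os.path.join(root, name)):
            continue
        idx = int(match.group(1))
        if idx > best_idx:
            best_name, best_idx = name, idx
    if best_name is None:
        raise FileNotFoundError(f"no epoch_<N> folder found in {root}")
    return best_name


def resume_paths(root, suffix=".safetensors"):
    """Build the relative checkpoint and recipe-state paths the way the test does."""
    epoch_folder = largest_epoch_folder(root)
    # The test resumes from the epoch before the latest one.
    epoch_folder_minus_one = f"epoch_{int(epoch_folder.split('_')[-1]) - 1}"
    model_ckpt_fname = (
        SHARD_FNAME.format(cpt_idx="1".zfill(5), num_shards="1".zfill(5)) + suffix
    )
    ckpt_file = os.path.join(epoch_folder_minus_one, model_ckpt_fname)
    recipe_file = os.path.join(RECIPE_STATE_DIRNAME, "recipe_state.pt")
    return ckpt_file, recipe_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name in ("epoch_0", "epoch_1"):
            os.makedirs(os.path.join(root, name))
        print(resume_paths(root))
```

Passing `suffix` as a parameter mirrors the idea of choosing the extension by checkpointer type, so the same helper would cover a `.pt` checkpoint by calling `resume_paths(root, suffix=".pt")`.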