FSDP integration #6152
@SeanNaren, after getting detailed memory usage, I finally figured out why the full model originally fits in one GPU but OOMs when checkpointing: in `checkpoint_connector` (https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L270-L277) we collect the full state dict again, which doubles the memory footprint.
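To make the doubling concrete, here is a rough sketch (illustrative only, not the actual connector code, and assuming FairScale's FSDP wrapper): under FSDP, `state_dict()` all-gathers every shard into full tensors, so collecting it allocates a second full copy of the weights on top of the model itself.

```python
# Illustrative sketch only: the checkpoint path effectively does something like this.
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def dump_checkpoint_naively(model: FSDP) -> dict:
    # state_dict() all-gathers the sharded parameters into full tensors on this
    # rank, so a second full copy of the weights briefly lives in GPU memory
    # alongside the model -- enough to push a "just fits" model into OOM.
    return {"state_dict": model.state_dict()}
```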
One easy workaround for now is to add an extra full-parameter summon there, but this is not ideal: we would summon the full parameters twice, which is unnecessary.
I feel we should modify that file to let the training type plugin control the collection, something like `trainer.accelerator.training_type_plugin.state_dict()`, especially since we would like to collect only the sharded state dict in the future.
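A minimal sketch of the proposed delegation (the class and attribute names below are assumptions for illustration, not the final Lightning API): the checkpoint connector asks the training type plugin for the state dict, so an FSDP plugin can return the sharded (local) state dict instead of gathering the full parameters.

```python
# Sketch only: hook names and signatures here are hypothetical.
from typing import Dict
import torch

class TrainingTypePlugin:
    def __init__(self, model: torch.nn.Module):
        self.model = model

    def state_dict(self) -> Dict[str, torch.Tensor]:
        # Default behaviour: full state dict, same as the connector does today.
        return self.model.state_dict()

class FullyShardedPlugin(TrainingTypePlugin):
    # Assumes self.model is the FairScale FSDP-wrapped module.
    def state_dict(self) -> Dict[str, torch.Tensor]:
        # FairScale FSDP exposes local_state_dict(); returning only this rank's
        # shard avoids the all-gather and the memory doubling described above.
        return self.model.local_state_dict()

# The checkpoint connector would then do, roughly:
#   checkpoint["state_dict"] = trainer.accelerator.training_type_plugin.state_dict()
```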
cc @ananthsub
@min-xu-ai, I think this is the root cause of the OOM; facebookresearch/fairscale#658 should not be the problem (and for setting `state_dict_device=torch.device("cpu")`, the CPU OOM should be a similar issue, as we also double the model storage in CPU memory).
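For completeness, the FairScale option quoted above is a constructor argument on `FullyShardedDataParallel`; a minimal sketch of wrapping with it (this assumes a distributed process group and device setup already exist, as they do inside the plugin):

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def wrap_with_cpu_state_dict(module: torch.nn.Module) -> FSDP:
    # Gathered state-dict tensors are placed on CPU, so the duplicate full copy
    # described above lands in host RAM (and can CPU-OOM) instead of GPU memory.
    return FSDP(module, state_dict_device=torch.device("cpu"))
```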
Thanks @shuyingsunshine21 for your help here! This makes sense, since we're allocating new memory.
I agree with allowing the training type plugin to return the state dict; we already rely on the accelerator to dump the optimizer dicts. I'm happy to make the change!
@SeanNaren, thanks, no worries. If you have not already made the change, I could help send a small PR for that.