LightningLite Integration #2700
Conversation
hi @alanakbik, could you please help in enabling the CI tests?
@aniketmaurya done! Just a heads-up: the release of flair 0.11 is imminent, so I'll review the PR after the release!
thanks @alanakbik! Sure, please take your time.
@alanakbik by the way, we could jointly release this with flair 0.11. I can help with any tests required for the LightningLite Trainer integration, and then we could potentially collaborate on a release article for cross-marketing. This is just an idea; let me know what you think! 🙂
This is awesome! In particular, I like that the code is easier to read and reduces some conditional imports :)
hi @aniketmaurya, awesome, looking forward to integrating this. Have you tested your development using multiple GPUs? See my comment in the review.
@@ -285,12 +296,12 @@ def train(
     self.model.train()

     # reset variables
-    hidden = self.model.init_hidden(mini_batch_size)
+    hidden = self.model.module.init_hidden(mini_batch_size)
Have you tried running it on a single machine with multiple GPUs? I get an error: LightningLite wraps DistributedDataParallel, which in turn wraps the LanguageModel class, so this line would have to be self.model.module.module.init_hidden. However, that doesn't look correct to me. When using just the CPU, this configuration works. Can you check?
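For illustration only, one common way to deal with this kind of double wrapping is to unwrap the model until the underlying module is reached, so the same call works on CPU, a single GPU, and under DDP. The helper below is a hypothetical sketch, not code from this PR:

def unwrap_model(model):
    # Strip wrapper layers (e.g. LightningLite's module wrapper and
    # DistributedDataParallel), each of which exposes the wrapped model
    # as `.module`, until the plain LanguageModel is reached.
    while hasattr(model, "module"):
        model = model.module
    return model

# hypothetical usage inside the trainer:
# hidden = unwrap_model(self.model).init_hidden(mini_batch_size)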
thanks for checking and reporting this @whoisjones, let me test this on multiple GPUs and I will fix it.
@@ -418,14 +425,15 @@ def evaluate(self, data_source, eval_batch_size, sequence_length):
     total_loss = 0
     ntokens = len(self.corpus.dictionary)

-    hidden = self.model.init_hidden(eval_batch_size)
+    hidden = self.model.module.init_hidden(eval_batch_size)
Same issue as mentioned before.
@@ -418,14 +425,15 @@ def evaluate(self, data_source, eval_batch_size, sequence_length):
     total_loss = 0
     ntokens = len(self.corpus.dictionary)

-    hidden = self.model.init_hidden(eval_batch_size)
+    hidden = self.model.module.init_hidden(eval_batch_size)

     for i in range(0, data_source.size(0) - 1, sequence_length):
         data, targets = self._get_batch(data_source, i, sequence_length)
         prediction, rnn_output, hidden = self.model.forward(data, hidden)
         output_flat = prediction.view(-1, ntokens)
         total_loss += len(data) * self.loss_function(output_flat, targets).data
When using multiple GPUs, we need to make sure the tensors here are on the same device; I got an error at this line.
flair.device = self.device will move the tensors to the appropriate device when multiple devices are used. I have added the fix. :)
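A rough sketch of the idea (the helper and attribute names are assumptions, not the PR's exact code): keep flair's global device in sync with the device LightningLite assigned to this process, and move each batch onto it before the forward pass.

import torch
import flair

def move_batch_to_process_device(trainer, batch: torch.Tensor) -> torch.Tensor:
    # `trainer.device` is assumed to be the per-process device provided by
    # LightningLite (e.g. cuda:0, cuda:1, ...). Pointing flair.device at it keeps
    # tensors created elsewhere in flair on the same GPU as the DDP-wrapped model.
    flair.device = trainer.device
    return batch.to(flair.device)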
I think the next step would be to integrate with EcoCI, which will also run nightly on GPUs.
Hi @whoisjones, thanks again for reviewing the PR :) I have fixed the GPU/CPU device mismatch.
lgtm
Hey @aniketmaurya, have you already heard anything from the pytorch-lightning community?
Hi @whoisjones, sorry, I didn't get you :)
Hi @aniketmaurya, I think the question is about this PR: Lightning-AI/pytorch-lightning#12597 - if I understand correctly, it needs to be merged first so that LightningLite runs for Flair with the DDP strategy.
Hi @alanakbik, sorry for the late response! Right, we need to merge the PL side PR first but can do a quick fix until then - The model (
Use |
Thanks for the quick fix @aniketmaurya! It's probably best, though, to wait until the PR is merged, as otherwise it might be confusing for users that only a subset of PyTorch Lightning capabilities is available.
Thanks everyone for the great work on this. I've been trying to get Flair to work on single-instance multi-GPU as well, so I'm very interested in this PR. I see that the Lightning PR was finally merged 7 days ago.
Hi @machinelearnear, thank you for your support! I already have this PR open, which removes Python 3.6 support from Flair.
@aniketmaurya that is great, thanks for the update! We will do another minor release first and reserve the update to Python 3.7 for the next bigger release - I'll keep you posted! |
@aniketmaurya any plans to also integrate |
Any updates on this one? I have two GPUs now and eager to try it with Flair! |
cc: @awaelchli 🐰 |
@dchaplinsky I tried today with the most recent lightning release and the DDP strategy, using the code from the example that comes with the PR:

trainer = LanguageModelTrainer(accelerator="gpu", devices="auto", strategy='ddp')

It worked quite well in that it was automatically using 4 GPUs that were all going at 100%. It seems each GPU gets a different split of the data. However, when epoch 1 ended, it got stuck without throwing an error, and I terminated execution after a while. @aniketmaurya any idea why this might be the case?
hi @alanakbik, Is it possible to get the data and training details to reproduce this error? Also could you please provide the error trace? |
Hello @aniketmaurya sure, I'll build a minimal example for you. |
Here is a minimal training data example for my training script. Unpack this where you like and point the script to the root folder. I used the following training script on a machine with 3 Nvidia 3090 GPUs and Python 3.8:

import flair
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
flair.device = 'cuda:0'
# are you training a forward or backward LM?
is_forward_lm = True
# get your corpus, process forward and at the character level
dictionary = Dictionary.load('chars')
# get corpus
corpus = TextCorpus("resources/tasks/penn_lm/",
                    dictionary,
                    is_forward_lm,
                    character_level=True,
                    random_case_flip=True,
                    )

language_model = LanguageModel(dictionary,
                               is_forward_lm=True,
                               hidden_size=128,
                               nlayers=1,
                               )

print(language_model)

# train your language model
trainer = LanguageModelTrainer(accelerator="gpu", devices="auto", strategy='ddp')

# train your language model
trainer.train(language_model,
              corpus,
              "resources/LMs/penn-gpu-auto-ddp",
              sequence_length=50,
              mini_batch_size=10,
              learning_rate=10,
              patience=10,
              max_epochs=20,
              checkpoint=True,
              num_workers=4,
              )
print(language_model.generate_text())

It automatically takes all three GPUs and runs them at 100% (great), but at the end of the first epoch nothing more happens. Here is the last output I get:

2022-10-31 09:34:56,368 read text file with 2000 lines
2022-10-31 09:34:56,369 shuffled
2022-10-31 09:34:56,372 Sequence length is 50
2022-10-31 09:34:56,373 Split 1 - (09:34:56)
2022-10-31 09:34:56,379 read text file with 2000 lines
2022-10-31 09:34:56,380 shuffled
2022-10-31 09:34:56,382 read text file with 2000 lines
2022-10-31 09:34:56,383 shuffled
2022-10-31 09:34:56,384 read text file with 2000 lines
2022-10-31 09:34:56,385 shuffled
2022-10-31 09:34:56,386 read text file with 2000 lines
2022-10-31 09:34:56,386 Sequence length is 50
2022-10-31 09:34:56,386 shuffled
2022-10-31 09:34:56,387 Split 1 - (09:34:56)
2022-10-31 09:34:56,395 read text file with 2000 lines
2022-10-31 09:34:56,396 read text file with 2000 lines
2022-10-31 09:34:56,396 shuffled
2022-10-31 09:34:56,396 shuffled
2022-10-31 09:34:56,399 read text file with 2000 lines
2022-10-31 09:34:56,399 shuffled
2022-10-31 09:34:56,419 read text file with 42068 lines
2022-10-31 09:34:56,436 shuffled
2022-10-31 09:34:59,746 | split 1/ 23 | 100/ 483 batches | ms/batch 33.72 | loss 3.0842 | ppl 21.8503
2022-10-31 09:34:59,746 | split 1/ 23 | 100/ 478 batches | ms/batch 33.59 | loss 3.0945 | ppl 22.0756
2022-10-31 09:34:59,746 | split 1/ 23 | 100/ 480 batches | ms/batch 34.54 | loss 3.1399 | ppl 23.1021
2022-10-31 09:34:59,927 | split 1/ 23 | 200/ 483 batches | ms/batch 1.80 | loss 2.2430 | ppl 9.4220
2022-10-31 09:34:59,927 | split 1/ 23 | 200/ 478 batches | ms/batch 1.80 | loss 2.2366 | ppl 9.3619
2022-10-31 09:34:59,927 | split 1/ 23 | 200/ 480 batches | ms/batch 1.80 | loss 2.2417 | ppl 9.4089
2022-10-31 09:35:00,102 | split 1/ 23 | 300/ 483 batches | ms/batch 1.75 | loss 2.0134 | ppl 7.4888
2022-10-31 09:35:00,102 | split 1/ 23 | 300/ 478 batches | ms/batch 1.75 | loss 2.0211 | ppl 7.5464
2022-10-31 09:35:00,102 | split 1/ 23 | 300/ 480 batches | ms/batch 1.75 | loss 2.0382 | ppl 7.6765
2022-10-31 09:35:00,278 | split 1/ 23 | 400/ 480 batches | ms/batch 1.76 | loss 1.8816 | ppl 6.5641
2022-10-31 09:35:00,278 | split 1/ 23 | 400/ 478 batches | ms/batch 1.76 | loss 1.8837 | ppl 6.5780
2022-10-31 09:35:00,278 | split 1/ 23 | 400/ 483 batches | ms/batch 1.77 | loss 1.8589 | ppl 6.4168

On a single GPU on my local machine it works fine.
Quick update: it trains successfully if the |
I've tried this branch on my 2 GPU (2060 and 1070) setup while training flair embeddings for Ukrainian (using different strategies, half of which didn't work :( ). Unfortunately, in this setup it gave me no performance boost, so I switched to a different scheme: trained forward and backward embeddings on different GPUs (for a month+). Would be happy to try it again one day.
@dchaplinsky from what I can tell, it does a mini-batch on each GPU and then gathers the gradients for a single update. So in the same amount of time, you double (with two GPUs) your mini-batch size. So this seems to be primarily a way to achieve higher mini-batch sizes by using multiple GPUs. I think there is currently something wrong in the way the splits are distributed across GPUs. I would expect the dataset to be partitioned so that each GPU gets different splits. But currently it looks like each GPU gets the full dataset, which is why there is no speedup.
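For what it's worth, a rough sketch of what per-rank partitioning of the corpus splits could look like (illustrative only, not code from this PR; the helper name and arguments are hypothetical): each process keeps only every world_size-th split file, so the GPUs train on disjoint shards instead of the full corpus each.

import torch.distributed as dist

def splits_for_this_rank(split_paths):
    # If torch.distributed is not initialized (single-process run), keep everything.
    if not (dist.is_available() and dist.is_initialized()):
        return split_paths
    # Otherwise give every rank a disjoint subset of the split files.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    return split_paths[rank::world_size]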
Yeah, similar feelings. I ended up modifying the code for the LM trainer, implementing a poor man's version of gradient accumulation, and ramped the batch size up 5 times.
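A minimal sketch of that gradient-accumulation idea (the function, its arguments, and the batch format are assumptions, not the actual modification): losses from several mini-batches are averaged before a single optimizer step, emulating a 5x larger effective batch on one GPU.

import torch

def train_with_grad_accumulation(model, optimizer, loss_fn, batches, hidden, ntokens,
                                 accumulation_steps=5):
    # Accumulate gradients over `accumulation_steps` mini-batches before each
    # optimizer step, emulating a larger effective batch size on a single GPU.
    optimizer.zero_grad()
    for step, (data, targets) in enumerate(batches):
        prediction, rnn_output, hidden = model.forward(data, hidden)
        loss = loss_fn(prediction.view(-1, ntokens), targets) / accumulation_steps
        loss.backward()
        # Detach the recurrent hidden state (assumed to be an LSTM tuple) so the
        # autograd graph does not grow across batches.
        hidden = tuple(h.detach() for h in hidden)
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    return hidden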
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Please keep it open and make it happen! |
@aniketmaurya could you update it with 2.0? 🐿️ |
This branch was around ~500 commits behind! I am updating it with the new, refactored LightningLite: Fabric.
Created a new PR with Fabric (the revamped LightningLite). It is much cleaner!
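For reference, the Lite-to-Fabric setup roughly maps as follows (a generic sketch based on the public Fabric API, not the new PR's actual code):

from lightning.fabric import Fabric

def setup_with_fabric(model, optimizer, dataloader):
    # Fabric replaces the LightningLite subclass + run() pattern with a plain object.
    fabric = Fabric(accelerator="gpu", devices="auto", strategy="ddp")
    fabric.launch()                                     # start one process per device
    model, optimizer = fabric.setup(model, optimizer)   # replaces self.setup(...) from Lite
    dataloader = fabric.setup_dataloaders(dataloader)   # replaces self.setup_dataloaders(...)
    return fabric, model, optimizer, dataloader         # use fabric.backward(loss) in the loop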
Fixes #2697
This PR attempts to show a refactor that integrates LightningLite for scalable model training, with multi-hardware support, mixed precision, and DDP.
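For context, the general LightningLite pattern such a refactor builds on looks roughly like this (a hedged sketch based on the pytorch-lightning Lite API of that era, not the PR's exact trainer code; the model is assumed to return its loss directly):

import torch
from pytorch_lightning.lite import LightningLite

class LiteTrainer(LightningLite):
    def run(self, model, dataloader, epochs=1, lr=0.1):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        model, optimizer = self.setup(model, optimizer)   # moves to device, wraps for DDP
        dataloader = self.setup_dataloaders(dataloader)   # adds a DistributedSampler under DDP
        for _ in range(epochs):
            for data, targets in dataloader:
                optimizer.zero_grad()
                loss = model(data, targets)               # assumption: model returns its loss
                self.backward(loss)                       # replaces loss.backward() (AMP/DDP aware)
                optimizer.step()

# example launch, mirroring the trainer call used elsewhere in this thread:
# LiteTrainer(accelerator="gpu", devices="auto", strategy="ddp").run(model, dataloader)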