LightningLite Integration #2700

Closed

Conversation

@aniketmaurya commented Apr 3, 2022

Fixes #2697

This PR demonstrates a refactor that integrates LightningLite for scalable model training, with support for multiple hardware accelerators, mixed precision, and DDP.
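For context, here is a rough sketch of the shape such a LightningLite-based loop takes (illustrative only, not the code in this PR; the class and argument names are made up):

import torch
from pytorch_lightning.lite import LightningLite


class LiteTrainer(LightningLite):
    def run(self, model, dataloader, max_epochs=1, lr=0.1):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        # setup() returns wrappers that handle device placement, precision
        # and (when enabled) DDP gradient synchronization
        model, optimizer = self.setup(model, optimizer)
        dataloader = self.setup_dataloaders(dataloader)
        for _ in range(max_epochs):
            for batch in dataloader:
                optimizer.zero_grad()
                loss = model(batch)  # assumes the model returns its training loss
                self.backward(loss)  # replaces loss.backward()
                optimizer.step()


# hardware, precision and strategy become constructor arguments, e.g.
# LiteTrainer(accelerator="gpu", devices=2, strategy="ddp", precision=16).run(...)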


  • Ready for Review
  • Ready to merge
  • Update Example
  • Update Tests

@aniketmaurya marked this pull request as ready for review April 4, 2022 13:46
@aniketmaurya (Author)

Hi @alanakbik, could you please help with enabling the CI tests?

@alanakbik (Collaborator)

@aniketmaurya done! just a heads-up: the release of flair 0.11 is imminent, so I'll review the PR after the release!

@aniketmaurya (Author) commented Apr 4, 2022

Thanks @alanakbik! Sure, please take your time. I have one request: if you find some time, could you help me with a text corpus for preparing an example?

@aniketmaurya changed the title from [WIP] LightningLite Integration to LightningLite Integration Apr 4, 2022
@aniketmaurya (Author)

> @aniketmaurya done! just a heads-up: the release of flair 0.11 is imminent, so I'll review the PR after the release!

@alanakbik btw, we could jointly release this with flair 0.11. I can help with any tests required for the LightningLite trainer integration, and then we could potentially collaborate on a release article for cross-marketing.

This is just an idea; let me know what you think! 🙂

@Borda commented Apr 5, 2022

This is awesome! In particular, I like that the code is easier to read and removes some conditional imports :)

@whoisjones (Member) left a comment

Hi @aniketmaurya, awesome, looking forward to integrating this. Have you tested your changes using multiple GPUs? See my comment in the review.

@@ -285,12 +296,12 @@ def train(
         self.model.train()

         # reset variables
-        hidden = self.model.init_hidden(mini_batch_size)
+        hidden = self.model.module.init_hidden(mini_batch_size)
@whoisjones (Member)

Have you tried running it on a single machine with multiple GPUs? I get an error: LightningLite wraps DistributedDataParallel, which wraps the LanguageModel class, so this line would have to be self.model.module.module.init_hidden. However, that does not look correct to me. When using just the CPU, this configuration works. Can you check?

@aniketmaurya (Author) Apr 6, 2022

Thanks for checking and reporting this @whoisjones, let me test it on multiple GPUs and I will fix it.

@@ -418,14 +425,15 @@ def evaluate(self, data_source, eval_batch_size, sequence_length):
         total_loss = 0
         ntokens = len(self.corpus.dictionary)

-        hidden = self.model.init_hidden(eval_batch_size)
+        hidden = self.model.module.init_hidden(eval_batch_size)
@whoisjones (Member)

Same issue as mentioned before.

@@ -418,14 +425,15 @@ def evaluate(self, data_source, eval_batch_size, sequence_length):
         total_loss = 0
         ntokens = len(self.corpus.dictionary)

-        hidden = self.model.init_hidden(eval_batch_size)
+        hidden = self.model.module.init_hidden(eval_batch_size)

         for i in range(0, data_source.size(0) - 1, sequence_length):
             data, targets = self._get_batch(data_source, i, sequence_length)
             prediction, rnn_output, hidden = self.model.forward(data, hidden)
             output_flat = prediction.view(-1, ntokens)
             total_loss += len(data) * self.loss_function(output_flat, targets).data
@whoisjones (Member)

If using multiple GPUs, we need to make sure these tensors end up on the same device; I ran into an error here.

@aniketmaurya (Author)

Setting flair.device = self.device moves the tensors to the appropriate device when training with multiple devices. I have added the fix. :)
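For reference, a minimal sketch of the idea (the exact placement inside the trainer is illustrative):

import flair
from pytorch_lightning.lite import LightningLite


class LiteTrainer(LightningLite):
    def run(self, *args, **kwargs):
        # LightningLite assigns each DDP process its own device; syncing flair's
        # global device ensures tensors created via flair.device land on that GPU
        flair.device = self.device
        ...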

@Borda commented Apr 6, 2022

> Have you tested your changes using multiple GPUs? See my comment in the review.

I think the next step would be to integrate with EcoCI, which will also run nightly on GPUs:
Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI

@aniketmaurya (Author)

Hi @whoisjones, thanks again for reviewing the PR :)

I have fixed the GPU and CPU device mismatch.
For training with the DDP strategy we could do a conditional check, but that won't generalize to other strategies. So we have already started a PR to fix this here 😃

@Borda left a comment

lgtm

@whoisjones (Member)

Hey @aniketmaurya, have you already heard anything from the pytorch-lightning community?

@aniketmaurya (Author)

> Hey @aniketmaurya, have you already heard anything from the pytorch-lightning community?

Hi @whoisjones, sorry, I didn't get what you mean :)

@alanakbik (Collaborator)

Hi @aniketmaurya, I think the question is about this PR: Lightning-AI/pytorch-lightning#12597 - if I understand correctly, it needs to be merged first so that LightningLite runs for Flair with the DDP strategy.

@aniketmaurya (Author) commented May 8, 2022

Hi @alanakbik, sorry for the late response! Right, we need to merge the PL-side PR first, but we can do a quick fix until then:

The model (an nn.Module) is wrapped in the DistributedDataParallel class, so we cannot access the init_hidden method directly, but we can use a conditional statement to reach it until the PR is merged.
This approach works for single-device training and DDP, but it won't generalize to the other strategies available in PyTorch Lightning, such as DeepSpeed and ddp_spawn.

DistributedDataParallel(
  (module): LanguageModel(
    (drop): Dropout(p=0.1, inplace=False)
    (encoder): Embedding(275, 100)
    (rnn): LSTM(100, 128)
    (decoder): Linear(in_features=128, out_features=275, bias=True)
  )
)

Use isinstance(model, DistributedDataParallel) to check for DDP and access the init_hidden method on model.module.
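Something along these lines, as a sketch of the interim workaround (the helper name is made up):

from torch.nn.parallel import DistributedDataParallel


def unwrap_language_model(model):
    # after LightningLite's setup(), the LanguageModel may additionally be
    # wrapped in DistributedDataParallel, so peel that layer off when present
    if isinstance(model, DistributedDataParallel):
        return model.module
    return model


# usage inside the trainer (illustrative):
# hidden = unwrap_language_model(self.model.module).init_hidden(mini_batch_size)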

@alanakbik (Collaborator)

Thanks for the quick fix @aniketmaurya! It's probably best, though, to wait until the PR is merged, as otherwise it might be confusing for users that only a subset of PyTorch Lightning capabilities is available.

@machinelearnear

Thanks everyone for the great work on this. I've been trying to get Flair to work on a single instance with multiple GPUs as well, so I'm very interested in this PR. I see that the Lightning PR was finally merged 7 days ago.

@aniketmaurya (Author)

Hi @machinelearnear, thank you for your support!
@alanakbik, Lightning has also dropped support for Python 3.6. To get the latest version with bug fixes, we will have to upgrade from Python 3.6.

I already have this PR open, which removes Python 3.6 support for Flair.

@alanakbik (Collaborator)

@aniketmaurya that is great, thanks for the update! We will do another minor release first and reserve the update to Python 3.7 for the next bigger release - I'll keep you posted!

@machinelearnear

@aniketmaurya any plans to also integrate trainer.py next? I did some tests here: https://github.com/machinelearnear/use-lightninglite-sagemaker-and-flair/blob/main/example-flair/notebook_flair.ipynb but I'm still having problems porting some functions to PL (batches are split differently across GPUs for some reason). Thanks for all the nice work!

@dchaplinsky

Any updates on this one? I have two GPUs now and I'm eager to try this with Flair!

@Borda commented Sep 17, 2022

cc: @awaelchli 🐰

@alanakbik (Collaborator)

@dchaplinsky I tried today with the most recent lightning release and DDP strategy, using the code from the example that comes with the PR:

trainer = LanguageModelTrainer(accelerator="gpu", devices="auto", strategy='ddp')

It worked quite well in that it was automatically using 4 GPUs that were all going at 100%. It seems each GPU gets a different split of the data.

However, when epoch 1 ended, it got stuck without throwing an error, and I terminated execution after a while. @aniketmaurya any idea why this might be the case?

@aniketmaurya (Author)

Hi @alanakbik, is it possible to get the data and training details to reproduce this error? Also, could you please provide the error trace?

@alanakbik (Collaborator)

Hello @aniketmaurya sure, I'll build a minimal example for you.

@alanakbik (Collaborator)

Here is a minimal training data example for my training script. Unpack it wherever you like and point the script to the root folder:
penn_lm.zip. Note that we usually split the training data into a folder with many smaller splits; the data loader can then load the splits asynchronously. Patience is computed against the splits, which allows for annealing mid-epoch.
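For anyone reproducing this: the corpus directory follows flair's usual language model layout, roughly as below (the split file names are just examples):

resources/tasks/penn_lm/
├── train/        # many small split files, loaded asynchronously during training
│   ├── train_split_1
│   ├── train_split_2
│   └── ...
├── valid.txt
└── test.txt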

I used the following training script on a machine with 3 Nvidia 3090 GPUs and Python 3.8:

import flair

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

flair.device = 'cuda:0'

# are you training a forward or backward LM?
is_forward_lm = True

# get your corpus, process forward and at the character level
dictionary = Dictionary.load('chars')

# get corpus
corpus = TextCorpus("resources/tasks/penn_lm/",
                    dictionary,
                    is_forward_lm,
                    character_level=True,
                    random_case_flip=True,
                    )

language_model = LanguageModel(dictionary,
                               is_forward_lm=True,
                               hidden_size=128,
                               nlayers=1,
                               )
print(language_model)

# train your language model
trainer = LanguageModelTrainer(accelerator="gpu", devices="auto", strategy='ddp')

# train your language model
trainer.train(language_model,
              corpus,
              "resources/LMs/penn-gpu-auto-ddp",
              sequence_length=50,
              mini_batch_size=10,
              learning_rate=10,
              patience=10,
              max_epochs=20,
              checkpoint=True,
              num_workers=4,
              )

print(language_model.generate_text())

It automatically takes all three GPUs and runs them at 100% (great), but at the end of the first epoch nothing more happens. Here is the last output I get:

2022-10-31 09:34:56,368 read text file with 2000 lines
2022-10-31 09:34:56,369 shuffled
2022-10-31 09:34:56,372 Sequence length is 50
2022-10-31 09:34:56,373 Split 1  - (09:34:56)
2022-10-31 09:34:56,379 read text file with 2000 lines
2022-10-31 09:34:56,380 shuffled
2022-10-31 09:34:56,382 read text file with 2000 lines
2022-10-31 09:34:56,383 shuffled
2022-10-31 09:34:56,384 read text file with 2000 lines
2022-10-31 09:34:56,385 shuffled
2022-10-31 09:34:56,386 read text file with 2000 lines
2022-10-31 09:34:56,386 Sequence length is 50
2022-10-31 09:34:56,386 shuffled
2022-10-31 09:34:56,387 Split 1  - (09:34:56)
2022-10-31 09:34:56,395 read text file with 2000 lines
2022-10-31 09:34:56,396 read text file with 2000 lines
2022-10-31 09:34:56,396 shuffled
2022-10-31 09:34:56,396 shuffled
2022-10-31 09:34:56,399 read text file with 2000 lines
2022-10-31 09:34:56,399 shuffled
2022-10-31 09:34:56,419 read text file with 42068 lines
2022-10-31 09:34:56,436 shuffled
2022-10-31 09:34:59,746 | split   1/ 23 |   100/  483 batches | ms/batch 33.72 | loss 3.0842 | ppl 21.8503
2022-10-31 09:34:59,746 | split   1/ 23 |   100/  478 batches | ms/batch 33.59 | loss 3.0945 | ppl 22.0756
2022-10-31 09:34:59,746 | split   1/ 23 |   100/  480 batches | ms/batch 34.54 | loss 3.1399 | ppl 23.1021
2022-10-31 09:34:59,927 | split   1/ 23 |   200/  483 batches | ms/batch  1.80 | loss 2.2430 | ppl 9.4220
2022-10-31 09:34:59,927 | split   1/ 23 |   200/  478 batches | ms/batch  1.80 | loss 2.2366 | ppl 9.3619
2022-10-31 09:34:59,927 | split   1/ 23 |   200/  480 batches | ms/batch  1.80 | loss 2.2417 | ppl 9.4089
2022-10-31 09:35:00,102 | split   1/ 23 |   300/  483 batches | ms/batch  1.75 | loss 2.0134 | ppl 7.4888
2022-10-31 09:35:00,102 | split   1/ 23 |   300/  478 batches | ms/batch  1.75 | loss 2.0211 | ppl 7.5464
2022-10-31 09:35:00,102 | split   1/ 23 |   300/  480 batches | ms/batch  1.75 | loss 2.0382 | ppl 7.6765
2022-10-31 09:35:00,278 | split   1/ 23 |   400/  480 batches | ms/batch  1.76 | loss 1.8816 | ppl 6.5641
2022-10-31 09:35:00,278 | split   1/ 23 |   400/  478 batches | ms/batch  1.76 | loss 1.8837 | ppl 6.5780
2022-10-31 09:35:00,278 | split   1/ 23 |   400/  483 batches | ms/batch  1.77 | loss 1.8589 | ppl 6.4168

On a single GPU on my local machine it works fine.

@alanakbik (Collaborator)

Quick update: it trains successfully if the train folder only contains a single split, so maybe the data loader messes up the execution.

@dchaplinsky commented Oct 31, 2022 via email

@alanakbik (Collaborator)

@dchaplinsky from what I can tell, it runs a mini-batch on each GPU and then gathers the gradients for a single update. So in the same amount of time, you double (with two GPUs) your effective mini-batch size. This seems to be primarily a way to achieve larger mini-batch sizes by using multiple GPUs.

I think there is currently something wrong with the way the splits are distributed across GPUs. I would expect the dataset to be partitioned so that each GPU gets different splits, but currently it looks like each GPU gets the full dataset, which is why there is no speedup.
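One possible way to shard the splits per process, sketched below (this is not what the PR currently does; split_paths is a placeholder for however the corpus exposes its split files, while global_rank and world_size are the attributes LightningLite provides):

def splits_for_this_process(split_paths, global_rank, world_size):
    # round-robin assignment: each DDP process trains on a disjoint subset of
    # the corpus splits instead of iterating over the full set
    return split_paths[global_rank::world_size]


# inside the LightningLite-based trainer (illustrative):
# my_splits = splits_for_this_process(split_paths, self.global_rank, self.world_size)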

@dchaplinsky commented Nov 1, 2022 via email

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the wontfix (This will not be worked on) label on Mar 18, 2023
@dchaplinsky

Please keep it open and make it happen!

The stale bot removed the wontfix (This will not be worked on) label on Mar 18, 2023
@Borda commented Mar 18, 2023

@aniketmaurya could you update it with 2.0? 🐿️

@aniketmaurya (Author) commented Mar 21, 2023

> @aniketmaurya could you update it with 2.0? 🐿️

This branch was around ~500 commits behind! I am updating it with the refactored new LightningLite, now called Fabric.

@aniketmaurya (Author)

Created a new PR with Fabric (the revamped LightningLite). It is much cleaner!!
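For readers comparing the two APIs, here is a minimal self-contained sketch of a Fabric loop (illustrative only, not the new PR's code):

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices="auto")  # e.g. strategy="ddp" for multi-GPU
fabric.launch()

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
loader = fabric.setup_dataloaders(loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    fabric.backward(loss)  # replaces loss.backward()
    optimizer.step()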

Successfully merging this pull request may close these issues:

Scale Model Training with LightningLite