LightningLite Integration #2700

Closed

Conversation

@aniketmaurya commented Apr 3, 2022

Fixes #2697

This PR demonstrates a refactor that integrates LightningLite for scalable model training, with support for multiple hardware accelerators, mixed precision, and DDP.
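For context, here is a rough sketch of the shape such a LightningLite-based loop takes (illustrative only, not the code in this PR; the class and argument names are made up):

import torch
from pytorch_lightning.lite import LightningLite


class LiteTrainer(LightningLite):
    def run(self, model, dataloader, max_epochs=1, lr=0.1):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        # setup() returns wrappers that handle device placement, precision
        # and (when enabled) DDP gradient synchronization
        model, optimizer = self.setup(model, optimizer)
        dataloader = self.setup_dataloaders(dataloader)
        for _ in range(max_epochs):
            for batch in dataloader:
                optimizer.zero_grad()
                loss = model(batch)  # assumes the model returns its training loss
                self.backward(loss)  # replaces loss.backward()
                optimizer.step()


# hardware, precision and strategy become constructor arguments, e.g.
# LiteTrainer(accelerator="gpu", devices=2, strategy="ddp", precision=16).run(...)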


  • Ready for Review
  • Ready to merge
  • Update Example
  • Update Tests

@aniketmaurya marked this pull request as ready for review April 4, 2022 13:46
@aniketmaurya (Author)

Hi @alanakbik, could you please help with enabling the CI tests?

@alanakbik (Collaborator)

@aniketmaurya done! just a heads-up: the release of flair 0.11 is imminent, so I'll review the PR after the release!

@aniketmaurya (Author) commented Apr 4, 2022

Thanks @alanakbik! Sure, please take your time. I have one request: if you find some time, could you help me with a text corpus for preparing an example?

@aniketmaurya changed the title from [WIP] LightningLite Integration to LightningLite Integration Apr 4, 2022
@aniketmaurya (Author)

> @aniketmaurya done! just a heads-up: the release of flair 0.11 is imminent, so I'll review the PR after the release!

@alanakbik btw, we could jointly release this with flair 0.11. I can help with any tests required for the LightningLite trainer integration, and then we could potentially collaborate on a release article for cross-marketing.

This is just an idea; let me know what you think! 🙂

@Borda commented Apr 5, 2022

This is awesome! In particular, I like that the code is easier to read and removes some conditional imports :)

@whoisjones (Member) left a comment

Hi @aniketmaurya, awesome, looking forward to integrating this. Have you tested your changes using multiple GPUs? See my comment in the review.

@@ -285,12 +296,12 @@ def train(
         self.model.train()

         # reset variables
-        hidden = self.model.init_hidden(mini_batch_size)
+        hidden = self.model.module.init_hidden(mini_batch_size)
@whoisjones (Member)

Have you tried running it on a single machine with multiple GPUs? I get an error: LightningLite wraps DistributedDataParallel, which wraps the LanguageModel class, so this line would have to be self.model.module.module.init_hidden. However, that does not look correct to me. When using just the CPU, this configuration works. Can you check?

@aniketmaurya (Author) Apr 6, 2022

Thanks for checking and reporting this @whoisjones, let me test it on multiple GPUs and I will fix it.

@@ -418,14 +425,15 @@ def evaluate(self, data_source, eval_batch_size, sequence_length):
         total_loss = 0
         ntokens = len(self.corpus.dictionary)

-        hidden = self.model.init_hidden(eval_batch_size)
+        hidden = self.model.module.init_hidden(eval_batch_size)
@whoisjones (Member)

Same issue as mentioned before.

@@ -418,14 +425,15 @@ def evaluate(self, data_source, eval_batch_size, sequence_length):
         total_loss = 0
         ntokens = len(self.corpus.dictionary)

-        hidden = self.model.init_hidden(eval_batch_size)
+        hidden = self.model.module.init_hidden(eval_batch_size)

         for i in range(0, data_source.size(0) - 1, sequence_length):
             data, targets = self._get_batch(data_source, i, sequence_length)
             prediction, rnn_output, hidden = self.model.forward(data, hidden)
             output_flat = prediction.view(-1, ntokens)
             total_loss += len(data) * self.loss_function(output_flat, targets).data
@whoisjones (Member)

If using multiple GPUs, we need to make sure these tensors end up on the same device; I ran into an error here.

@aniketmaurya (Author)

Setting flair.device = self.device moves the tensors to the appropriate device when training with multiple devices. I have added the fix. :)
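For reference, a minimal sketch of the idea (the exact placement inside the trainer is illustrative):

import flair
from pytorch_lightning.lite import LightningLite


class LiteTrainer(LightningLite):
    def run(self, *args, **kwargs):
        # LightningLite assigns each DDP process its own device; syncing flair's
        # global device ensures tensors created via flair.device land on that GPU
        flair.device = self.device
        ...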

@Borda commented Apr 6, 2022

> Have you tested your changes using multiple GPUs? See my comment in the review.

I think the next step would be to integrate with EcoCI, which will also run nightly on GPUs:
Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI

@aniketmaurya (Author)

Hi @whoisjones, thanks again for reviewing the PR :)

I have fixed the GPU and CPU device mismatch.
For training with the DDP strategy we could do a conditional check, but that won't generalize to other strategies. So we have already started a PR to fix this here 😃

@Borda left a comment

lgtm

@whoisjones (Member)

Hey @aniketmaurya, have you already heard anything from the pytorch-lightning community?

@aniketmaurya (Author)

> Hey @aniketmaurya, have you already heard anything from the pytorch-lightning community?

Hi @whoisjones, sorry, I didn't get what you mean :)

@alanakbik (Collaborator)

Hi @aniketmaurya, I think the question is about this PR: Lightning-AI/pytorch-lightning#12597 - if I understand correctly, it needs to be merged first so that LightningLite runs for Flair with the DDP strategy.

@aniketmaurya (Author) commented May 8, 2022

Hi @alanakbik, sorry for the late response! Right, we need to merge the PL-side PR first, but we can do a quick fix until then:

The model (an nn.Module) is wrapped in the DistributedDataParallel class, so we cannot access the init_hidden method directly, but we can use a conditional statement to reach it until the PR is merged.
This approach works for single-device training and DDP, but it won't generalize to the other strategies available in PyTorch Lightning, such as DeepSpeed and ddp_spawn.

DistributedDataParallel(
  (module): LanguageModel(
    (drop): Dropout(p=0.1, inplace=False)
    (encoder): Embedding(275, 100)
    (rnn): LSTM(100, 128)
    (decoder): Linear(in_features=128, out_features=275, bias=True)
  )
)

Use isinstance(model, DistributedDataParallel) to check for DDP and access the init_hidden method on model.module.
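Something along these lines, as a sketch of the interim workaround (the helper name is made up):

from torch.nn.parallel import DistributedDataParallel


def unwrap_language_model(model):
    # after LightningLite's setup(), the LanguageModel may additionally be
    # wrapped in DistributedDataParallel, so peel that layer off when present
    if isinstance(model, DistributedDataParallel):
        return model.module
    return model


# usage inside the trainer (illustrative):
# hidden = unwrap_language_model(self.model.module).init_hidden(mini_batch_size)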

@alanakbik (Collaborator)

Thanks for the quick fix @aniketmaurya! It's probably best, though, to wait until the PR is merged, as otherwise it might be confusing for users that only a subset of PyTorch Lightning capabilities is available.

@machinelearnear

Thanks everyone for the great work on this. I've been trying to get Flair to work on a single instance with multiple GPUs as well, so I'm very interested in this PR. I see that the Lightning PR was finally merged 7 days ago.

@aniketmaurya (Author)

Hi @machinelearnear, thank you for your support!
@alanakbik, Lightning has also dropped support for Python 3.6. To get the latest version with bug fixes, we will have to upgrade from Python 3.6.

I already have this PR open, which removes Python 3.6 support for Flair.

@alanakbik (Collaborator)

@aniketmaurya that is great, thanks for the update! We will do another minor release first and reserve the update to Python 3.7 for the next bigger release - I'll keep you posted!

@machinelearnear

@aniketmaurya any plans to also integrate trainer.py next? I did some tests here: https://github.com/machinelearnear/use-lightninglite-sagemaker-and-flair/blob/main/example-flair/notebook_flair.ipynb but I'm still having problems porting some functions to PL (batches are split differently across GPUs for some reason). Thanks for all the nice work!

@dchaplinsky

Any updates on this one? I have two GPUs now and I'm eager to try this with Flair!

@Borda commented Sep 17, 2022

cc: @awaelchli 🐰

@alanakbik (Collaborator)

@dchaplinsky I tried today with the most recent lightning release and DDP strategy, using the code from the example that comes with the PR:

trainer = LanguageModelTrainer(accelerator="gpu", devices="auto", strategy='ddp')

It worked quite well in that it was automatically using 4 GPUs that were all going at 100%. It seems each GPU gets a different split of the data.

However, when epoch 1 ended, it got stuck without throwing an error, and I terminated execution after a while. @aniketmaurya any idea why this might be the case?

@aniketmaurya (Author)

Hi @alanakbik, is it possible to get the data and training details to reproduce this error? Also, could you please provide the error trace?

@alanakbik (Collaborator)

Hello @aniketmaurya sure, I'll build a minimal example for you.

@alanakbik (Collaborator)

Here is a minimal training data example for my training script. Unpack it wherever you like and point the script to the root folder:
penn_lm.zip. Note that we usually split the training data into a folder with many smaller splits; the data loader can then load the splits asynchronously. Patience is computed against the splits, which allows for annealing mid-epoch.
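For anyone reproducing this: the corpus directory follows flair's usual language model layout, roughly as below (the split file names are just examples):

resources/tasks/penn_lm/
├── train/        # many small split files, loaded asynchronously during training
│   ├── train_split_1
│   ├── train_split_2
│   └── ...
├── valid.txt
└── test.txt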

I used the following training script on a machine with 3 Nvidia 3090 GPUs and Python 3.8:

import flair

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

flair.device = 'cuda:0'

# are you training a forward or backward LM?
is_forward_lm = True

# get your corpus, process forward and at the character level
dictionary = Dictionary.load('chars')

# get corpus
corpus = TextCorpus("resources/tasks/penn_lm/",
                    dictionary,
                    is_forward_lm,
                    character_level=True,
                    random_case_flip=True,
                    )

language_model = LanguageModel(dictionary,
                               is_forward_lm=True,
                               hidden_size=128,
                               nlayers=1,
                               )
print(language_model)

# train your language model
trainer = LanguageModelTrainer(accelerator="gpu", devices="auto", strategy='ddp')

# train your language model
trainer.train(language_model,
              corpus,
              "resources/LMs/penn-gpu-auto-ddp",
              sequence_length=50,
              mini_batch_size=10,
              learning_rate=10,
              patience=10,
              max_epochs=20,
              checkpoint=True,
              num_workers=4,
              )

print(language_model.generate_text())

It automatically takes all three GPUs and runs them at 100% (great), but at the end of the first epoch nothing more happens. Here is the last output I get:

2022-10-31 09:34:56,368 read text file with 2000 lines
2022-10-31 09:34:56,369 shuffled
2022-10-31 09:34:56,372 Sequence length is 50
2022-10-31 09:34:56,373 Split 1  - (09:34:56)
2022-10-31 09:34:56,379 read text file with 2000 lines
2022-10-31 09:34:56,380 shuffled
2022-10-31 09:34:56,382 read text file with 2000 lines
2022-10-31 09:34:56,383 shuffled
2022-10-31 09:34:56,384 read text file with 2000 lines
2022-10-31 09:34:56,385 shuffled
2022-10-31 09:34:56,386 read text file with 2000 lines
2022-10-31 09:34:56,386 Sequence length is 50
2022-10-31 09:34:56,386 shuffled
2022-10-31 09:34:56,387 Split 1  - (09:34:56)
2022-10-31 09:34:56,395 read text file with 2000 lines
2022-10-31 09:34:56,396 read text file with 2000 lines
2022-10-31 09:34:56,396 shuffled
2022-10-31 09:34:56,396 shuffled
2022-10-31 09:34:56,399 read text file with 2000 lines
2022-10-31 09:34:56,399 shuffled
2022-10-31 09:34:56,419 read text file with 42068 lines
2022-10-31 09:34:56,436 shuffled
2022-10-31 09:34:59,746 | split   1/ 23 |   100/  483 batches | ms/batch 33.72 | loss 3.0842 | ppl 21.8503
2022-10-31 09:34:59,746 | split   1/ 23 |   100/  478 batches | ms/batch 33.59 | loss 3.0945 | ppl 22.0756
2022-10-31 09:34:59,746 | split   1/ 23 |   100/  480 batches | ms/batch 34.54 | loss 3.1399 | ppl 23.1021
2022-10-31 09:34:59,927 | split   1/ 23 |   200/  483 batches | ms/batch  1.80 | loss 2.2430 | ppl 9.4220
2022-10-31 09:34:59,927 | split   1/ 23 |   200/  478 batches | ms/batch  1.80 | loss 2.2366 | ppl 9.3619
2022-10-31 09:34:59,927 | split   1/ 23 |   200/  480 batches | ms/batch  1.80 | loss 2.2417 | ppl 9.4089
2022-10-31 09:35:00,102 | split   1/ 23 |   300/  483 batches | ms/batch  1.75 | loss 2.0134 | ppl 7.4888
2022-10-31 09:35:00,102 | split   1/ 23 |   300/  478 batches | ms/batch  1.75 | loss 2.0211 | ppl 7.5464
2022-10-31 09:35:00,102 | split   1/ 23 |   300/  480 batches | ms/batch  1.75 | loss 2.0382 | ppl 7.6765
2022-10-31 09:35:00,278 | split   1/ 23 |   400/  480 batches | ms/batch  1.76 | loss 1.8816 | ppl 6.5641
2022-10-31 09:35:00,278 | split   1/ 23 |   400/  478 batches | ms/batch  1.76 | loss 1.8837 | ppl 6.5780
2022-10-31 09:35:00,278 | split   1/ 23 |   400/  483 batches | ms/batch  1.77 | loss 1.8589 | ppl 6.4168

On a single GPU on my local machine it works fine.

@alanakbik (Collaborator)

Quick update: it trains successfully if the train folder only contains a single split, so maybe the data loader messes up the execution.

@dchaplinsky commented Oct 31, 2022 via email

@alanakbik (Collaborator)

@dchaplinsky from what I can tell, it runs a mini-batch on each GPU and then gathers the gradients for a single update. So in the same amount of time, you double (with two GPUs) your effective mini-batch size. This seems to be primarily a way to achieve larger mini-batch sizes by using multiple GPUs.

I think there is currently something wrong with the way the splits are distributed across GPUs. I would expect the dataset to be partitioned so that each GPU gets different splits, but currently it looks like each GPU gets the full dataset, which is why there is no speedup.
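One possible way to shard the splits per process, sketched below (this is not what the PR currently does; split_paths is a placeholder for however the corpus exposes its split files, while global_rank and world_size are the attributes LightningLite provides):

def splits_for_this_process(split_paths, global_rank, world_size):
    # round-robin assignment: each DDP process trains on a disjoint subset of
    # the corpus splits instead of iterating over the full set
    return split_paths[global_rank::world_size]


# inside the LightningLite-based trainer (illustrative):
# my_splits = splits_for_this_process(split_paths, self.global_rank, self.world_size)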

@dchaplinsky commented Nov 1, 2022 via email

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the wontfix (This will not be worked on) label on Mar 18, 2023
@dchaplinsky

Please keep it open and make it happen!

The stale bot removed the wontfix (This will not be worked on) label on Mar 18, 2023
@Borda commented Mar 18, 2023

@aniketmaurya could you update it with 2.0? 🐿️

@aniketmaurya (Author) commented Mar 21, 2023

> @aniketmaurya could you update it with 2.0? 🐿️

This branch was around ~500 commits behind! I am updating it with the refactored new LightningLite, now called Fabric.

@aniketmaurya (Author)

Created a new PR with Fabric (the revamped LightningLite). It is much cleaner!!
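For readers comparing the two APIs, here is a minimal self-contained sketch of a Fabric loop (illustrative only, not the new PR's code):

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices="auto")  # e.g. strategy="ddp" for multi-GPU
fabric.launch()

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
loader = fabric.setup_dataloaders(loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    fabric.backward(loss)  # replaces loss.backward()
    optimizer.step()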

Successfully merging this pull request may close these issues:

Scale Model Training with LightningLite