
tbptt doesn't work with validation #15057

Closed

DavidHoessle opened this issue Oct 10, 2022 · 2 comments
Labels
lightningmodule pl.LightningModule loops Related to the Loop API question Further information is requested

Comments

@DavidHoessle

Bug description

The tbptt_split_batch function doesn't seem to be called before validation_step, and hiddens also doesn't seem to be passed to the validation_step method of the lightning module.
Hence, validating a model with a validation_step doesn't seem possible, even though training can be achieved using tbptt for time-series data that's too big to fit on the GPU / where the loss calculation is too big to fit on the GPU.

My current workaround is as follows:

def validation_step(self, batch, batch_idx):
    splits = self.tbptt_split_batch(batch, self.truncated_bptt_steps)
    hiddens = None
    losses = []

    for chunk in splits:
        x = chunk[0]  # x: [batch, seq, features]
        y_hat, hiddens = self(x, hiddens)  # y_hat: [batch, seq, features]
        loss = F.l1_loss(y_hat, x, reduction='sum')  # autoencoder model -> y_hat should equal x
        losses.append(loss)

    loss = sum(l / len(losses) for l in losses)  # mean over chunk losses
    self.log("val_loss", loss)
    return {"loss": loss}

However, it would be great to have the same behaviour as with the training_step (if truncated_bptt_steps is defined, the chunks are passed to validation_step along with the hiddens from the last chunk).
I'm also not quite sure whether my loss aggregation matches the one implemented for the training_step losses of the tbptt chunks (mean).
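For what it's worth, the sum-of-fractions aggregation in the workaround is algebraically the same as taking the plain mean of the chunk losses. A minimal pure-Python sketch (the loss values are made up for illustration):

```python
# Hypothetical per-chunk losses, as plain floats for illustration.
chunk_losses = [356.0, 412.5, 298.25]

# The workaround's aggregation: sum of each loss divided by the count...
workaround = sum(l / len(chunk_losses) for l in chunk_losses)

# ...is identical to the plain mean of the chunk losses.
mean = sum(chunk_losses) / len(chunk_losses)

assert abs(workaround - mean) < 1e-9
```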

How to reproduce the bug

1. Use a time-series dataset where a single batch/sample doesn't fit in memory, or where backward can't be performed because the sequence length is too big.
2. Define a model with `truncated_bptt_steps` set.
3. Call `trainer.fit(train_dl, val_dl)`.
4. See an Out of Memory exception because `tbptt_split_batch` isn't called on:
  a. the `sanity_check`
  b. the `validation_step`
5. Remove `validation_step` from the model and the model runs without problems, because `tbptt_split_batch` is called in the train loop.
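For context, the splitting that the train loop applies amounts to slicing each element of the batch along the time (second) dimension in windows of `truncated_bptt_steps`. A minimal pure-Python sketch, with nested lists standing in for tensors and `split_size` standing in for `truncated_bptt_steps` (not the actual Lightning implementation):

```python
def tbptt_split(batch, split_size):
    """Split each [batch, seq, ...] element of `batch` into chunks of
    `split_size` along the sequence (second) dimension."""
    seq_len = len(batch[0][0])  # length of the time dimension
    splits = []
    for start in range(0, seq_len, split_size):
        # Slice every element of the batch at the same time window.
        splits.append([[sample[start:start + split_size] for sample in elem]
                       for elem in batch])
    return splits

# One batch element `x` with batch size 1 and sequence length 5.
x = [[[0], [1], [2], [3], [4]]]
chunks = tbptt_split([x], split_size=2)
print(len(chunks))  # 3 chunks, covering time steps [0:2], [2:4], [4:5]
```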

Error messages and logs

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name    | Type | Params
---------------------------------
0 | encoder | RNN  | 3.4 K 
1 | decoder | RNN  | 324   
---------------------------------
3.7 K     Trainable params
0         Non-trainable params
3.7 K     Total params
0.015     Total estimated model params size (MB)
/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1892: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 0:  50%|██████████████████████████████████████████████████████████                                                          | 1/2 [00:47<00:47, 47.35s/it, loss=863, split_idx=33, v_num=, train_step_loss (l1)=356.0Traceback (most recent call last):                                                                                                                                                                     | 0/1 [00:00<?, ?it/s]
  File "/home/username/cwd/scripts/./train.py", line 94, in <module>
    main(args)
  File "/home/username/cwd/scripts/./train.py", line 71, in main
    train(trainer_args=args, model=model, train_dl=train_dataloader, val_dl=validation_dataloader)
  File "/home/username/cwd/scripts/./train.py", line 56, in train
    trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 370, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/username/cwd/scripts/utils/deeplearning/models/autoencoder.py", line 53, in validation_step
    y_hat, hiddens = self(x, hiddens)  # y_hat: [batch, seq, features]
  File "/home/username/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/username/cwd/scripts/utils/deeplearning/models/autoencoder.py", line 27, in forward
    _, enc_hiddens = self.encoder(x, enc_hiddens)  # hidden: [layer, batch, hidden]
  File "/home/username/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 471, in forward
    result = _VF.rnn_tanh(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA out of memory. Tried to allocate 58.22 GiB (GPU 0; 22.20 GiB total capacity; 6.67 GiB already allocated; 12.59 GiB free; 8.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0:  50%|█████     | 1/2 [01:18<01:18, 78.38s/it, loss=863, split_idx=33, v_num=, train_step_loss (l1)=356.0]

Environment

- Python: 3.9
- PyTorch Lightning: 1.7.7
- Pytorch: 1.12.1+cu116

Environment: AWS EC2 g5.16xlarge
- OS: Amazon Linux 2
- CUDA: 11.6
- GPU: NVIDIA A10G Tensor Core GPU

More info

No response

@DavidHoessle DavidHoessle added the needs triage Waiting to be triaged by maintainers label Oct 10, 2022
@awaelchli
Contributor

@DavidHoessle There is no back-propagation/optimization happening in the validation step, hence there is no TBPTT. Your current "workaround" is fine, but I'm wondering what it does for you vs. just running the full sequence through the model.
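One way to see the point above: without gradients, feeding the sequence in chunks while carrying the hidden state forward produces the same outputs as a single full-sequence pass; the chunking only helps memory. A toy pure-Python recurrence (a made-up additive "cell", not the actual autoencoder) illustrating the equivalence:

```python
def rnn_step(h, x):
    # Toy recurrent cell: new hidden = old hidden + input (stands in for an RNN).
    return h + x

def run(seq, h=0):
    # Run the recurrence over a (sub)sequence, returning outputs and final hidden.
    outs = []
    for x in seq:
        h = rnn_step(h, x)
        outs.append(h)
    return outs, h

seq = [1, 2, 3, 4, 5, 6]

# Full-sequence pass.
full_outs, _ = run(seq)

# Chunked pass (windows of 2), carrying the hidden state between chunks.
chunk_outs, h = [], 0
for start in range(0, len(seq), 2):
    outs, h = run(seq[start:start + 2], h)
    chunk_outs.extend(outs)

assert chunk_outs == full_outs  # identical outputs, chunked or not
```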

@awaelchli awaelchli added question Further information is requested loops Related to the Loop API lightningmodule pl.LightningModule and removed needs triage Waiting to be triaged by maintainers labels Oct 11, 2022
@stale

stale bot commented Nov 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
