
tbptt doesn't work with validation #15057

Closed

DavidHoessle opened this issue Oct 10, 2022 · 2 comments
Labels
lightningmodule pl.LightningModule loops Related to the Loop API question Further information is requested

Comments

@DavidHoessle

Bug description

The tbptt_split_batch function doesn't seem to be called before validation_step, and hiddens also doesn't seem to be passed to the validation_step method of the lightning module.
Hence, validating a model with a validation_step doesn't seem possible, even though training can be achieved using tbptt for time-series data that's too big to fit on the GPU / where the loss calculation is too big to fit on the GPU.

My current workaround is as follows:

def validation_step(self, batch, batch_idx):
    splits = self.tbptt_split_batch(batch, self.truncated_bptt_steps)
    hiddens = None
    losses = []

    for chunk in splits:
        x = chunk[0]  # x: [batch, seq, features]
        y_hat, hiddens = self(x, hiddens)  # y_hat: [batch, seq, features]
        loss = F.l1_loss(y_hat, x, reduction='sum')  # autoencoder model -> y_hat should equal x
        losses.append(loss)

    loss = sum(l / len(losses) for l in losses)  # mean over chunk losses
    self.log("val_loss", loss)
    return {"loss": loss}

However, it would be great to have the same behaviour as with the training_step (if truncated_bptt_steps is defined, the chunks are passed to validation_step along with the hiddens from the last chunk).
I'm also not quite sure whether my loss aggregation matches the one implemented for the training_step losses of the tbptt chunks (mean).
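For what it's worth, the sum-of-fractions aggregation in the workaround is algebraically the same as taking the plain mean of the chunk losses. A minimal pure-Python sketch (the loss values are made up for illustration):

```python
# Hypothetical per-chunk losses, as plain floats for illustration.
chunk_losses = [356.0, 412.5, 298.25]

# The workaround's aggregation: sum of each loss divided by the count...
workaround = sum(l / len(chunk_losses) for l in chunk_losses)

# ...is identical to the plain mean of the chunk losses.
mean = sum(chunk_losses) / len(chunk_losses)

assert abs(workaround - mean) < 1e-9
```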

How to reproduce the bug

1. Use a time-series dataset where a single batch/sample doesn't fit in memory, or where backward can't be performed because the sequence length is too big.
2. Define a model with `truncated_bptt_steps` set.
3. Call `trainer.fit(train_dl, val_dl)`.
4. See an Out of Memory exception because `tbptt_split_batch` isn't called on:
  a. the `sanity_check`
  b. the `validation_step`
5. Remove `validation_step` from the model and the model runs without problems, because `tbptt_split_batch` is called in the train loop.
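For context, the splitting that the train loop applies amounts to slicing each element of the batch along the time (second) dimension in windows of `truncated_bptt_steps`. A minimal pure-Python sketch, with nested lists standing in for tensors and `split_size` standing in for `truncated_bptt_steps` (not the actual Lightning implementation):

```python
def tbptt_split(batch, split_size):
    """Split each [batch, seq, ...] element of `batch` into chunks of
    `split_size` along the sequence (second) dimension."""
    seq_len = len(batch[0][0])  # length of the time dimension
    splits = []
    for start in range(0, seq_len, split_size):
        # Slice every element of the batch at the same time window.
        splits.append([[sample[start:start + split_size] for sample in elem]
                       for elem in batch])
    return splits

# One batch element `x` with batch size 1 and sequence length 5.
x = [[[0], [1], [2], [3], [4]]]
chunks = tbptt_split([x], split_size=2)
print(len(chunks))  # 3 chunks, covering time steps [0:2], [2:4], [4:5]
```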

Error messages and logs

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name    | Type | Params
---------------------------------
0 | encoder | RNN  | 3.4 K 
1 | decoder | RNN  | 324   
---------------------------------
3.7 K     Trainable params
0         Non-trainable params
3.7 K     Total params
0.015     Total estimated model params size (MB)
/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1892: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 0:  50%|██████████████████████████████████████████████████████████                                                          | 1/2 [00:47<00:47, 47.35s/it, loss=863, split_idx=33, v_num=, train_step_loss (l1)=356.0Traceback (most recent call last):                                                                                                                                                                     | 0/1 [00:00<?, ?it/s]
  File "/home/username/cwd/scripts/./train.py", line 94, in <module>
    main(args)
  File "/home/username/cwd/scripts/./train.py", line 71, in main
    train(trainer_args=args, model=model, train_dl=train_dataloader, val_dl=validation_dataloader)
  File "/home/username/cwd/scripts/./train.py", line 56, in train
    trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 370, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/username/cwd/scripts/utils/deeplearning/models/autoencoder.py", line 53, in validation_step
    y_hat, hiddens = self(x, hiddens)  # y_hat: [batch, seq, features]
  File "/home/username/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/username/cwd/scripts/utils/deeplearning/models/autoencoder.py", line 27, in forward
    _, enc_hiddens = self.encoder(x, enc_hiddens)  # hidden: [layer, batch, hidden]
  File "/home/username/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/username/venv/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 471, in forward
    result = _VF.rnn_tanh(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA out of memory. Tried to allocate 58.22 GiB (GPU 0; 22.20 GiB total capacity; 6.67 GiB already allocated; 12.59 GiB free; 8.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0:  50%|█████     | 1/2 [01:18<01:18, 78.38s/it, loss=863, split_idx=33, v_num=, train_step_loss (l1)=356.0]

Environment

- Python: 3.9
- PyTorch Lightning: 1.7.7
- Pytorch: 1.12.1+cu116

Environment: AWS EC2 g5.16xlarge
- OS: Amazon Linux 2
- CUDA: 11.6
- GPU: NVIDIA A10G Tensor Core GPU

More info

No response

@DavidHoessle DavidHoessle added the needs triage Waiting to be triaged by maintainers label Oct 10, 2022
@awaelchli
Contributor

@DavidHoessle There is no back-propagation/optimization happening in the validation step, hence there is no TBPTT. Your current "workaround" is fine, but I'm wondering what it does for you vs. just running the full sequence through the model.
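One way to see the point above: without gradients, feeding the sequence in chunks while carrying the hidden state forward produces the same outputs as a single full-sequence pass; the chunking only helps memory. A toy pure-Python recurrence (a made-up additive "cell", not the actual autoencoder) illustrating the equivalence:

```python
def rnn_step(h, x):
    # Toy recurrent cell: new hidden = old hidden + input (stands in for an RNN).
    return h + x

def run(seq, h=0):
    # Run the recurrence over a (sub)sequence, returning outputs and final hidden.
    outs = []
    for x in seq:
        h = rnn_step(h, x)
        outs.append(h)
    return outs, h

seq = [1, 2, 3, 4, 5, 6]

# Full-sequence pass.
full_outs, _ = run(seq)

# Chunked pass (windows of 2), carrying the hidden state between chunks.
chunk_outs, h = [], 0
for start in range(0, len(seq), 2):
    outs, h = run(seq[start:start + 2], h)
    chunk_outs.extend(outs)

assert chunk_outs == full_outs  # identical outputs, chunked or not
```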

@awaelchli awaelchli added question Further information is requested loops Related to the Loop API lightningmodule pl.LightningModule and removed needs triage Waiting to be triaged by maintainers labels Oct 11, 2022
@stale

stale bot commented Nov 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
