[RFC] Deprecate direct support for truncated backprop through time #8732

ananthsub · 2021-08-05T03:56:19Z

🚀 Feature

Deprecate direct framework support for truncated backprop through time

Motivation

We are auditing the Lightning components and APIs to assess opportunities for improvements:

Lightning today offers specific support for truncated backpropagation through time. This manifests in:

The LightningModule having a property for this: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/core/lightning.py#L256-L267
which results in extra arguments passed to the training step: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/core/lightning.py#L672-L674
along with helpers for how to split batches: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/core/lightning.py#L1651-L1712
all of which receives special handling in the training loop: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/loops/batch/training_batch_loop.py#L448-L459

However, this is highly specific for a particular use case. Moreover, with manual optimization, users have this flexibility to more finely control the batch splitting, loss, and optimizer step behavior. Put another way, users can already train with tbptt techniques without using this property at all. This points out that we no longer need this direct support on the core interface.

@PyTorchLightning/core-contributors

Pitch

Benefits:

Philosophically, we recommend users to use a more generic interface through which they can implement this logic. Importantly, this sets a precedent the framework is not and should not be on the critical path for every type of training technique. There are an endless supply, and the framework can easily buckle trying to directly support all of these. This is also vital because training techniques come and go, and the framework can quickly accrue huge debt trying to support features without clear value to users. For example, BPTT was critical with LSTMs, but if more users are shifting to transformers, this tool becomes less essential for users.
Tactically, we can simplify the framework & interfaces offered by deprecating these. this means less maintenance and risk of bugs

Approach:

Mark these hooks as deprecated in v1.5, and remove support in v1.7
In tutorials, we can demonstrate how to train these models with manual optimization to demonstrate that this is not a strict loss of functionality
We can still offer batch-splitting as a utility (outside of the LightningModule) if users find this helpful

Alternatives

Keep as is

Additional context

If you enjoy Lightning, check out our other projects! ⚡

_{Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning

Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.}

cc @Borda @tchaton @justusschock @awaelchli @rohitgr7 @akihironitta

carmocca · 2021-08-05T14:40:07Z

In tutorials, we can demonstrate how to train these models with manual optimization to demonstrate that this is not a strict loss of functionality

If the alternative is manual optimization, they would lose the functionality of fault-tolerance at the split level (if and when we have it). This is because manual optimization users would need to handle the progress tracking updates themselves.

If this was instead extracted into a separate optional loop then that would work.

ananthsub · 2021-08-05T19:22:47Z

from @awaelchli

Regardless, I believe it would be interesting to see a lightning module rnn manual-implementation without the built-in TBTT. If we have that, we can examine the amount of boilerplate required and get a better picture of what impact this deprecation has. Perhaps I could help here. I could try to contribute an example here.

+1, that sounds like a reasonable next step

ananthsub · 2021-08-07T16:22:17Z

@anhnht3 since you thumbs-downed the RFC, could you describe how you're relying on tbptt in Lightning today?

anhnht3 · 2021-08-07T17:42:58Z

As a user, I often need to employ tbptt for RNN models. Let's say tbptt is going to be deprecated, I am not sure it's easy to implement it on my own without understanding Lightning's internals.

ananthsub · 2021-08-07T18:05:39Z

As a user, I often need to employ tbptt for RNN models. Let's say tbptt is going to be deprecated, I am not sure it's easy to implement it on my own without understanding Lightning's internals.

Could you share some example code for how you're currently training?

williamFalcon · 2021-08-07T18:15:25Z

I’m also curious here why we want to deprecate tbptt? this would make PL useless for sequencing work. in fact, it would also make it impossible.

and MAYBE (big maybe), it might be possible IF you understand PL at the depth of a core contributor. but that’s an unrealistic ask for 99.9% of users.

this also goes against the “no need to learn a new framework” core value we have.

ananthsub · 2021-08-07T23:34:14Z

@williamFalcon thanks for taking a look. The broader theme and question I want to answer is how we can support more flavors of training steps in a consistent manner. For flavors of training steps we have:

different input arguments
different output return value expectations

Which affect the contract between the trainer and the lightning module because these result in slightly different loop structures.

Here are some examples:

using a single optimizer vs multiple optimizers
-- by default we loop over optimizers 1 by 1 and call the training step once per optimizer
tbptt
-- we call the training step once per split batch
manual optimization
-- we call the training step once and users are expected to handle the bwd, optimizer step, and extensions like grad accumulation & mixed precision

And incoming requests:

Access to data iterator in training step to support inter-batch parallelism: Add a flavor of training_step that allows for expressing inter-batch parallelism #8316
Yielding losses from the training step Support yield in training_step for automatic optimization #7968

What I'd like to accomplish is a way we can overall widen Lightning's scope by supporting these various flavors. We can make clear that tbptt is one possible extension of the core Lightning, as well as lay a consistent pattern for how more extensions can be added in a way that doesn't bloat the core interfaces. Otherwise users looking at core/lightning.py can quickly be overwhelmed in terms of trying to figure out what is supported vs what they need to implement.

I think it's perfectly reasonable for us to be able to support these as extension loops inside of the Trainer in a way that abstracts this logic from the users, especially since these are quite targeted and aren't completely at odds with the rest of the loop structure.

I have some preliminary ideas here, but I need to do more exploration work on this: https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#heading=h.7ie80o6nh4et

yifuwang · 2021-08-09T09:35:25Z

Tactically, we can simplify the framework & interfaces offered by deprecating these. this means less maintenance and risk of bugs

I've been thinking about this as well. Currently, the logic for handling tbptt is on the critical path for all use cases even if they don't need it. In fact, (IIUC) it is also the sole reason that TrainingBatchLoop is a Loop (the support for multiple optimizers is not organized as a Loop). I definitely agree with you that there's opportunity for simplifying the framework's internal architecture.

However, we may be able to achieve the same architectural simplification by further decomposing TrainingBatchLoop, without dropping support for tbptt.

Currently, all use cases go through the same loop structure: FitLoop -> TrainingEpochLoop -> TrainingBatchLoop. The first two have rather clear responsibilities, while TrainingBatchLoop handles the combination of tbptt/non-tbptt, single optimizer/multiple optimizers, and automatic optimization/manual optimization. I feel the framework can becomes much simpler, more composable, and more flexible if the loop structure is selected based on the requested features. For example,

for tbptt + multiple optimizer + automatic optimizer, the loop structure can be
FitLoop -> TrainingEpochLoop -> TbpttLoop -> OptimizerLoop -> AutomaticOptimizationFlow

for non-tbptt + multiple optimizer + automatic optimizer:
FitLoop -> TrainingEpochLoop -> OptimizerLoop -> AutomaticOptimizationFlow

for non-tbptt + single optimizer + automatic optimizer:
FitLoop -> TrainingEpochLoop -> AutomaticOptimizationFlow

for non-tbptt + single optimizer + manual optimizer:
FitLoop -> TrainingEpochLoop -> ManualOptimizationFlow

On a side note, I feel it's worthwhile introducing an internal abstraction for optimization flows (for encapsulating the logic that actually calls training_step) and always attached it to the end of the loop structure (currently the inner most Loop invokes training_step). IMHO this would improve the maintainability of different optimization flows because (1) we envision supporting more of them (2) the complexity of different flows vary greatly (e.g. the flow for automatic optimization is much more complicated than that of manual optimization).

awaelchli · 2021-08-09T09:46:35Z

@yifuwang I've also thought about this. I have a WIP for extracting the optimizer loop for automatic optimization.
For selecting the structure of loop, it remains a question how to best do it since it depends on the properties of Trainer and LightningModule, so Lightning needs to build the loop at the time of calling fit, test, etc. when the model becomes available.

yifuwang · 2021-08-09T09:54:22Z

@awaelchli

it remains a question how to best do it since it depends on the properties of Trainer and LightningModule

Right. It's probably unavoidable to have some not-as-elegant code for interpreting the LightningModule/Trainer configuration and translating it into the loop structure. But IMHO having them contained is a good first step (as opposed to checking automatic_optimization all over the place).

On a side note, I feel it's worthwhile introducing an internal abstraction for optimization flows (for encapsulating the logic that actually calls training_step) and always attached it to the end of the loop structure (currently the inner most Loop invokes training_step). IMHO this would improve the maintainability of different optimization flows because (1) we envision supporting more of them (2) the complexity of different flows vary greatly (e.g. the flow for automatic optimization is much more complicated than that of manual optimization).

Curious what's your thoughts on this ^

awaelchli · 2021-08-09T12:52:40Z

Curious what's your thoughts on this ^

Generally yes I agree and I like it. Would need more details as to what responsibilities this abstraction has.
I have been working slightly in this direction with a abstraction for our closure handling internally with a LightningClosure here: #8642 This can simplify our logic in TrainingBatchLoop for handling the code paths for manual vs. auto optimization.

Perhaps the optimization flow abstraction you suggest could handle the creation of this closure as well.

stale · 2021-09-19T00:46:14Z