return_result role in training_batch_loop.py #9332
-
Hello, I have been trying to debug an OOM that appears after a few iterations of training my model. What I see is that `return_result` keeps the computation graph alive, since it contains the loss. So I wonder: what is the role of this variable? Also, where is the graph released? I could not trace it any further than `_training_step_and_backward_closure`. I don't understand why my model runs fine for a few iterations and only then memory starts to increase.
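To illustrate the mechanism I mean, here is a minimal sketch (not Lightning's actual internals; the training step and container below are hypothetical stand-ins): any reference to a non-detached loss tensor keeps the whole autograd graph reachable, while `detach()` keeps only the value.

```python
import torch

def training_step(model, x):
    # The returned loss carries grad_fn, i.e. a reference to the graph.
    return (model(x) ** 2).mean()

model = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)

loss = training_step(model, x)
assert loss.grad_fn is not None  # graph is attached to the loss

loss.backward()  # gradients computed, but `loss` still references the graph

# Storing the raw loss across iterations (as a results container might)
# keeps each iteration's graph objects alive, so memory grows:
retained = [loss]

# Detaching drops the graph reference, keeping only the scalar value:
safe = [loss.detach()]
assert safe[0].grad_fn is None
```

So if `return_result` stores the loss without detaching it, that would match the memory growth I observe.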
-
Hi @Jovp! Are you running on master? We are currently updating our internals regarding the loops there. If not, could you give master a try? Best, cc @awaelchli, who designed the loops/closures.
-
Hello Justus!
Indeed, I have seen during my investigations that master changed quite a lot for that part of the code.
For the specific concern that I described, see:
#9343 and
#9336
Pull request #9336 seems to address exactly the issue I was referring to.
Many thanks to the team for the great responsiveness and great work!
Julien