return_result role in training_batch_loop.py #9332
-
Hello, I have been trying to debug an OOM that appears after a few iterations of training my model. What I see is that `return_result` keeps the computation graph alive, since it contains the loss. So I wonder: what is the role of this variable? Also, where is the graph released? I could not trace it any further than `_training_step_and_backward_closure`. I don't understand why my model runs fine for a few iterations and only then memory starts to increase.
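To illustrate the mechanism I mean, here is a minimal sketch (not Lightning's actual internals; the training step and container below are hypothetical stand-ins): any reference to a non-detached loss tensor keeps the whole autograd graph reachable, while `detach()` keeps only the value.

```python
import torch

def training_step(model, x):
    # The returned loss carries grad_fn, i.e. a reference to the graph.
    return (model(x) ** 2).mean()

model = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)

loss = training_step(model, x)
assert loss.grad_fn is not None  # graph is attached to the loss

loss.backward()  # gradients computed, but `loss` still references the graph

# Storing the raw loss across iterations (as a results container might)
# keeps each iteration's graph objects alive, so memory grows:
retained = [loss]

# Detaching drops the graph reference, keeping only the scalar value:
safe = [loss.detach()]
assert safe[0].grad_fn is None
```

So if `return_result` stores the loss without detaching it, that would match the memory growth I observe.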
-
Hi @Jovp! Are you running on master? We are currently updating our internals regarding the loops there. If not, could you give master a try? Best, cc @awaelchli, who designed the loops/closures.
-
Hello Justus!
Indeed, I have seen during my investigations that master changed quite a lot for that part of the code.
For the specific concern that I described, see:
#9343 and
#9336
Pull request #9336 seems to address exactly the issue I was referring to.
Many thanks to the team for the great responsiveness and great work!
Julien