Potential Memory Leak Error #284
Comments
Actually, I was busy with another project, so I closed the issue without having solved it.
Thanks for raising the issue @pandeydeep9 and @Phoveran. These leaks are worrisome. Could you share more about your setup? Which GPU, CPU, and versions of Python, PyTorch, and learn2learn? It seems to be hardware-dependent since @nightlessbaron wasn’t able to reproduce the bug on Colab. Also, are you running the mini-imagenet script as-is?
I reduced the
I have the same issue, even when I run "maml_miniimagenet.py" on an A5000 with 24GB
Thanks for the additional feedback. Are you also using PyTorch v.1.10? And does commenting out the validation step also fix the memory leak?
Yes, I use PyTorch 1.10.0, CUDA Version: 11.2, Python 3.8.12,
My setting: python: 3.9.7
Thanks for the extra info, I wonder if the issue cropped up with PyTorch 1.10 on CUDA 11+. As a temporary fix, does changing
Yes, adding first_order=True on l. 112 solves the leak problem. I also believe this should give the expected results, since first-order MAML can be used during the validation/test phases without changing the outcome (i.e. there is no need to track gradients through the MAML update during test/validation). Thanks
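For readers following along, the effect of the first-order flag can be illustrated without learn2learn. The sketch below is plain PyTorch and assumes that `first_order=True` corresponds to computing the inner-loop gradient with `create_graph=False`, as in typical MAML implementations. The tensor names and the toy loss are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Toy regression problem (hypothetical data, for illustration only).
w = torch.randn(3, requires_grad=True)
x = torch.randn(5, 3)
y = torch.randn(5)


def inner_loss(weights):
    """Mean squared error of a linear model."""
    return ((x @ weights - y) ** 2).mean()


# Second-order style: the gradient itself stays attached to a graph.
(g_second,) = torch.autograd.grad(inner_loss(w), w, create_graph=True)

# First-order style: the gradient is a plain tensor with no history.
(g_first,) = torch.autograd.grad(inner_loss(w), w, create_graph=False)

# Both adapted weights have the same values, but only the second-order
# one keeps the graph of the gradient computation alive.
adapted_second = w - 0.1 * g_second
adapted_first = w - 0.1 * g_first

print("second-order gradient has grad_fn:", g_second.grad_fn is not None)
print("first-order gradient has grad_fn:", g_first.grad_fn is not None)
```

Since validation and test only need the adapted predictions, the graph kept by the second-order variant is not needed there.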
We hit the same problem. Our setup is PyTorch 1.10.0, Python 3.9.5, CUDA 11.5, Tesla M40 (24 GB). We are glad to see this issue reported, since we debugged our code repeatedly and had no idea what was causing CUDA memory usage to grow over validation or test iterations.
I was facing the same issue, but managed to solve it by downgrading PyTorch from version 1.10 to 1.9. I was using the following setup: learn2learn 0.1.6. With this setup, memory usage kept increasing over epochs until an out-of-memory error occurred. However, when using PyTorch 1.9, memory usage stabilizes.
Honestly, I tried to use MAML for fine-tuning a T5 transformer. Before adding "first_order=True", I could only run 2 tps. However, this did not fix my problem: after adding this parameter, I could run 4 tps, but still got a memory leak. I guess there are still some problems that are exposed by huge networks such as transformers. learn2learn 0.1.6
The memory leak seems to have been introduced in PyTorch 1.10. @sjtugzx do you also see leaks with T5 on PyTorch 1.9? I haven't had time to investigate it yet, so help is welcome.
I have a suggestion for a potential fix. It is a little bit hacky though. In my observations, the key problem leading to the memory leak seems to be that the compute graph for the gradient update is being created, even when In my code, what I've done to get rid of this extra unneeded memory usage at evaluation time is to add a

    # Update the module
    self.module = maml_update(self.module, self.lr, gradients)

becomes

    # Update the module
    if eval:
        with torch.no_grad():
            self.module = maml_update(self.module, self.lr, gradients)
        for p in self.module.parameters():
            p.requires_grad = True
    else:
        self.module = maml_update(self.module, self.lr, gradients)

I haven't investigated this in detail so I'm not sure if this is the best way to proceed, but let me know if this seems promising and if I should investigate further, and maybe even make a pull request.
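The mechanism behind this suggestion can be checked on a single tensor. The following is a minimal sketch in plain PyTorch, not learn2learn's code: `p`, `g`, and the step size are hypothetical stand-ins for a parameter, its gradient, and the learning rate.

```python
import torch

# Hypothetical parameter and its gradient, computed with a graph attached
# (as a second-order inner loop would do).
p = torch.ones(3, requires_grad=True)
(g,) = torch.autograd.grad((p ** 2).sum(), p, create_graph=True)

# Training-style update: the result records how it was computed.
tracked = p - 0.1 * g

# Evaluation-style update: no graph is recorded inside no_grad().
with torch.no_grad():
    detached = p - 0.1 * g

# The result is a leaf tensor, so gradient tracking can be re-enabled
# for any further adaptation steps.
detached.requires_grad = True

print("tracked has grad_fn:", tracked.grad_fn is not None)
print("detached has grad_fn:", detached.grad_fn is not None)
print("detached is a leaf:", detached.is_leaf)
```

The two updates produce the same values; the difference is only whether the update keeps a reference to the graph that produced the gradient.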
For people following, @kzhang2 and I have been discussing on slack and we came up with a fix. Expect a PR + release in the next 2 weeks. Meanwhile, the fix is to update the

    def update_module(module, updates=None, memo=None):
        r"""
        [[Source]](https://github.com/learnables/learn2learn/blob/master/learn2learn/utils.py)

        **Description**

        Updates the parameters of a module in-place, in a way that preserves differentiability.

        The parameters of the module are swapped with their update values, according to:
        \[
        p \gets p + u,
        \]
        where \(p\) is the parameter, and \(u\) is its corresponding update.

        **Arguments**

        * **module** (Module) - The module to update.
        * **updates** (list, *optional*, default=None) - A list of gradients for each parameter
            of the model. If None, will use the tensors in .update attributes.

        **Example**
        ~~~python
        error = loss(model(X), y)
        grads = torch.autograd.grad(
            error,
            model.parameters(),
            create_graph=True,
        )
        updates = [-lr * g for g in grads]
        l2l.update_module(model, updates=updates)
        ~~~
        """
        if memo is None:
            memo = {}
        if updates is not None:
            params = list(module.parameters())
            if not len(updates) == len(list(params)):
                msg = 'WARNING:update_module(): Parameters and updates have different length. ('
                msg += str(len(params)) + ' vs ' + str(len(updates)) + ')'
                print(msg)
            for p, g in zip(params, updates):
                p.update = g

        # Update the params
        for param_key in module._parameters:
            p = module._parameters[param_key]
            if p is not None and hasattr(p, 'update') and p.update is not None:
                if p in memo:
                    module._parameters[param_key] = memo[p]
                else:
                    updated = p + p.update
                    p.update = None
                    memo[p] = updated
                    module._parameters[param_key] = updated

        # Second, handle the buffers if necessary
        for buffer_key in module._buffers:
            buff = module._buffers[buffer_key]
            if buff is not None and hasattr(buff, 'update') and buff.update is not None:
                if buff in memo:
                    module._buffers[buffer_key] = memo[buff]
                else:
                    updated = buff + buff.update
                    buff.update = None
                    memo[buff] = updated
                    module._buffers[buffer_key] = updated

        # Then, recurse for each submodule
        for module_key in module._modules:
            module._modules[module_key] = update_module(
                module._modules[module_key],
                updates=None,
                memo=memo,
            )

        # Finally, rebuild the flattened parameters for RNNs
        # See this issue for more details:
        # https://github.com/learnables/learn2learn/issues/139
        if hasattr(module, 'flatten_parameters'):
            module._apply(lambda x: x)
        return module
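To make the "preserves differentiability" part of the docstring concrete, here is a small sketch of the swap p ← p + u on a single `nn.Linear` layer. It is written in plain PyTorch and only mimics the idea by hand; the layer size, data, and step size are arbitrary choices for illustration, and it does not call learn2learn.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(2, 1, bias=False)
original = model.weight  # keep a handle on the original parameter

x = torch.ones(4, 2)
y = torch.zeros(4, 1)

# Inner step: gradient of the loss w.r.t. the parameter, with a graph.
loss = ((model(x) - y) ** 2).mean()
(grad,) = torch.autograd.grad(loss, model.parameters(), create_graph=True)

# Swap the parameter for its updated value, p <- p + u with u = -lr * grad.
update = -0.1 * grad
model._parameters['weight'] = original + update

# The forward pass now uses the updated (non-leaf) weight ...
outer_loss = ((model(x) - y) ** 2).mean()
outer_loss.backward()

# ... and the outer gradient still reaches the original parameter.
print("gradient reached original weight:", original.grad is not None)
```

Because the updated weight is a non-leaf tensor, it keeps the graph of the update alive for as long as the module holds it, which is why the evaluation path benefits from not building that graph in the first place.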
Quick update: this is fixed, tested, and available in the new v0.1.7 release.
Hi, I am using learn2learn and getting a memory leak error. This is the code I am using:

    # Load model weights
    model.load_state_dict(torch.load('mnist_model_weights_450.pth', map_location={'cuda:2': 'cuda:0'}))
    # Run the test data
    meta_test_loss = 0.0
    for idx, (context_x, context_y, target_x, target_y) in enumerate(test_loader):
        context_x, context_y, target_x, target_y = context_x.to(device), context_y.to(device), target_x.to(device), target_y.to(device)
        effective_batch_size = context_x.size(0)
        for i in range(effective_batch_size):
            learner = maml.clone(first_order=True)
            x_support, y_support = context_x[i], context_y[i]
            x_query, y_query = target_x[i], target_y[i]
            y_support = y_support.view(-1)
            y_query = y_query.view(-1)
            for _ in range(num_epochs):
                wts, predictions = learner(x_support)
                loss = custom_loss_function(predictions, y_support, wts)
                learner.adapt(loss)
            wts, predictions = learner(x_query)
            loss = custom_loss_function(predictions, y_query, wts)
            meta_test_loss += loss
        meta_test_loss /= effective_batch_size
        if idx % 10 == 0:
            print(f"Iteration: {idx+1}, Meta test loss: {meta_test_loss}")
    print(f"Final Meta test loss: {meta_test_loss}")

I am getting this error:
learn2learn 0.2.0 Can anyone tell me how to fix it? I wrote the training loop similarly but it runs
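One thing that may be worth checking in an evaluation loop like this (an assumption, since the error message is not shown) is whether the running total is accumulated as a tensor. Adding loss tensors together keeps every iteration's graph reachable from the total, whereas accumulating `loss.item()` stores only a Python float. The sketch below uses a toy loss and hypothetical variable names.

```python
import torch

w = torch.ones(2, requires_grad=True)

total_tensor = 0.0  # accumulates tensors: history grows each iteration
total_float = 0.0   # accumulates Python floats: no history is kept

for _ in range(3):
    loss = (w ** 2).sum()
    total_tensor = total_tensor + loss
    total_float += loss.item()

print("tensor total keeps a graph:", total_tensor.grad_fn is not None)
print("float total type:", type(total_float).__name__)
```

If only the reported number is needed, accumulating the float (or calling `loss.detach()`) avoids holding on to the graphs of earlier iterations.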
I installed learn2learn using "pip install learn2learn". When I try to run maml_miniimagenet.py (from learn2learn/examples/vision/maml_miniimagenet.py) with a batch size of 2 and shot = 1, I get the same error after 63 iterations. When I change to shot = 5, I get the error after 3 iterations.
When I look at nvidia-smi, the memory usage gradually increases with each iteration.
However, if I comment out the meta-validation loss part (line 114-112 in this script), then I don't get the memory leak problem. I think the issue is similar to Potential Memory Leak #278. I wonder what causes this issue and how it can be solved?
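For anyone trying to reproduce the growth described above from inside the script rather than through nvidia-smi, a small helper along these lines can be called once per iteration. This is a sketch: the helper name `cuda_memory_mib` is hypothetical, and it reports only memory allocated by PyTorch tensors, which can be lower than the figure nvidia-smi shows.

```python
import torch


def cuda_memory_mib(device=None):
    """Return the memory currently allocated by PyTorch tensors, in MiB.

    Returns 0.0 when no CUDA device is available, so the helper can be
    left in place on CPU-only machines.
    """
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated(device) / 2 ** 20


# Example: log the value at a few points of a (placeholder) loop.
for iteration in range(3):
    # ... run one meta-training / meta-validation iteration here ...
    print(f"iteration {iteration}: {cuda_memory_mib():.1f} MiB allocated")
```

A value that climbs steadily across iterations, with and without the meta-validation step, helps narrow down which part of the loop retains memory.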