one of the variables needed for gradient computation has been modified by an inplace operation #24996
Comments
cc @pacman100
Hi @levuloihust99, have you tried setting `find_unused_parameters=True`?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@levuloihust99 I have the same problem, did you find out the underlying reason? Thanks.
The solution is to set `model = DDP(model, broadcast_buffers=False, ...)`.
Thanks! That solves the issue. Could you explain why it happens?
Had the same issue and stumbled upon this post; the answer is right here.
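For reference, a minimal sketch of the fix suggested above. `broadcast_buffers` is a real `DistributedDataParallel` argument; everything around it (how the model is built and placed on a device) is a placeholder for whatever your training script already does.

```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# broadcast_buffers=False stops DDP from rewriting registered buffers (such as
# BertEmbeddings.position_ids) in place at the start of each forward pass,
# which can invalidate tensors that autograd saved for the backward pass.
model = DDP(
    model,  # your already-constructed, device-placed model
    device_ids=[local_rank] if torch.cuda.is_available() else None,
    broadcast_buffers=False,
)
```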
System Info
Who can help?
@ArthurZucker @younesbelkada
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I encountered the error `one of the variables needed for gradient computation has been modified by an inplace operation...` when training my model with DistributedDataParallel (DDP). My code runs smoothly when I do not use DDP. I have spent some time inspecting the problem, and below is a minimal reproduction. Suppose this code is put in a file named `debug_distributed.py`. I run it with a distributed launch command, and I get the error above.
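The original `debug_distributed.py` and its launch command are not preserved in this post, so the following is only a sketch of the kind of script that reproduces this class of error: a DDP-wrapped `BertModel` with two forward passes before `backward()`. The launch command in the comment, the tiny config, and the two-forward-pass pattern are all assumptions, not the author's exact code.

```python
# debug_distributed.py -- hypothetical reconstruction, not the author's script.
# Assumed launch: torchrun --nproc_per_node=2 debug_distributed.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import BertConfig, BertModel


def main():
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    config = BertConfig(hidden_size=64, num_hidden_layers=2,
                        num_attention_heads=2, intermediate_size=128)
    model = BertModel(config).to(device)
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    input_ids = torch.randint(0, config.vocab_size, (2, 16), device=device)

    # Two forward passes before backward: DDP broadcasts buffers at the start
    # of each forward, so the second call rewrites BertEmbeddings.position_ids
    # in place while the first call's graph still references it.
    out1 = model(input_ids=input_ids).last_hidden_state
    out2 = model(input_ids=input_ids).last_hidden_state
    (out1.sum() + out2.sum()).backward()  # raises the in-place modification error

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```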
If I do not use DDP, there is no such error. Specifically, put the following in a file named `debug_normal.py` and run `python debug_normal.py`.
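Likewise, the original `debug_normal.py` is not shown above; under the same assumptions as the sketch before, the non-DDP variant would look roughly like this:

```python
# debug_normal.py -- hypothetical reconstruction: same model and two forward
# passes as the distributed sketch, but without the DDP wrapper.
import torch
from transformers import BertConfig, BertModel

config = BertConfig(hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)
model = BertModel(config)
input_ids = torch.randint(0, config.vocab_size, (2, 16))

out1 = model(input_ids=input_ids).last_hidden_state
out2 = model(input_ids=input_ids).last_hidden_state
(out1.sum() + out2.sum()).backward()  # no buffer broadcast, so no error
print("backward finished without an in-place error")
```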
This problem prevents me from training my BertModel in distributed mode. I found that the problem lies in the line `position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]`. It seems to involve an "inplace operation", as the error suggests. If I change that line to `position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length].clone()`, the problem goes away. I think this is more of a PyTorch issue; it may even be a PyTorch bug. However, the simplest workaround is to add a `.clone()` as shown above. Currently, `transformers` versions `>=4` use this "inplace operation", so all `>=4` versions of `transformers` will hit this error. So, is there any way to fix the problem properly, so I don't need to change library (`transformers`) code? (The relevant library lines are sketched after the "Expected behavior" section below.)

Expected behavior
BertModel works in distributed training with DistributedDataParallel
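For context, a paraphrase of the relevant lines in `BertEmbeddings.forward` (the exact surrounding code differs across `transformers` versions), with the workaround described above applied:

```python
# Paraphrased from transformers' BertEmbeddings.forward, not an exact copy.
if position_ids is None:
    # Original: a view into the registered position_ids buffer; DDP's buffer
    # broadcast later overwrites that buffer in place.
    # position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

    # Workaround from this issue: copy the slice so it no longer shares
    # storage with the buffer.
    position_ids = self.position_ids[
        :, past_key_values_length : seq_length + past_key_values_length
    ].clone()
```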