When upgrading from MONAI 0.9.0 to 1.0.0, my 3D segmentation code fails when using DistributedDataParallel (multi-GPU), most likely due to the new MetaTensor returned by the transforms.

The error is:

RuntimeError: Output 0 of SyncBatchNormBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.

The same issue was reported (for 2D MIL classification) in #5081 and #5198.

I've traced it down to commit 63e36b6; prior to that commit, the code works fine.
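For context, here is a minimal sketch (not my actual training code; the UNet, loss, and data shapes below are placeholders) of the kind of DDP + SyncBatchNorm training step where the error shows up once the batches arrive as MetaTensor:

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> repro.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from monai.data import MetaTensor
from monai.networks.nets import UNet


def main():
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    # Placeholder 3D segmentation network with batch norm layers,
    # so convert_sync_batchnorm actually inserts SyncBatchNorm.
    model = UNet(
        spatial_dims=3, in_channels=1, out_channels=2,
        channels=(16, 32, 64), strides=(2, 2), norm="batch",
    )
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(rank)
    model = DDP(model, device_ids=[rank])

    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(2):
        # With MONAI 1.0.0 transforms the DataLoader hands back MetaTensor
        # batches; synthetic MetaTensors keep this sketch self-contained.
        images = MetaTensor(torch.rand(2, 1, 32, 32, 32)).cuda(rank)
        labels = torch.randint(0, 2, (2, 32, 32, 32)).cuda(rank)

        optimizer.zero_grad()
        outputs = model(images)              # forward through SyncBatchNorm
        loss_fn(outputs, labels).backward()  # backward() is where the RuntimeError surfaces on 1.0.0
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```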
wyli changed the title from "MetaTensor and DistributedDataParallel. bug" to "MetaTensor and DistributedDataParallel. bug (SyncBatchNormBackward is a view and is being modified inplace)" on Oct 7, 2022.
It seems the issue is that the DataLoader now returns the data as MetaTensor (and not torch.Tensor as before). For example, here https://github.com/Project-MONAI/tutorials/blob/main/pathology/multiple_instance_learning/panda_mil_train_evaluate_pytorch_gpu.py#L51 both data and target are MetaTensor types.

If I explicitly convert them back to torch.Tensor (on GPU or CPU), the code runs fine, but a bit slower. So there seems to be something wrong with MetaTensor.
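A minimal sketch of the explicit conversion workaround I mean (not necessarily the exact snippet; MetaTensor.as_tensor() just strips the metadata wrapper and should not copy the underlying data):

```python
import torch
from monai.data import MetaTensor


def to_plain_tensor(x: torch.Tensor) -> torch.Tensor:
    # Drop the MetaTensor wrapper so the DDP/SyncBatchNorm model only sees torch.Tensor.
    return x.as_tensor() if isinstance(x, MetaTensor) else x


# in the training loop, before the forward pass, e.g.:
#   data, target = to_plain_tensor(data).cuda(rank), to_plain_tensor(target).cuda(rank)
```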