[FYI] Bug in R2+1D implementation #1265
Comments
This is related to the discussion in #1224 and we should follow up accordingly.
All right. For now I'll leave the models as they are, because they actually yield better performance. If it's OK with you, I'll just send a PR documenting the issue, and then close it?
Sounds good to me
After more thorough examination, and help from Du et al., it seems there is a conceptual misunderstanding of the equation for computing midplanes in their paper. Here I propose a solution (BC-breaking, unfortunately) and a more thorough discussion.

Background

In order to match the number of parameters of a "vanilla" 3D ResNet, R(2+1)D models actually increase the number of feature maps in the separated convolutions. This increase is done according to the equation

M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))

where N_{i-1} and N_i are the input and output channel counts and t x d x d is the 3D kernel size (t = d = 3 here).

Issue

This equation (and the paper) doesn't specify whether N_{i-1} and N_i refer to whole residual blocks or to individual layers. The difference is that, under our per-block interpretation, the downsampling layer of a simple block will always have fewer parameters (and thus the model will have fewer parameters overall). For example, the original implementation would look like the following (for the downsampling layer only; the rest of the layers have the same number of parameters):
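The original code snippets did not survive in this copy of the thread. As a sketch, the midplanes equation can be written as a small Python helper (t = 3 temporal, d = 3 spatial extent) and evaluated for an illustrative 64 -> 128 downsampling simple block; the concrete widths are assumptions, not values recovered from the thread:

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    """Middle feature maps that keep a (1,d,d)+(t,1,1) factorization at
    (roughly) the parameter count of a full t*d*d 3D convolution."""
    return (t * d * d * in_planes * out_planes) // (d * d * in_planes + t * out_planes)

# Downsampling simple block, 64 -> 128: the per-conv interpretation gives
# each separated conv its own value, while the per-block interpretation
# reuses the first value for both convs.
print(midplanes(64, 128))   # first conv (64 -> 128): 230
print(midplanes(128, 128))  # second conv (128 -> 128): 288, vs 230 if reused
```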
while in our case, we'd have
all the way through. In the bottleneck layers, on the other hand, we have the opposite issue: we'd always be overestimating the middle planes, because the output of the separated convolution is always smaller than the output of the whole block. For example
whereas according to the block formula for the second bottleneck, one would expect the middle layer to be expanded as well, i.e.
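The numeric examples here are also missing from this copy of the thread. A hedged reconstruction using the paper's formula: the channel counts below (a second bottleneck whose block receives 256 = 64 * expansion input channels while its separated conv only maps 64 -> 64) are illustrative assumptions:

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    # Middle planes matching a full t*d*d 3D conv's parameter count.
    return (t * d * d * in_planes * out_planes) // (d * d * in_planes + t * out_planes)

# Second bottleneck of the 64-wide stage (expansion = 4): the block sees
# 256 input channels, but its separated 3x3x3 conv only maps 64 -> 64.
print(midplanes(256, 64))  # block-level formula: 177 middle planes
print(midplanes(64, 64))   # per-conv formula:    144 middle planes
```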
Solution

The solution is simple and should further boost the performance of the torchvision R(2+1)D models, but it is BC-breaking. We could simply replace our R2+1D module with
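The proposed replacement module did not survive in this copy of the thread either. Below is a hedged sketch of what it could look like, mirroring torchvision's `Conv2Plus1D` layer layout but computing `midplanes` from the convolution's own channel counts instead of receiving it from the block; the exact layout is an assumption:

```python
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Sequential):
    """(2+1)D separated convolution that derives its own midplanes,
    rather than taking a block-level value as an argument."""

    def __init__(self, in_planes, out_planes, stride=1, padding=1):
        # Per-conv midplanes computation (the fix proposed in this thread).
        midplanes = (in_planes * out_planes * 3 * 3 * 3) // (
            in_planes * 3 * 3 + 3 * out_planes)
        super().__init__(
            # spatial 1x3x3 conv
            nn.Conv3d(in_planes, midplanes, kernel_size=(1, 3, 3),
                      stride=(1, stride, stride),
                      padding=(0, padding, padding), bias=False),
            nn.BatchNorm3d(midplanes),
            nn.ReLU(inplace=True),
            # temporal 3x1x1 conv
            nn.Conv3d(midplanes, out_planes, kernel_size=(3, 1, 1),
                      stride=(stride, 1, 1),
                      padding=(padding, 0, 0), bias=False),
        )


conv = Conv2Plus1D(64, 64)
x = torch.randn(1, 64, 4, 8, 8)
print(conv[0].out_channels)   # 144 middle planes for a 64 -> 64 conv
print(tuple(conv(x).shape))   # (1, 64, 4, 8, 8)
```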
This would also allow us to remove midplanes from the blocks, which would in turn simplify them a bit. cc @fmassa per offline discussion
Removing `midplanes` sounds good. I didn't quite understand why the proposed solution would change the model at all, though; it just moves the computation from

vision/torchvision/models/video/resnet.py, line 87 (at 17e355f)

to

vision/torchvision/models/video/resnet.py, line 45 (at 17e355f)

Also, didn't the models have the same number of parameters as the Caffe2 equivalent?
Yes. Or at least that's how it is implemented in the repo of the original paper.

Since an example is worth a thousand words, imagine a fictional r25d that has multiple simple blocks in its first residual stage.

We have two places where we pass `midplanes`. Moving the calculation of midplanes to each convolution definition rather than to each residual block definition fixes this issue.
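The fictional example above can be made concrete with a quick parameter count; the widths (a 64 -> 128 downsampling simple block, looking at its second separated conv) and the `sep_params` helper are illustrative inventions, not torchvision code:

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    # Middle planes matching a full t*d*d 3D conv's parameter count.
    return (t * d * d * in_planes * out_planes) // (d * d * in_planes + t * out_planes)

def sep_params(in_planes, out_planes, mid):
    # Weights of the (1,3,3) conv plus the (3,1,1) conv; biases/BN ignored.
    return in_planes * mid * 9 + mid * out_planes * 3

full_3d   = 128 * 128 * 27                             # vanilla 3x3x3 conv, 128 -> 128
per_conv  = sep_params(128, 128, midplanes(128, 128))  # midplanes from the conv itself
per_block = sep_params(128, 128, midplanes(64, 128))   # midplanes reused from the block's 64 -> 128 entry

print(full_3d, per_conv, per_block)  # 442368 442368 353280
```

Note that the per-conv interpretation exactly matches the vanilla 3D conv's parameter count here, while the per-block one undershoots, which is why the simple blocks end up with fewer parameters overall.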
Yes, but only up to the 3rd significant digit (the number of parameters in Caffe2 is given as 33.8M), which allows for quite a lot of variation. It's not a "major" difference in the number of parameters, but it's enough to make it incompatible with the Caffe2 pretrained models.
OK, sounds good. As this is a BC-breaking change, I believe we should do something similar to what we did for MNasNet in #1224, with the
Yeah, I'll take a stab at that
I found that there is a lack of clarity, in both the original R(2+1)D paper and the official code, for models using Bottleneck layers, which makes it impossible to transfer weights from large pretrained Caffe2 models to the models implemented in torchvision.

For background info, see the question in their repo.

The fix is very straightforward (change the bottleneck midplanes computation), but whether we should apply it is an open question, which I suspect should be based on the authors' answer. I'm leaving this here just so that people are aware of it.