[FYI] Bug in R2+1D implementation #1265
Comments
This is related to the discussion in #1224 and we should follow up accordingly.
All right. For now I'll leave the models as they are, because they actually yield better performance. If it's OK with you, I'll just send a PR documenting the issue, and then close it?
Sounds good to me
After more thorough examination, and help from Du et al., it seems there is a conceptual misunderstanding of the equation for computing midplanes in their paper. Here I propose a solution (BC-breaking, unfortunately) and a more thorough discussion.

Background

In order to match the number of parameters of a "vanilla" 3D ResNet, R(2+1)D models actually increase the number of feature maps in the separated convolutions. This increase is done according to the equation

M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))

where N_{i-1} and N_i are the input and output channel counts and t x d x d is the 3D kernel size (t = d = 3 here).

Issue

This equation (and the paper) doesn't specify whether N_{i-1} and N_i refer to whole residual blocks or to individual layers. The difference is that, under our per-block interpretation, the downsampling layer of a simple block will always have fewer parameters (and thus the model will have fewer parameters overall). For example, the original implementation would look like the following (for the downsampling layer only; the rest of the layers have the same number of parameters):
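The original code snippets did not survive in this copy of the thread. As a sketch, the midplanes equation can be written as a small Python helper (t = 3 temporal, d = 3 spatial extent) and evaluated for an illustrative 64 -> 128 downsampling simple block; the concrete widths are assumptions, not values recovered from the thread:

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    """Middle feature maps that keep a (1,d,d)+(t,1,1) factorization at
    (roughly) the parameter count of a full t*d*d 3D convolution."""
    return (t * d * d * in_planes * out_planes) // (d * d * in_planes + t * out_planes)

# Downsampling simple block, 64 -> 128: the per-conv interpretation gives
# each separated conv its own value, while the per-block interpretation
# reuses the first value for both convs.
print(midplanes(64, 128))   # first conv (64 -> 128): 230
print(midplanes(128, 128))  # second conv (128 -> 128): 288, vs 230 if reused
```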
while in our case, we'd have
all the way through. In the bottleneck layers, on the other hand, we have the opposite issue: we'd always be overestimating the middle planes, because the output of the separated convolution is always smaller than the output of the whole block. For example
whereas according to the block formula for the second bottleneck, one would expect the middle layer to be expanded as well, i.e.
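The numeric examples here are also missing from this copy of the thread. A hedged reconstruction using the paper's formula: the channel counts below (a second bottleneck whose block receives 256 = 64 * expansion input channels while its separated conv only maps 64 -> 64) are illustrative assumptions:

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    # Middle planes matching a full t*d*d 3D conv's parameter count.
    return (t * d * d * in_planes * out_planes) // (d * d * in_planes + t * out_planes)

# Second bottleneck of the 64-wide stage (expansion = 4): the block sees
# 256 input channels, but its separated 3x3x3 conv only maps 64 -> 64.
print(midplanes(256, 64))  # block-level formula: 177 middle planes
print(midplanes(64, 64))   # per-conv formula:    144 middle planes
```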
Solution

The solution is simple and should further boost the performance of the torchvision R(2+1)D models, but it is BC-breaking. We could simply replace our R2+1D module with
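The proposed replacement module did not survive in this copy of the thread either. Below is a hedged sketch of what it could look like, mirroring torchvision's `Conv2Plus1D` layer layout but computing `midplanes` from the convolution's own channel counts instead of receiving it from the block; the exact layout is an assumption:

```python
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Sequential):
    """(2+1)D separated convolution that derives its own midplanes,
    rather than taking a block-level value as an argument."""

    def __init__(self, in_planes, out_planes, stride=1, padding=1):
        # Per-conv midplanes computation (the fix proposed in this thread).
        midplanes = (in_planes * out_planes * 3 * 3 * 3) // (
            in_planes * 3 * 3 + 3 * out_planes)
        super().__init__(
            # spatial 1x3x3 conv
            nn.Conv3d(in_planes, midplanes, kernel_size=(1, 3, 3),
                      stride=(1, stride, stride),
                      padding=(0, padding, padding), bias=False),
            nn.BatchNorm3d(midplanes),
            nn.ReLU(inplace=True),
            # temporal 3x1x1 conv
            nn.Conv3d(midplanes, out_planes, kernel_size=(3, 1, 1),
                      stride=(stride, 1, 1),
                      padding=(padding, 0, 0), bias=False),
        )


conv = Conv2Plus1D(64, 64)
x = torch.randn(1, 64, 4, 8, 8)
print(conv[0].out_channels)   # 144 middle planes for a 64 -> 64 conv
print(tuple(conv(x).shape))   # (1, 64, 4, 8, 8)
```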
This would also allow us to remove midplanes from the blocks, which would in turn simplify them a bit. cc @fmassa per offline discussion
Removing `midplanes` sounds good. I didn't quite understand why the proposed solution would change the model at all, though; it just moves the computation from

vision/torchvision/models/video/resnet.py, line 87 (at 17e355f)

to

vision/torchvision/models/video/resnet.py, line 45 (at 17e355f)

Also, didn't the models have the same number of parameters as the Caffe2 equivalent?
Yes. Or at least that's how it is implemented in the repo of the original paper.

Since an example is worth a thousand words, imagine a fictional r25d that has multiple simple blocks in its first residual stage.

We have two places where we pass `midplanes`. Moving the calculation of midplanes to each convolution definition rather than to each residual block definition fixes this issue.
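The fictional example above can be made concrete with a quick parameter count; the widths (a 64 -> 128 downsampling simple block, looking at its second separated conv) and the `sep_params` helper are illustrative inventions, not torchvision code:

```python
def midplanes(in_planes, out_planes, t=3, d=3):
    # Middle planes matching a full t*d*d 3D conv's parameter count.
    return (t * d * d * in_planes * out_planes) // (d * d * in_planes + t * out_planes)

def sep_params(in_planes, out_planes, mid):
    # Weights of the (1,3,3) conv plus the (3,1,1) conv; biases/BN ignored.
    return in_planes * mid * 9 + mid * out_planes * 3

full_3d   = 128 * 128 * 27                             # vanilla 3x3x3 conv, 128 -> 128
per_conv  = sep_params(128, 128, midplanes(128, 128))  # midplanes from the conv itself
per_block = sep_params(128, 128, midplanes(64, 128))   # midplanes reused from the block's 64 -> 128 entry

print(full_3d, per_conv, per_block)  # 442368 442368 353280
```

Note that the per-conv interpretation exactly matches the vanilla 3D conv's parameter count here, while the per-block one undershoots, which is why the simple blocks end up with fewer parameters overall.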
Yes, but only up to the 3rd significant digit (the number of parameters in Caffe2 is given as 33.8M), which allows for quite a lot of variation. It's not a "major" difference in the number of parameters, but it's enough to make it incompatible with the Caffe2 pretrained models.
OK, sounds good. As this is a BC-breaking change, I believe we should do something similar to what we did for MNasNet in #1224, with the
Yeah, I'll take a stab at that
I found that there is a lack of clarity, in both the original R(2+1)D paper and the official code, for models using Bottleneck layers, which makes it impossible to transfer weights from large pretrained Caffe2 models to the models implemented in torchvision.

For background info, see the question in their repo.

The fix is very straightforward (change the bottleneck midplanes computation), but whether we should apply it is an open question, which I suspect should be based on the authors' answer. I'm leaving this here just so that people are aware of it.