Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
VideoMAE | no | ViT-B | 800 | 16x5x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 79.4 | 94.1 |
VideoMAE | no | ViT-B | 800 | 16x5x3 | same as above | TODO | 80.4 | 94.4 |
VideoMAE | no | ViT-B | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 80.9 | 94.7 |
VideoMAE | no | ViT-L | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 84.7 | 96.5 |

Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
VideoMAE | no | ViT-B | 800 | 16x2x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 69.6 | 92.0 |
VideoMAE | no | ViT-B | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 70.6 | 92.6 |
- We report the results of VideoMAE fine-tuned with I3D dense sampling on Kinetics400 and uniform sampling on Something-Something V2, respectively.
- #Frame = #input_frame x #clip x #crop.
- #input_frame is the number of frames fed to the model per clip during the test phase.
- #crop is the number of spatial crops (e.g., 3 for left/right/center crops).
- #clip is the number of temporal clips (e.g., 5 means sampling five clips with different start indices).
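
As a rough illustration of the #Frame notation above, the sketch below parses a value such as `16x5x3` and derives the number of test views (#clip x #crop) and the total frames processed per video. The helper names (`FrameSpec`, `parse_frame_spec`, `average_view_scores`) are hypothetical and not part of this repository. Averaging per-view scores is shown only as one common way to fuse multi-view predictions; it is an assumption, not a description of the evaluation script used here.

```python
from typing import List, NamedTuple


class FrameSpec(NamedTuple):
    """Hypothetical container for a '#input_frame x #clip x #crop' entry."""
    input_frames: int  # frames fed to the model per view
    clips: int         # temporal clips per video
    crops: int         # spatial crops per clip

    @property
    def views(self) -> int:
        # Each (clip, crop) pair is one forward pass at test time.
        return self.clips * self.crops

    @property
    def total_frames(self) -> int:
        # Total frames processed per video across all views.
        return self.input_frames * self.views


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a string such as '16x5x3' in the order used by the #Frame column."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    input_frames, clips, crops = (int(p) for p in parts)
    return FrameSpec(input_frames, clips, crops)


def average_view_scores(view_scores: List[List[float]]) -> List[float]:
    """Average per-class scores over views (assumed fusion rule, see lead-in)."""
    if not view_scores:
        raise ValueError("view_scores must contain at least one view")
    n_views = len(view_scores)
    return [sum(per_class) / n_views for per_class in zip(*view_scores)]


if __name__ == "__main__":
    for text in ("16x5x3", "16x2x3"):
        spec = parse_frame_spec(text)
        print(text, "views:", spec.views, "total frames:", spec.total_frames)
```

Under this reading, `16x5x3` corresponds to 15 views per video and `16x2x3` to 6 views per video.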