This is an unofficial PaddlePaddle implementation of PiT (ICCV 2021): Rethinking Spatial Dimensions of Vision Transformers.
Official PyTorch repo: PiT.
From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model.
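To make the dimension-reduction idea concrete, below is a minimal sketch of a PiT-style pooling stage (my own illustration with assumed names, not the repo's actual layer; see pit.py for that): the token sequence is reshaped back into a 2D grid, downsampled by a strided depthwise convolution that trades spatial size for channel width, and flattened again, while the class token, having no spatial extent, is resized with a plain linear projection.

```python
import paddle
import paddle.nn as nn

class TokenPoolingSketch(nn.Layer):
    """Sketch of PiT-style pooling; the repo's real layer is in pit.py.
    Assumes out_dim is a multiple of in_dim (PiT doubles the channels)."""
    def __init__(self, in_dim, out_dim, stride=2):
        super().__init__()
        # Strided depthwise-style conv: halves H and W while raising channels.
        self.conv = nn.Conv2D(in_dim, out_dim, kernel_size=stride + 1,
                              stride=stride, padding=stride // 2, groups=in_dim)
        # The class token has no spatial extent, so it is resized linearly.
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, tokens, cls_token):
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)  # assumes a square token grid
        x = tokens.transpose([0, 2, 1]).reshape([B, C, H, W])
        x = self.conv(x)
        x = x.flatten(start_axis=2).transpose([0, 2, 1])  # back to (B, N', C')
        return x, self.fc(cls_token)
```

For example, `TokenPoolingSketch(64, 128)` turns a 14×14 grid of 196 tokens of width 64 into a 7×7 grid of 49 tokens of width 128.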
The reproduced results are summarized in the table below:
Model | Original Acc@1 | Reproduced Acc@1 | Image Size | Batch Size | Crop_pct | Epochs |
---|---|---|---|---|---|---|
pit_ti | 73.0 | 72.97 | 224 | 256 × 4 GPUs | 0.9 | 300 (+10 cooldown) |
Note that the result in the table above was obtained on the validation set of ILSVRC2012. The Acc@1 on the validation set of Light_ILSVRC2012 is 73.17.
The model parameters and training logs have been placed in the output folder.
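If you prefer to load the released weights directly rather than going through the scripts, something along these lines should work. This is a sketch: `build_pit` is a placeholder for however pit.py actually constructs the model, and the `.pdparams` suffix is assumed from PaddlePaddle's checkpoint convention.

```python
import paddle
from pit import build_pit  # hypothetical builder name; check pit.py for the real one

model = build_pit()                              # pit_ti configuration assumed
state = paddle.load('output/Best_PiT.pdparams')  # .pdparams suffix assumed
model.set_state_dict(state)
model.eval()
```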
The dataset used is ImageNet-1k 2012, i.e. ILSVRC2012.
ILSVRC2012 is a large classification dataset of about 144 GB. It contains 1,281,167 training images and 50,000 validation images spanning 1,000 object categories.
To save time, this repo uses a lightweight version of ILSVRC2012 named Light_ILSVRC2012, which is 65 GB in size. Download links: Light_ILSVRC2012_part_0.tar and Light_ILSVRC2012_part_1.tar.
You should arrange the dataset following this structure:
```
imagenet/
├── train/
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   ├── ......
│   ├── ......
├── val/
│   ├── n01440764
│   │   ├── ILSVRC2012_val_00000293.JPEG
│   │   ├── ILSVRC2012_val_00002138.JPEG
│   │   ├── ......
│   ├── ......
```
You may also find this helpful.
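After downloading and extracting both parts, a quick sanity check (my own snippet, not part of the repo; adjust `root` to your actual path) can confirm that the layout matches the tree above:

```python
import os

root = 'datasets/ImageNet1K'  # adjust to wherever you extracted the data
for split in ('train', 'val'):
    split_dir = os.path.join(root, split)
    classes = [d for d in os.listdir(split_dir) if d.startswith('n')]
    n_images = sum(len(files) for _, _, files in os.walk(split_dir))
    print(f'{split}: {len(classes)} classes, {n_images} images')
# Both splits should report 1000 class folders. The image counts above
# (1,281,167 train / 50,000 val) are for the full ILSVRC2012;
# Light_ILSVRC2012 may differ.
```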
My Environment:
- Python: 3.7.11
- PaddlePaddle: 2.2.2
- yacs==0.1.8
- scipy
- pyyaml
- Hardware: Tesla V100 × 4 (many thanks to the Baidu PaddlePaddle platform for the compute)
```shell
git clone https://github.com/hatimwen/paddle_pit.git
cd paddle_pit
```
Before running anything, adjust the scripts in scripts/ to your actual needs.
Evaluation:

- On a multi-GPU machine: `sh scripts/run_eval_multi.sh`
- On a single-GPU machine: `sh scripts/run_eval.sh`

Training:

- On a multi-GPU machine: `sh scripts/run_train_multi.sh`
- On a single-GPU machine: `sh scripts/run_train.sh`
```shell
python predict.py \
    -pretrained='output/Best_PiT' \
    -img_path='images/ILSVRC2012_val_00004506.JPEG'
```
Output:

```
class_id: 244, prob: 0.8468140959739685
```

Clearly, the output is in line with expectations.
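For reference, here is roughly what such a prediction entails, tying the Image Size and Crop_pct columns of the results table to code. This is a hedged sketch, not the actual predict.py: `build_pit` is a placeholder as before, and the normalization constants are the standard ImageNet statistics, assumed to match the repo's config.

```python
import paddle
import paddle.vision.transforms as T
from PIL import Image

from pit import build_pit  # hypothetical builder, as in the loading sketch above

# Crop_pct 0.9 with a 224 crop => resize the short side to int(224 / 0.9) = 248.
transform = T.Compose([
    T.Resize(248, interpolation='bicubic'),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics,
                std=[0.229, 0.224, 0.225]),   # assumed to match the repo's config
])

model = build_pit()
model.set_state_dict(paddle.load('output/Best_PiT.pdparams'))
model.eval()

img = transform(Image.open('images/ILSVRC2012_val_00004506.JPEG').convert('RGB'))
probs = paddle.nn.functional.softmax(model(img.unsqueeze(0)), axis=-1)
print('class_id:', int(paddle.argmax(probs)), 'prob:', float(paddle.max(probs)))
```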
```
|-- paddle_pit
    |-- output
    |-- configs
        |-- pit_ti.yaml
    |-- datasets
        |-- ImageNet1K
    |-- scripts
        |-- run_train.sh
        |-- run_train_multi.sh
        |-- run_eval.sh
        |-- run_eval_multi.sh
    |-- augment.py
    |-- config.py
    |-- datasets.py
    |-- droppath.py
    |-- losses.py
    |-- main_multi_gpu.py
    |-- main_single_gpu.py
    |-- mixup.py
    |-- model_ema.py
    |-- pit.py
    |-- random_erasing.py
    |-- regnet.py
    |-- transforms.py
    |-- utils.py
    |-- README.md
    |-- requirements.txt
```
Info | Description |
---|---|
Author | Hatimwen |
E-mail | [email protected] |
Date | 2022.01 |
Version | PaddlePaddle 2.2.2 |
Field | Classification |
Supported Devices | GPU |
AI Studio | AI Studio |
```bibtex
@inproceedings{heo2021pit,
    title={Rethinking Spatial Dimensions of Vision Transformers},
    author={Byeongho Heo and Sangdoo Yun and Dongyoon Han and Sanghyuk Chun and Junsuk Choe and Seong Joon Oh},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2021},
}
```
Last but not least, many thanks to PaddlePaddle for hosting the 5th PaddlePaddle Paper Reproduction Challenge (飞桨论文复现挑战赛第五期), from which I learned a lot. I also sincerely thank Dr. Zhu's team for their PaddleViT, since most of my code is adapted from it, apart from the full alignment of the training process.