
# Implementation of PiT (ICCV 2021) Based on PaddlePaddle

This is an unofficial PaddlePaddle implementation of PiT (ICCV 2021): *Rethinking Spatial Dimensions of Vision Transformers*.

English | 简体中文

## 1 Introduction

From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model.

The overall algorithm is summarized in the figure below:

*Figure: Algorithm of PiT*
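To make the idea concrete, below is a minimal sketch of a PiT-style pooling stage in PaddlePaddle. It is illustrative only (the repo's actual implementation lives in `pit.py`): patch tokens are reshaped back into a 2D feature map, downsampled with a strided depthwise convolution, and the class token is projected separately.

```python
import paddle
import paddle.nn as nn

class PoolingSketch(nn.Layer):
    """Illustrative PiT-style pooling block; see pit.py for the real one."""
    def __init__(self, in_dim, out_dim, stride=2):
        super().__init__()
        # Depthwise strided conv halves H and W while widening channels;
        # out_dim must be a multiple of in_dim (PiT doubles the width).
        self.conv = nn.Conv2D(in_dim, out_dim, kernel_size=stride + 1,
                              stride=stride, padding=stride // 2, groups=in_dim)
        self.fc = nn.Linear(in_dim, out_dim)  # the class token is projected separately

    def forward(self, x, cls_token):
        # x: [B, N, C] patch tokens; cls_token: [B, 1, C]
        B, N, C = x.shape
        H = W = int(N ** 0.5)
        x = x.transpose([0, 2, 1]).reshape([B, C, H, W])  # tokens -> feature map
        x = self.conv(x)                                   # spatial reduction
        x = x.flatten(2).transpose([0, 2, 1])              # feature map -> tokens
        return x, self.fc(cls_token)
```

Stacking such a pooling stage between groups of transformer blocks gives PiT its CNN-like layout: spatial resolution shrinks and channel width grows as depth increases.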

## 2 Accuracy

| Model  | Original Acc@1 | Reproduced Acc@1 | Image Size | Batch Size   | Crop_pct | Epochs             |
| ------ | -------------- | ---------------- | ---------- | ------------ | -------- | ------------------ |
| pit_ti | 73.0           | 72.97            | 224        | 256 * 4 GPUs | 0.9      | 300 (+10 cooldown) |
- **Note:** the result in the table above was obtained on the validation set of ILSVRC2012. It is worth mentioning that the Acc@1 on the validation set of Light_ILSVRC2012 is 73.17.

- The model parameters and training logs are placed in the `output` folder.

## 3 Dataset

ImageNet-1k 2012, i.e. ILSVRC2012.

- ILSVRC2012 is a large classification dataset of about 144 GB. It contains 1,281,167 training images and 50,000 validation images covering 1,000 object categories.

- To save time, this repo uses a lightweight version of ILSVRC2012, named Light_ILSVRC2012, which is about 65 GB. The download links are: Light_ILSVRC2012_part_0.tar and Light_ILSVRC2012_part_1.tar.

- You should arrange the dataset in the following structure:

  ```
  imagenet/
  ├── train/
  │   ├── n01440764
  │   │   ├── n01440764_10026.JPEG
  │   │   ├── n01440764_10027.JPEG
  │   │   ├── ......
  │   ├── ......
  ├── val/
  │   ├── n01440764
  │   │   ├── ILSVRC2012_val_00000293.JPEG
  │   │   ├── ILSVRC2012_val_00002138.JPEG
  │   │   ├── ......
  │   ├── ......
  ```

    You may also find this helpful.
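For reference, here is a minimal sketch of reading this folder layout with Paddle's built-in `DatasetFolder`; the repository's actual pipeline (with the full augmentation stack) is implemented in `datasets.py`, and the transform values below are assumptions for illustration.

```python
import paddle
from paddle.vision import transforms
from paddle.vision.datasets import DatasetFolder

# Each class sub-directory (e.g. n01440764) is mapped to one label index.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
val_set = DatasetFolder('imagenet/val', transform=val_transform)
val_loader = paddle.io.DataLoader(val_set, batch_size=64, shuffle=False)
```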

## 4 Environment

My Environment:

- Python: 3.7.11
- PaddlePaddle: 2.2.2
- yacs==0.1.8
- scipy
- pyyaml
- Hardware: Tesla V100 × 4 (many thanks to the Baidu PaddlePaddle platform for providing the hardware)
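As a quick sanity check of the environment (assuming a GPU build of PaddlePaddle is installed), the following snippet verifies the version and GPU availability:

```python
import paddle

print(paddle.__version__)                     # 2.2.2 in the setup above
print(paddle.device.is_compiled_with_cuda())  # should be True for GPU training
paddle.utils.run_check()                      # runs a small built-in check on the available GPUs
```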

## 5 Quick start

### Step 1: Clone the repository

```shell
git clone https://github.com/hatimwen/paddle_pit.git
cd paddle_pit
```

### Step 2: Change arguments

Please modify the arguments in the scripts under `scripts/` that you want to run, according to your needs.

### Step 3: Evaluation

- For a multi-GPU machine:

  ```shell
  sh scripts/run_eval_multi.sh
  ```

- For a single-GPU machine:

  ```shell
  sh scripts/run_eval.sh
  ```

### Step 4: Training

- For a multi-GPU machine:

  ```shell
  sh scripts/run_train_multi.sh
  ```

- For a single-GPU machine:

  ```shell
  sh scripts/run_train.sh
  ```

### Step 5: Prediction

```shell
python predict.py \
    -pretrained='output/Best_PiT' \
    -img_path='images/ILSVRC2012_val_00004506.JPEG'
```

*Input image (class id: 244)*

Output results:

```
class_id: 244, prob: 0.8468140959739685
```

Clearly, the output is in line with expectations.
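For context, `predict.py` essentially performs the standard single-image inference loop. The sketch below is illustrative only; the helper `predict_image` and the preprocessing values are assumptions, not the repo's exact code.

```python
import paddle
import paddle.nn.functional as F
from PIL import Image
from paddle.vision import transforms

def predict_image(model, img_path):
    """Hypothetical helper: preprocess one image and return (class_id, prob)."""
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    img = preprocess(Image.open(img_path).convert('RGB')).unsqueeze(0)  # [1, 3, 224, 224]
    model.eval()
    with paddle.no_grad():
        prob = F.softmax(model(img), axis=-1)  # class probabilities
    class_id = int(paddle.argmax(prob, axis=-1).item())
    return class_id, float(prob[0, class_id].item())
```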

## 6 Code Structure and Description

```
|-- paddle_pit
    |-- output
    |-- configs
        |-- pit_ti.yaml
    |-- datasets
        |-- ImageNet1K
    |-- scripts
        |-- run_train.sh
        |-- run_train_multi.sh
        |-- run_eval.sh
        |-- run_eval_multi.sh
    |-- augment.py
    |-- config.py
    |-- datasets.py
    |-- droppath.py
    |-- losses.py
    |-- main_multi_gpu.py
    |-- main_single_gpu.py
    |-- mixup.py
    |-- model_ema.py
    |-- pit.py
    |-- random_erasing.py
    |-- regnet.py
    |-- transforms.py
    |-- utils.py
    |-- README.md
    |-- requirements.txt
```

## 8 Model info

| Info              | Description        |
| ----------------- | ------------------ |
| Author            | Hatimwen           |
| Email             | [email protected] |
| Date              | 2022.01            |
| Version           | PaddlePaddle 2.2.2 |
| Field             | Classification     |
| Supported Devices | GPU                |
| AI Studio         | AI Studio          |

## 9 Citation

```
@inproceedings{heo2021pit,
    title={Rethinking Spatial Dimensions of Vision Transformers},
    author={Byeongho Heo and Sangdoo Yun and Dongyoon Han and Sanghyuk Chun and Junsuk Choe and Seong Joon Oh},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2021},
}
```

Last but not least, many thanks to PaddlePaddle for hosting the 飞桨论文复现挑战赛（第五期） (the 5th PaddlePaddle Paper Reproduction Challenge), from which I learned a lot. I am also very grateful to Dr. Zhu's team for PaddleViT, since most of the code here is adapted from it, apart from the full alignment of the training process. ♥️