- Published in IEEE Transactions on Instrumentation and Measurement on March 5th, 2025.
➡️ Read the paper on IEEE Xplore
➡️ View the paper (PDF)
We propose an Attention-based CNN (A-CNN) model that addresses the challenges of small object classification in real-world manufacturing environments. Unlike traditional CNNs that struggle with object-to-image area ratio (OAR) constraints, our model leverages an attention mechanism to dynamically focus on small objects, achieving superior classification accuracy and efficiency.
Key innovations of our model include:
- Integration of an Attention module to adaptively extract Regions of Interest (ROI), increasing OAR without manual preprocessing.
- A multi-task learning framework that enables end-to-end training with minimal data labeling (only 5% of the dataset labeled), significantly reducing human effort and time (a loss sketch follows below).
- Real-time inference on an edge device (NVIDIA Jetson Nano), reaching up to 67.1 fps while maintaining high accuracy (up to 99.92%).
These contributions ensure that our A-CNN is not only effective but also practical for deployment in resource-constrained environments, such as automated optical inspection (AOI) systems.
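To make the multi-task idea concrete, here is a minimal sketch of a joint objective of the kind described: a classification loss on every image plus an ROI-center regression loss on the small labeled subset. The loss form, the `lam` weight, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def multitask_loss(class_logits, class_targets,
                   pred_centers, gt_centers, has_center, lam=1.0):
    """Hypothetical joint loss: cross-entropy on all samples, plus an
    ROI-center regression term on the ~5% of samples with location labels.

    class_logits: (B, num_classes); pred/gt_centers: (B, 2), normalized;
    has_center: (B,) bool mask marking the labeled subset.
    """
    cls_loss = F.cross_entropy(class_logits, class_targets)
    if has_center.any():
        loc_loss = F.mse_loss(pred_centers[has_center], gt_centers[has_center])
    else:
        loc_loss = class_logits.new_zeros(())  # no location labels in batch
    return cls_loss + lam * loc_loss
```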
This model utilizes a spatial transformer (Attention) module to sample the ROIs from the input images. The localization network predicts the center coordinates of the ROIs, and the classification network assigns class scores based on the ROIs. In the Attention module, the sizes of both the ROI and the resized ROI are hyperparameters.
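The sampling step can be expressed as a differentiable affine crop. Below is a minimal PyTorch sketch of such an Attention module, assuming an RGB input; the localization-network layers, the `roi_frac` crop fraction, and the `out_size` resolution are illustrative placeholders for the ROI-size and resized-ROI hyperparameters mentioned above, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionROISampler(nn.Module):
    """Spatial-transformer-style ROI sampler: predict a center, crop, resize."""

    def __init__(self, roi_frac=(0.25, 0.25), out_size=(96, 96)):
        super().__init__()
        self.roi_frac = roi_frac  # ROI size as a fraction of the input image
        self.out_size = out_size  # resolution of the resized ROI
        # Localization network: predicts the ROI center (cx, cy) in [-1, 1].
        self.loc_net = nn.Sequential(
            nn.Conv2d(3, 8, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 2), nn.Tanh(),
        )

    def forward(self, x):
        center = self.loc_net(x)  # (B, 2) predicted ROI centers
        sx, sy = self.roi_frac
        # Affine matrix with a fixed crop scale and a learned translation.
        theta = torch.zeros(x.size(0), 2, 3, device=x.device)
        theta[:, 0, 0] = sx
        theta[:, 1, 1] = sy
        theta[:, :, 2] = center
        grid = F.affine_grid(
            theta, (x.size(0), x.size(1), *self.out_size), align_corners=False
        )
        # Differentiable crop-and-resize: the classification network then
        # runs on this small ROI, which raises the effective OAR.
        return F.grid_sample(x, grid, align_corners=False)
```

Because `grid_sample` is differentiable, gradients from the classification loss flow back into the localization network, which is what permits end-to-end training without manual cropping.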
This dataset was created as part of our research. It is publicly available to facilitate reproducibility and further advancements in the field.
➡️ Download dataset
- Images:
  - train data: from device 0
  - test data: from device 1
- Labels:
  - YOLO format labels corresponding to each image.
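For reference, each label file follows the standard YOLO convention: one line per object, with a class index followed by the normalized box center and size. A minimal parser (the function name is ours, and the example values are made up):

```python
def parse_yolo_label(line: str):
    """Parse 'class x_center y_center width height' (coords in [0, 1])."""
    cls, cx, cy, w, h = line.split()
    return int(cls), float(cx), float(cy), float(w), float(h)

# Example (hypothetical values):
# parse_yolo_label("0 0.512 0.431 0.046 0.052")
# -> (0, 0.512, 0.431, 0.046, 0.052)
```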
The A-CNN model can be effectively trained end-to-end with minimal data labeling compared to object detection methods. Experimental results show that the proposed A-CNN model achieves a classification accuracy of 99.92% and an inference speed of 62.9 fps on the NVIDIA Jetson Nano platform, outperforming the smallest variants of state-of-the-art object detectors (YOLOv5, YOLOv7, YOLOv8, YOLOv9, and YOLOv10) in both accuracy and latency. Notably, our model is 3.8× faster than the fastest YOLO model (61 ms vs. 15.9 ms at 640×480 in the table below), underscoring its efficiency in real-time applications. These findings highlight the potential of the A-CNN model as an accurate and practical solution for small object classification.
Comparison of the A-CNN with YOLO Object Detection Models
| Model | Params (M) | FLOPs<sup>f</sup> (G) | Input (resized) | Accuracy (%) | Latency<sup>a</sup> (ms) |
|---|---|---|---|---|---|
| YOLOv5-Nano | 1.76 | 1.55 | 640×480 | 99.67 | 61 |
| | | 0.67 | 416×312 | 97.92 | 61 |
| | | 0.22 | 224×168 | 82.83 | 55 |
| YOLOv7-Tiny | 6.02 | 4.95 | 640×480 | 99.83 | 135 |
| | | 2.15 | 416×312 | 98.42 | 135 |
| | | 0.69 | 224×168 | 95.33 | 130 |
| YOLOv8-Nano | 3.01 | 3.01 | 640×480 | 98.95 | 72 |
| | | 1.33 | 416×312 | 95.58 | 44 |
| | | 0.43 | 224×168 | 64.00 | 44 |
| YOLOv9-Tiny | 2.01 | 2.94 | 640×480 | 99.50 | 112 |
| | | 1.28 | 416×312 | 99.08 | 102 |
| | | 0.41 | 224×168 | 77.08 | 95 |
| YOLOv10-Nano | 2.71 | 3.15 | 640×480 | 99.75 | 84 |
| | | 1.36 | 416×312 | 99.33 | 59 |
| | | 0.44 | 224×168 | 75.08 | 57 |
| A-CNN (base) | 0.71 | 2.22 | 640×480 | 99.75 | 14.9 (6.2) |
| A-CNN (best) | 0.70 | 1.00 | 640×480 | 99.82 | 15.5 (6.3) |
| A-CNN (opt) | 0.68 | 0.38 | 640×480 | 99.92 | 15.9 (6.6) |
Notes:
- <sup>f</sup> FLOPs of the model's forward pass, excluding pre- and post-processing for the YOLO models.
- <sup>a</sup> End-to-end inference time measured on the NVIDIA Jetson Nano, including pre- and post-processing. Values in parentheses are inference times using TensorRT with FP32 precision.
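The frame rates quoted in the text follow directly from these end-to-end latencies (fps = 1000 / latency in ms), which is also why the key-innovations list quotes 67.1 fps (the base model) while the results paragraph quotes 62.9 fps (the opt model):

```python
# Reproduce the quoted fps figures from the table's Jetson Nano latencies.
for name, ms in [("A-CNN (base)", 14.9), ("A-CNN (opt)", 15.9)]:
    print(f"{name}: {1000 / ms:.1f} fps")
# A-CNN (base): 67.1 fps
# A-CNN (opt): 62.9 fps
```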
If you use this dataset, please cite the following paper:
Hyun-Yong Kim, Taek-Joon Yi, and Jong-Yun Lee
An Attention-based Convolutional Neural Network with Spatial Transformer Module for Automated Optical Inspection of Small Objects
IEEE Transactions on Instrumentation and Measurement, 2025.
DOI: 10.1109/TIM.2025.3548240