This repository provides an optimized version of fused-ssim that achieves better performance while keeping the same interface. The performance comparison below was measured using `tests/test.py`.

| Implementation | Forward Time (ms) | Backward Time (ms) | Inference Time (ms) |
|---|---|---|---|
| Optimized Fused-SSIM | 2.49 | 2.68 | 1.43 |
| Original Fused-SSIM | 3.66 | 3.52 | 2.59 |
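
The timings above can be turned into speedup ratios with a few lines of arithmetic. The sketch below only divides the numbers from the table; the dictionary layout and the `speedups` helper are illustrative names, not part of the repository.

```python
# Timings in milliseconds, copied from the table above as (optimized, original).
timings_ms = {
    "forward": (2.49, 3.66),
    "backward": (2.68, 3.52),
    "inference": (1.43, 2.59),
}


def speedups(timings):
    # Speedup = original time / optimized time, rounded to two decimals.
    return {name: round(orig / opt, 2) for name, (opt, orig) in timings.items()}


print(speedups(timings_ms))
# {'forward': 1.47, 'backward': 1.31, 'inference': 1.81}
```
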
- The original implementation computes multiple statistics (mean, variance, covariance) in separate passes, requiring multiple memory accesses.
- Optimization: The computations are fused into a single pass, reducing redundant memory accesses by using shared memory for intermediate results.
- The Gaussian filter is symmetric (e.g., `G_00 = G_10`), but the original implementation does not take advantage of this.
- Optimization: Pairs symmetric elements, nearly halving the number of multiplications per filter pass (from 11 to 6).
- The original implementation uses `#define` macros for Gaussian coefficients, increasing register pressure.
- Optimization: Stores coefficients in CUDA constant memory (`__constant__ float cGauss[11]`) for:
  - Faster memory access.
  - Reduced register pressure.
  - Improved scalability across GPUs.
- The original implementation uses 32x32 thread blocks.
- Optimization: Uses 16x16 blocks, which improves GPU occupancy by reducing resource usage per block.
- The original implementation loads `img1` and `img2` into separate shared memory arrays.
- Optimization:
  - Uses a 3D shared memory array (`sTile[SHARED_Y][SHARED_X][2]`) to load both images simultaneously, reducing global memory accesses.
  - Stores intermediate sums in a unified shared memory array, improving data locality and minimizing memory fragmentation.
- The original implementation performs separate convolution operations for each statistic.
- Optimization: Uses a single convolution pass for all statistics, reducing redundant computations.
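
Two of the ideas in the list above, pairing symmetric filter taps and accumulating all statistics in one pass, can be illustrated on the CPU. The following is a minimal 1D sketch in plain Python, not the repository's CUDA kernel: the function names, the `sigma=1.5` value, and the 1D layout are assumptions made for illustration. Each weighted sum uses 6 weight multiplications per output (one center tap plus five pairs) instead of 11.

```python
import math


def gaussian_taps(size=11, sigma=1.5):
    # Normalized 1D Gaussian weights (size and sigma are assumed values).
    half = size // 2
    raw = [math.exp(-((i - half) ** 2) / (2.0 * sigma * sigma)) for i in range(size)]
    total = sum(raw)
    return [v / total for v in raw]


def fused_window_stats(x, y, taps):
    # One pass per output position: all five weighted sums are accumulated
    # together, and each symmetric tap pair shares a single weight multiply.
    half = len(taps) // 2
    results = []
    for c in range(half, len(x) - half):
        w = taps[half]  # center tap has no partner
        sx, sy = w * x[c], w * y[c]
        sxx, syy, sxy = w * x[c] * x[c], w * y[c] * y[c], w * x[c] * y[c]
        for k in range(1, half + 1):
            w = taps[half + k]  # equals taps[half - k] by symmetry
            xl, xr, yl, yr = x[c - k], x[c + k], y[c - k], y[c + k]
            sx += w * (xl + xr)
            sy += w * (yl + yr)
            sxx += w * (xl * xl + xr * xr)
            syy += w * (yl * yl + yr * yr)
            sxy += w * (xl * yl + xr * yr)
        # (mean_x, mean_y, var_x, var_y, cov_xy) derived from the five sums
        results.append((sx, sy, sxx - sx * sx, syy - sy * sy, sxy - sx * sy))
    return results


def naive_weighted_sum(values, taps, c):
    # Reference: one multiplication per tap, one statistic per call.
    half = len(taps) // 2
    return sum(taps[k] * values[c - half + k] for k in range(len(taps)))
```

Comparing `fused_window_stats` against repeated calls to `naive_weighted_sum` on the same inputs is a simple way to check that the paired, single-pass form produces the same statistics up to floating-point rounding.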
Special thanks to Florian Hahlbohm for helping me verify that my optimizations don't break anything.
If you use this optimized fused-SSIM implementation for your research, please cite both the original paper and this implementation:
@inproceedings{optimized-fused-ssim,
author = {Janusch Patas},
title = {Optimized Fused-SSIM},
year = {2025},
url = {https://github.com/MrNeRF/optimized-fused-ssim},
}
@inproceedings{taming3dgs,
author = {Mallick, Saswat Subhajyoti and Goel, Rahul and Kerbl, Bernhard and Steinberger, Markus and Carrasco, Francisco Vicente and De La Torre, Fernando},
title = {Taming 3DGS: High-Quality Radiance Fields with Limited Resources},
year = {2024},
url = {https://doi.org/10.1145/3680528.3687694},
doi = {10.1145/3680528.3687694},
booktitle = {SIGGRAPH Asia 2024 Conference Papers},
series = {SA '24}
}