Transpose operator is slow when source stride is a power of 2 #66

robertknight · 2024-03-26T12:28:21Z

While analyzing the performance of the Transpose operator, I noticed that it is significantly slower when the copy done in contiguous_data traverses the source tensor with a power-of-2 stride for most iterations.

Examples where this happens:

Transposing a square matrix whose side length N is a power of 2
Moving a dimension inwards, where all the dims to its right have power-of-2 sizes (eg. [4, 1500, 8, 64] => [4, 8, 64, 1500])
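To make the second example concrete, here is a minimal sketch that computes the strides involved. It assumes a contiguous row-major source and that the move corresponds to the permutation [0, 2, 3, 1]; the helper names `contiguous_strides` and `permute` are hypothetical and not part of rten's API.

```rust
/// Row-major (contiguous) strides for `shape`, measured in elements.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    // Walk from the second-innermost axis outwards, accumulating sizes.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Reorder `values` so that output axis `i` takes input axis `perm[i]`.
/// Applied to both the shape and the strides, this gives a transposed view.
fn permute(values: &[usize], perm: &[usize]) -> Vec<usize> {
    perm.iter().map(|&axis| values[axis]).collect()
}
```

For shape [4, 1500, 8, 64] the source strides are [768000, 512, 64, 1]. After permuting with [0, 2, 3, 1], the view has shape [4, 8, 64, 1500] and strides [768000, 64, 1, 512], so the innermost loop (1500 iterations) steps through the source 512 elements at a time.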

To reproduce:

cargo test --release -p rten bench_transpose -- --nocapture --ignored

Note the "overhead" factors for the 512x512 and 1024x1024 matrices. Modify the benchmark to use non-power-of-2 sizes (eg. 1023x1023) and compare again. The reported overhead, relative to just copying the data, becomes much lower.

Power-of-2 dimension sizes are common in real models (eg. see examples used in bench_transpose), so this happens often.

The current transpose implementation is very naive. It creates a view of the input, permutes the strides and then iterates over the indices using a nested loop, copying source elements into a contiguous destination.
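As a rough illustration of that approach in the 2D case, the sketch below shows a naive nested-loop transpose next to a blocked variant. This is not rten's code: the function names and the `TILE` value are assumptions made for the example, and the blocked version only illustrates the general blocking / tiling idea named in the title of #78.

```rust
/// Hypothetical tile edge length, in elements; a real value would be tuned.
const TILE: usize = 32;

/// Naive transpose of a row-major `rows x cols` matrix into a contiguous
/// `cols x rows` destination. Writes are sequential, but every inner
/// iteration reads the source `cols` elements further along.
fn transpose_naive(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0.0; rows * cols];
    for c in 0..cols {
        for r in 0..rows {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    dst
}

/// Blocked transpose: same result, but the index space is visited one
/// `TILE x TILE` tile at a time, so each tile touches only a small set of
/// source rows before moving on.
fn transpose_tiled(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0.0; rows * cols];
    for c0 in (0..cols).step_by(TILE) {
        for r0 in (0..rows).step_by(TILE) {
            // Clamp tile bounds so sizes that are not multiples of TILE work.
            let c_end = (c0 + TILE).min(cols);
            let r_end = (r0 + TILE).min(rows);
            for c in c0..c_end {
                for r in r0..r_end {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
    dst
}
```

Both functions produce identical output, so the blocked version can be checked against the naive one before comparing timings on sizes such as 1024x1024 and 1023x1023.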



robertknight · 2024-04-01T16:42:32Z

Some useful references on the subject of accelerating N-dimensional tensor transposes:

robertknight added a commit that referenced this issue Apr 1, 2024

Add benchmarks / notes about transpose ops and power-of-2 sizes

d6ba0dd

This demonstrates the issue in #66.

robertknight mentioned this issue Apr 1, 2024

Add benchmarks / notes about transpose ops and power-of-2 sizes #75

Merged

robertknight mentioned this issue Apr 5, 2024

Add blocking / tiling for transpose operations #78

Merged

robertknight closed this as completed in #78 Apr 5, 2024
