While analyzing the performance of the `Transpose` operator, I noticed that performance is significantly worse when the copy done in `contiguous_data` traverses the source tensor with a power-of-2 stride for most iterations.
Examples where this happens:
- Transposing a square matrix whose side length N is a power of 2
- Moving a dimension inwards, where all the dims to the right have power-of-2 sizes (`[4, 1500, 8, 64]` => `[4, 8, 64, 1500]`)
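To make the stride in these examples concrete, here is a minimal sketch that computes the strides used to walk the source after a permutation. It assumes a row-major (contiguous) input; `contiguous_strides` and `permuted_strides` are hypothetical helper names for illustration, not rten APIs.

```rust
/// Row-major (contiguous) strides for `shape`, measured in elements.
/// Hypothetical helper for illustration; not part of rten.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    // Walk from the second-innermost dim outwards, accumulating sizes.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Strides used to step through the source when output axis `k`
/// corresponds to input axis `perm[k]`.
fn permuted_strides(shape: &[usize], perm: &[usize]) -> Vec<usize> {
    let strides = contiguous_strides(shape);
    perm.iter().map(|&axis| strides[axis]).collect()
}
```

For `[4, 1500, 8, 64]` => `[4, 8, 64, 1500]` the permutation is `[0, 2, 3, 1]`, so the innermost output dimension steps through the source with stride 8 * 64 = 512. For a 1024x1024 transpose the innermost stride is 1024, whereas for 1023x1023 it is 1023.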
To reproduce:
cargo test --release -p rten bench_transpose -- --nocapture --ignored
Note the "overhead" factors for the 512x512 and 1024x1024 matrices. Modify the benchmark to use non-power-of-2 sizes (e.g. 1023x1023) and compare again. The reported overhead, relative to just copying the data, becomes much lower.
Power-of-2 dimension sizes are common in real models (e.g. see the examples used in `bench_transpose`), so this happens often.
The current transpose implementation is very naive. It creates a view of the input, permutes the strides, and then iterates over the indices using a nested loop, copying source elements into a contiguous destination.
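The approach described above can be sketched for the 2-D case as follows. This is a simplified illustration, not the actual rten code: `transpose_naive` is a hypothetical function, and it assumes a row-major `f32` input stored in a flat slice.

```rust
/// Naive 2-D transpose: view the source with permuted strides and copy
/// element by element into a contiguous destination.
/// Hypothetical sketch; not the rten implementation.
fn transpose_naive(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);

    // The source has shape [rows, cols] and strides [cols, 1].
    // The transposed view has shape [cols, rows] and strides [1, cols].
    let (out_rows, out_cols) = (cols, rows);
    let (stride_outer, stride_inner) = (1, cols);

    let mut dst = Vec::with_capacity(src.len());
    for i in 0..out_rows {
        for j in 0..out_cols {
            // The inner loop reads the source with stride `cols`, which is
            // a power of 2 whenever the matrix width is.
            dst.push(src[i * stride_outer + j * stride_inner]);
        }
    }
    dst
}
```

Writes to `dst` are sequential, while reads from `src` jump by `cols` elements on every inner iteration; this is the access pattern whose cost the benchmark reports as "overhead".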