Add blocking / tiling for transpose operations #78

Merged
robertknight merged 5 commits into main from transpose-opt-v2 on Apr 5, 2024

Conversation

robertknight (Owner)

Before this PR the transpose operation had a naive implementation which used nested for loops to generate all valid indices and the corresponding source/dest offsets, then copied each value from the permuted/transposed source view to the destination.

This worked quite well except when the strides of the permuted/transposed source were a multiple of the cache line size, especially powers of 2. In those cases cache conflicts made performance much worse: transposing a 1024x1024 matrix was ~8x slower than copying the data (excluding the time taken for memory allocation), whereas for a 1023x1023 matrix the transpose was only ~1.5x slower.

This PR fixes the issue by adding a simple implementation of blocking and tiling for transpose operations. Initially this is only enabled when the last stride of the transposed tensor is a power of 2 greater than 32. The layout of the permuted tensor is also simplified to minimize the number of dimensions before performing the copy, which maximizes the iteration count of each loop.
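
As a rough illustration of the blocked approach (a minimal 2D sketch; the tile size, function name and signature are assumptions rather than the actual rten implementation):

```rust
/// Copy a transposed 2D source view into a contiguous destination buffer,
/// processing fixed-size tiles so that the strided reads from `src` and the
/// sequential writes to `dst` stay within a cache-friendly working set.
///
/// `dst` has shape [rows, cols]; `src` has shape [cols, rows] with the given
/// row stride, so `dst[r][c] = src[c][r]`.
fn blocked_transpose_copy(
    dst: &mut [f32],
    src: &[f32],
    rows: usize,
    cols: usize,
    src_row_stride: usize,
) {
    const TILE: usize = 64; // Tile size chosen for illustration only.
    for row_start in (0..rows).step_by(TILE) {
        for col_start in (0..cols).step_by(TILE) {
            let row_end = (row_start + TILE).min(rows);
            let col_end = (col_start + TILE).min(cols);
            for r in row_start..row_end {
                for c in col_start..col_end {
                    // Within a tile the strided source accesses touch only a
                    // small set of cache lines, which avoids the conflict
                    // misses seen with power-of-2 strides.
                    dst[r * cols + c] = src[c * src_row_stride + r];
                }
            }
        }
    }
}
```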

Some numbers from the `bench_transpose` benchmark comparing copy vs naive/reference transpose vs blocked transpose:

| Shape | Perm | Copy | Ref transpose | Opt transpose | Overhead |
| --- | --- | --- | --- | --- | --- |
| [512, 512] | [0, 1] | 0.045ms | 0.042ms | 0.039ms | 0 |
| [128, 128] | [1, 0] | 0.002ms | 0.010ms | 0.009ms | 4.532522 |
| [256, 256] | [1, 0] | 0.010ms | 0.065ms | 0.041ms | 2.8640494 |
| [512, 512] | [1, 0] | 0.045ms | 0.393ms | 0.172ms | 1.6292478 |
| [1024, 1024] | [1, 0] | 0.210ms | 2.202ms | 0.703ms | 1.8715324 |
| [127, 127] | [1, 0] | 0.001ms | 0.006ms | 0.006ms | 3.6574275 |
| [255, 255] | [1, 0] | 0.007ms | 0.020ms | 0.019ms | 1.4803344 |
| [513, 513] | [1, 0] | 0.038ms | 0.087ms | 0.085ms | 0.4125464 |
| [1023, 1023] | [1, 0] | 0.230ms | 0.729ms | 0.646ms | 1.6150845 |
| [4, 1500, 8, 64] | [0, 2, 1, 3] | 1.012ms | 1.488ms | 1.455ms | 0.33376133 |
| [4, 8, 1500, 64] | [0, 2, 1, 3] | 1.023ms | 1.023ms | 1.008ms | 0 |
| [1, 1500, 8, 64] | [0, 2, 3, 1] | 0.118ms | 1.097ms | 0.452ms | 2.439674 |
| [1, 288, 8, 64] | [0, 2, 1, 3] | 0.022ms | 0.031ms | 0.029ms | 0.0066960254 |

Fixes #66.

For some tests the default printed output is not the most useful.

Add a method for simplifying layouts by combining consecutive dimensions. This can optimize various algorithms that loop over dimensions by increasing the iteration count of each loop and reducing the loop depth.
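
As a rough illustration of the idea (the function below is a standalone sketch, not the actual rten-tensor API): two consecutive dimensions can be combined whenever the outer dimension's stride equals the inner dimension's stride times its size.

```rust
/// Merge consecutive dimensions whose strides are contiguous with respect to
/// each other, i.e. where `outer_stride == inner_stride * inner_size`.
///
/// Illustrative sketch only; rten-tensor's actual method and signature may differ.
fn merge_dims(shape: &[usize], strides: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut merged_shape: Vec<usize> = Vec::new();
    let mut merged_strides: Vec<usize> = Vec::new();
    for (&size, &stride) in shape.iter().zip(strides) {
        match (merged_shape.last_mut(), merged_strides.last_mut()) {
            (Some(prev_size), Some(prev_stride)) if *prev_stride == stride * size => {
                // The previous dimension steps over exactly `size` elements of
                // this one, so the two loops can be collapsed into one.
                *prev_size *= size;
                *prev_stride = stride;
            }
            _ => {
                merged_shape.push(size);
                merged_strides.push(stride);
            }
        }
    }
    (merged_shape, merged_strides)
}
```

For example, a contiguous [2, 3, 4] tensor with strides [12, 4, 1] collapses to shape [24] with stride [1], turning three nested loops into a single long loop.
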
Add an alternative strategy for copying elements from a transposed tensor into a
contiguous buffer, using blocking, and enable it to be used via
`Tensor::copy_from`.

The existing naive copy implementation performs well except when the strides of the source view lead to a significant rate of cache conflicts. This typically happens when the last stride is a multiple of the cache line size, and especially when it is a power of 2. To handle this, detect that case and switch to an alternative copying procedure which uses blocking and tiling.

Using the `bench_transpose` benchmark in `src/ops/layout.rs`, this avoids the significant increase in overhead, relative to a simple memory copy, when the source stride is a power of 2.
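
A hedged sketch of the kind of check this describes (the function name is hypothetical; the threshold mirrors the "power of 2 greater than 32" condition from the PR description). When this returns false, the existing naive nested-loop copy is used.

```rust
/// Decide whether the blocked copy path should be used, based on the last
/// (innermost) stride of the permuted source view.
///
/// Illustrative only; the actual rten code may structure this differently.
fn use_blocked_copy(last_stride: usize) -> bool {
    last_stride > 32 && last_stride.is_power_of_two()
}
```
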
 - Add a few more test cases

 - Avoid memory allocations in the benchmark. This ensures that we are
   benchmarking just the copy / transpose and not memory allocation.

 - Check the outputs once after the final iteration of each sub-benchmark. This
   makes it quicker to spot when an optimization produces wrong results.

 - Add reference transpose implementation comparison to benchmark

   The reference transpose roughly matches the non-blocked copy code path
   in rten-tensor. It thus shows the effect of the blocking copy optimizations.
robertknight merged commit aa1e1ce into main on Apr 5, 2024 (2 checks passed)
robertknight deleted the transpose-opt-v2 branch on April 5, 2024 at 22:50