Add blocking / tiling for transpose operations #78

Merged
robertknight merged 5 commits into main from transpose-opt-v2 on Apr 5, 2024

Conversation

robertknight (Owner)

Before this PR the transpose operation had a naive implementation which used nested for loops to generate all valid indices and the corresponding source/dest offsets, then copied each value from the permuted/transposed source view to the destination.

This worked quite well except when the strides of the permuted/transposed source were a multiple of the cache line size, especially powers of 2. In those cases cache conflicts made performance much worse: transposing a 1024x1024 matrix was ~8x slower than copying the data (excluding the time taken for memory allocation), whereas for a 1023x1023 matrix the transpose was only ~1.5x slower.

This PR fixes the issue by adding a simple implementation of blocking and tiling for transpose operations. Initially this is only enabled when the last stride of the transposed tensor is a power of 2 greater than 32. The layout of the permuted tensor is also simplified to minimize the number of dimensions before performing the copy, which maximizes the iteration count of each loop.
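
As a rough illustration of the blocked approach (a minimal 2D sketch; the tile size, function name and signature are assumptions rather than the actual rten implementation):

```rust
/// Copy a transposed 2D source view into a contiguous destination buffer,
/// processing fixed-size tiles so that the strided reads from `src` and the
/// sequential writes to `dst` stay within a cache-friendly working set.
///
/// `dst` has shape [rows, cols]; `src` has shape [cols, rows] with the given
/// row stride, so `dst[r][c] = src[c][r]`.
fn blocked_transpose_copy(
    dst: &mut [f32],
    src: &[f32],
    rows: usize,
    cols: usize,
    src_row_stride: usize,
) {
    const TILE: usize = 64; // Tile size chosen for illustration only.
    for row_start in (0..rows).step_by(TILE) {
        for col_start in (0..cols).step_by(TILE) {
            let row_end = (row_start + TILE).min(rows);
            let col_end = (col_start + TILE).min(cols);
            for r in row_start..row_end {
                for c in col_start..col_end {
                    // Within a tile the strided source accesses touch only a
                    // small set of cache lines, which avoids the conflict
                    // misses seen with power-of-2 strides.
                    dst[r * cols + c] = src[c * src_row_stride + r];
                }
            }
        }
    }
}
```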

Some numbers from the `bench_transpose` benchmark comparing copy vs naive/reference transpose vs blocked transpose:

| Shape | Perm | Copy | Ref transpose | Opt transpose | Overhead |
| --- | --- | --- | --- | --- | --- |
| [512, 512] | [0, 1] | 0.045ms | 0.042ms | 0.039ms | 0 |
| [128, 128] | [1, 0] | 0.002ms | 0.010ms | 0.009ms | 4.532522 |
| [256, 256] | [1, 0] | 0.010ms | 0.065ms | 0.041ms | 2.8640494 |
| [512, 512] | [1, 0] | 0.045ms | 0.393ms | 0.172ms | 1.6292478 |
| [1024, 1024] | [1, 0] | 0.210ms | 2.202ms | 0.703ms | 1.8715324 |
| [127, 127] | [1, 0] | 0.001ms | 0.006ms | 0.006ms | 3.6574275 |
| [255, 255] | [1, 0] | 0.007ms | 0.020ms | 0.019ms | 1.4803344 |
| [513, 513] | [1, 0] | 0.038ms | 0.087ms | 0.085ms | 0.4125464 |
| [1023, 1023] | [1, 0] | 0.230ms | 0.729ms | 0.646ms | 1.6150845 |
| [4, 1500, 8, 64] | [0, 2, 1, 3] | 1.012ms | 1.488ms | 1.455ms | 0.33376133 |
| [4, 8, 1500, 64] | [0, 2, 1, 3] | 1.023ms | 1.023ms | 1.008ms | 0 |
| [1, 1500, 8, 64] | [0, 2, 3, 1] | 0.118ms | 1.097ms | 0.452ms | 2.439674 |
| [1, 288, 8, 64] | [0, 2, 1, 3] | 0.022ms | 0.031ms | 0.029ms | 0.0066960254 |

Fixes #66.

For some tests the default printed output is not the most useful.

Add a method for simplifying layouts by combining consecutive dimensions. This can optimize various algorithms that loop over dimensions by increasing the iteration count of each loop and reducing the loop depth.
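
As a rough illustration of the idea (the function below is a standalone sketch, not the actual rten-tensor API): two consecutive dimensions can be combined whenever the outer dimension's stride equals the inner dimension's stride times its size.

```rust
/// Merge consecutive dimensions whose strides are contiguous with respect to
/// each other, i.e. where `outer_stride == inner_stride * inner_size`.
///
/// Illustrative sketch only; rten-tensor's actual method and signature may differ.
fn merge_dims(shape: &[usize], strides: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut merged_shape: Vec<usize> = Vec::new();
    let mut merged_strides: Vec<usize> = Vec::new();
    for (&size, &stride) in shape.iter().zip(strides) {
        match (merged_shape.last_mut(), merged_strides.last_mut()) {
            (Some(prev_size), Some(prev_stride)) if *prev_stride == stride * size => {
                // The previous dimension steps over exactly `size` elements of
                // this one, so the two loops can be collapsed into one.
                *prev_size *= size;
                *prev_stride = stride;
            }
            _ => {
                merged_shape.push(size);
                merged_strides.push(stride);
            }
        }
    }
    (merged_shape, merged_strides)
}
```

For example, a contiguous [2, 3, 4] tensor with strides [12, 4, 1] collapses to shape [24] with stride [1], turning three nested loops into a single long loop.
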
Add an alternative strategy for copying elements from a transposed tensor into a
contiguous buffer, using blocking, and enable it to be used via
`Tensor::copy_from`.

The existing naive copy implementation performs well except when the strides of the source view lead to a significant rate of cache conflicts. This typically happens when the last stride is a multiple of the cache line size, and especially when it is a power of 2. To handle this, detect that case and switch to an alternative copying procedure which uses blocking and tiling.

Using the `bench_transpose` benchmark in `src/ops/layout.rs`, this avoids the significant increase in overhead, relative to a simple memory copy, when the source stride is a power of 2.
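
A hedged sketch of the kind of check this describes (the function name is hypothetical; the threshold mirrors the "power of 2 greater than 32" condition from the PR description). When this returns false, the existing naive nested-loop copy is used.

```rust
/// Decide whether the blocked copy path should be used, based on the last
/// (innermost) stride of the permuted source view.
///
/// Illustrative only; the actual rten code may structure this differently.
fn use_blocked_copy(last_stride: usize) -> bool {
    last_stride > 32 && last_stride.is_power_of_two()
}
```
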
 - Add a few more test cases

 - Avoid memory allocations in the benchmark. This ensures that we are
   benchmarking just the copy / transpose and not memory allocation.

 - Check the outputs once after the final iteration of each sub-benchmark. This
   makes it quicker to spot when an optimization produces wrong results.

 - Add reference transpose implementation comparison to benchmark

   The reference transpose roughly matches the non-blocked copy code path
   in rten-tensor. It thus shows the effect of the blocking copy optimizations.
robertknight merged commit aa1e1ce into main on Apr 5, 2024 (2 checks passed)
robertknight deleted the transpose-opt-v2 branch on April 5, 2024 at 22:50