Add blocking / tiling for transpose operations #78
Before this PR the transpose operation had a naive implementation which used nested for loops to generate all valid indices and the source/dest offsets, then copied the value from the permuted/transposed source view to the destination.
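For illustration, the naive approach described above might look roughly like the following for the 2D case. This is a minimal sketch, not the code from this PR: the function name `transpose_naive`, the `f32` element type and the restriction to two dimensions are all assumptions made for the example.

```rust
/// Hypothetical sketch of a naive strided copy (not the PR's actual code).
///
/// `shape` is the [rows, cols] shape of the transposed view and `strides`
/// are the element strides of that view into `src`. The result is a
/// contiguous row-major copy of the view.
fn transpose_naive(src: &[f32], shape: [usize; 2], strides: [usize; 2]) -> Vec<f32> {
    let [rows, cols] = shape;
    let mut dest = Vec::with_capacity(rows * cols);
    for r in 0..rows {
        for c in 0..cols {
            // Source offset of element (r, c) in the transposed view.
            let src_offset = r * strides[0] + c * strides[1];
            // Destination offsets are visited sequentially, so a push suffices.
            dest.push(src[src_offset]);
        }
    }
    dest
}
```

A 2x3 row-major matrix viewed as its 3x2 transpose has shape `[3, 2]` and strides `[1, 3]`, so the inner loop walks the source with a stride equal to the original row length.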
This worked well except when the strides of the permuted/transposed source were a multiple of the cache line size, especially powers of 2. In those cases cache conflicts made performance much worse. Transposing a 1024x1024 matrix was ~8x slower than copying the data (excluding the time taken for memory allocation), whereas transposing a 1023x1023 matrix was only ~1.5x slower.
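The effect of the stride can be illustrated with some simple arithmetic. The sketch below counts how many distinct cache sets are touched when walking down one column of a matrix. The cache geometry is an assumption (64-byte lines and 64 sets, roughly a 32 KiB 8-way L1 data cache), as are the 4-byte element size and the function name; real hardware has several cache levels and is more complicated than this model.

```rust
// Assumed cache geometry, for illustration only.
const ELEM_SIZE: usize = 4; // bytes per element (e.g. f32)
const LINE_SIZE: usize = 64; // bytes per cache line
const NUM_SETS: usize = 64; // number of sets in the cache

/// Hypothetical helper: count the distinct cache sets touched by `steps`
/// accesses that are `stride_elems` elements apart.
fn distinct_cache_sets(stride_elems: usize, steps: usize) -> usize {
    let mut seen = [false; NUM_SETS];
    for i in 0..steps {
        let byte_offset = i * stride_elems * ELEM_SIZE;
        // Which set the cache line containing this address maps to.
        let set = (byte_offset / LINE_SIZE) % NUM_SETS;
        seen[set] = true;
    }
    seen.iter().filter(|&&s| s).count()
}
```

Under these assumptions, `distinct_cache_sets(1024, 1024)` is 1, because every access lands in the same set and the lines evict each other, while `distinct_cache_sets(1023, 1024)` is 64, because the accesses spread across every set.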
This PR fixes the issue by adding a simple implementation of blocking and tiling for transpose operations. Initially this is only enabled when the last stride of the transposed tensor is a power of 2 greater than 32. The layout of the permuted tensor is also simplified to minimize the number of dimensions before performing the copy, which maximizes the iteration count of each loop.
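The three pieces described above (the enabling condition, the layout simplification and the blocked copy) can be sketched as follows. This is an illustrative 2D sketch under stated assumptions rather than the code in this PR: all function names are hypothetical, the element type is assumed to be `f32`, and the block size of 32 is a guess, not a value taken from the PR.

```rust
/// Assumed block edge length, in elements. Not taken from the PR.
const BLOCK: usize = 32;

/// Enabling condition as described: the last stride is a power of 2 greater than 32.
fn should_use_blocked(last_stride: usize) -> bool {
    last_stride > 32 && last_stride.is_power_of_two()
}

/// Merge adjacent dimensions that can be walked as one, i.e. where the outer
/// stride equals the inner size times the inner stride.
fn merge_dims(shape: &[usize], strides: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut out_shape: Vec<usize> = Vec::new();
    let mut out_strides: Vec<usize> = Vec::new();
    for (&size, &stride) in shape.iter().zip(strides) {
        if let Some(last) = out_shape.len().checked_sub(1) {
            if out_strides[last] == size * stride {
                // Fold this dimension into the previous one.
                out_shape[last] *= size;
                out_strides[last] = stride;
                continue;
            }
        }
        out_shape.push(size);
        out_strides.push(stride);
    }
    (out_shape, out_strides)
}

/// Copy a 2D strided view into a contiguous row-major buffer one
/// BLOCK x BLOCK tile at a time, so each tile's source lines stay cached.
fn transpose_blocked(src: &[f32], shape: [usize; 2], strides: [usize; 2]) -> Vec<f32> {
    let [rows, cols] = shape;
    let mut dest = vec![0.0; rows * cols];
    for r0 in (0..rows).step_by(BLOCK) {
        let r1 = (r0 + BLOCK).min(rows);
        for c0 in (0..cols).step_by(BLOCK) {
            let c1 = (c0 + BLOCK).min(cols);
            // Copy one tile; edge tiles may be smaller than BLOCK.
            for r in r0..r1 {
                for c in c0..c1 {
                    dest[r * cols + c] = src[r * strides[0] + c * strides[1]];
                }
            }
        }
    }
    dest
}
```

For example, a contiguous `[2, 3, 4]` tensor permuted to shape `[4, 2, 3]` with strides `[1, 12, 4]` merges to shape `[4, 6]` with strides `[1, 4]`, which reduces the 3D permute to a single 2D transpose with longer loops.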
Some numbers from the `bench_transpose` benchmark comparing copy vs naive/reference transpose vs blocked transpose:

Fixes #66.