Expand Qgemm UDOT kernel to 8x8 block (#8562)
Create a new M8 loop that processes A[8x8] and B[8x8] per iteration. Avoid saving registers on paths that do not need them. Adjust the M2 and M1 loops to use more registers, relaxing the loop-carried dependencies. Nearly 7% improvement observed on Surface Pro X 2 with the ssd_mobilenet_v2_300 model. About 4.5% improvement on resnet50 on Surface Pro X 2.
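
For readers unfamiliar with the instruction, the following is a scalar C sketch of the arithmetic involved, not the assembly from this commit. It assumes one UDOT lane adds the dot product of four unsigned 8-bit pairs into a 32-bit accumulator, and that an 8x8 step covers 8 rows of A, 8 columns of B, and 8 depth values. The names `udot4`, `qgemm_block_8x8`, and `dot_two_accumulators`, as well as the flat row-major layout with B stored transposed, are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Scalar stand-in for one 32-bit lane of UDOT:
   acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
static uint32_t udot4(uint32_t acc, const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < 4; ++i)
        acc += (uint32_t)a[i] * (uint32_t)b[i];
    return acc;
}

/* One 8x8 step (assumed layout, not the packed layout of the real kernel):
   a  holds 8 rows of 8 depth values, row-major;
   bt holds 8 columns of 8 depth values, i.e. B transposed;
   c  holds the 8x8 block of 32-bit accumulators, row-major. */
static void qgemm_block_8x8(const uint8_t *a, const uint8_t *bt, uint32_t *c)
{
    for (int m = 0; m < 8; ++m) {
        for (int n = 0; n < 8; ++n) {
            uint32_t acc = c[m * 8 + n];
            /* Depth 8 is covered by two 4-wide dot products. */
            acc = udot4(acc, a + m * 8, bt + n * 8);
            acc = udot4(acc, a + m * 8 + 4, bt + n * 8 + 4);
            c[m * 8 + n] = acc;
        }
    }
}

/* Single-row dot product with two independent accumulators, so that
   consecutive dot products do not wait on each other's result.
   k must be a multiple of 8. */
static uint32_t dot_two_accumulators(const uint8_t *a, const uint8_t *b, int k)
{
    uint32_t acc0 = 0, acc1 = 0;
    for (int i = 0; i < k; i += 8) {
        acc0 = udot4(acc0, a + i, b + i);
        acc1 = udot4(acc1, a + i + 4, b + i + 4);
    }
    return acc0 + acc1;
}
```

The second function illustrates the general idea behind using more registers in the small-M loops: splitting one accumulation chain into two shortens the dependency chain per iteration, at the cost of one extra add at the end. Whether the commit splits the chains in exactly this way is an assumption.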