Add int8 fallback kernel for older Arm CPUs that don't support UDOT #544

Merged 1 commit on Jan 23, 2025

robertknight
Owner

@robertknight robertknight commented Jan 23, 2025

This uses a sequence of instructions to emulate UDOT using widening multiplies and pairwise adds.
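
As a rough illustration of the idea, here is a portable scalar model of one way to emulate UDOT on a 16-byte vector. It is a sketch under stated assumptions, not the kernel from this PR: the function names (`widening_mul`, `pairwise_add_long`, `pairwise_add`, `udot_emulated`, `udot_reference`) are hypothetical, both operands are treated as unsigned, and the mapping to UMULL/UMULL2, UADDLP and ADDP is my assumption about a plausible instruction sequence.

```rust
/// Models UMULL / UMULL2: multiply 8 pairs of u8 lanes into u16 lanes.
/// 255 * 255 = 65025 fits in a u16, so this cannot overflow.
fn widening_mul(a: [u8; 8], b: [u8; 8]) -> [u16; 8] {
    std::array::from_fn(|i| a[i] as u16 * b[i] as u16)
}

/// Models UADDLP: add adjacent u16 lanes, widening each sum to u32.
fn pairwise_add_long(x: [u16; 8]) -> [u32; 4] {
    std::array::from_fn(|i| x[2 * i] as u32 + x[2 * i + 1] as u32)
}

/// Models ADDP on u32 lanes: adjacent pairs of `lo` fill the low half of
/// the result, adjacent pairs of `hi` fill the high half.
fn pairwise_add(lo: [u32; 4], hi: [u32; 4]) -> [u32; 4] {
    [lo[0] + lo[1], lo[2] + lo[3], hi[0] + hi[1], hi[2] + hi[3]]
}

/// Emulated UDOT: acc[i] += a[4i..4i+4] . b[4i..4i+4] for each of 4 lanes.
fn udot_emulated(acc: [u32; 4], a: [u8; 16], b: [u8; 16]) -> [u32; 4] {
    let a_lo: [u8; 8] = a[..8].try_into().unwrap();
    let a_hi: [u8; 8] = a[8..].try_into().unwrap();
    let b_lo: [u8; 8] = b[..8].try_into().unwrap();
    let b_hi: [u8; 8] = b[8..].try_into().unwrap();

    // Two widening multiplies cover all 16 byte products.
    let prod_lo = widening_mul(a_lo, b_lo);
    let prod_hi = widening_mul(a_hi, b_hi);

    // First pairwise add: 2 products per u32 lane.
    let sum2_lo = pairwise_add_long(prod_lo);
    let sum2_hi = pairwise_add_long(prod_hi);

    // Second pairwise add: 4 products per u32 lane, in UDOT lane order.
    let sum4 = pairwise_add(sum2_lo, sum2_hi);

    // Accumulate with wrapping arithmetic, as vector integer adds do.
    std::array::from_fn(|i| acc[i].wrapping_add(sum4[i]))
}

/// Direct definition of the UDOT result, used to check the emulation.
fn udot_reference(acc: [u32; 4], a: [u8; 16], b: [u8; 16]) -> [u32; 4] {
    std::array::from_fn(|i| {
        let dot: u32 = (0..4)
            .map(|j| a[4 * i + j] as u32 * b[4 * i + j] as u32)
            .sum();
        acc[i].wrapping_add(dot)
    })
}
```

The emulation needs two multiplies, two widening pairwise adds, one pairwise add and one accumulate where a CPU with the dotprod extension needs a single UDOT, which gives a feel for the extra instruction count per step.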

In the process, the helpers for instantiating `GemmExecutor` with different kernels were refactored into a generic trait, which makes it easier to exercise each of the available kernels for a given set of input and output types.

With the changes on this branch, I was able to run the Whisper example on a Raspberry Pi Zero 2 (quad-core Arm Cortex-A53, 512 MB RAM) using the base and tiny models, at ~0.9x real time for base and ~2.1x for tiny.

@robertknight robertknight changed the title Add fallback for older Arm CPUs that don't support UDOT Add int8 fallback kernel for older Arm CPUs that don't support UDOT Jan 23, 2025
@robertknight robertknight force-pushed the udot-fallback branch 2 times, most recently from bd4d33e to 8b7c11f Compare January 23, 2025 22:00
@robertknight robertknight merged commit 6b2fbae into main Jan 23, 2025
2 checks passed
@robertknight robertknight deleted the udot-fallback branch January 23, 2025 22:23
@robertknight
Owner Author

For future reference, I note that other projects (e.g. ORT, XNNPACK) use completely different data layouts for SDOT/UDOT kernels and for kernels that use only Neon instructions. For non-dotprod kernels they use something closer to the outer product approach of the floating point kernels. Emulating the UDOT instruction was a convenient way to get something working acceptably well on these older CPUs, but it is suboptimal. I don't know yet how much of a difference a dedicated layout would make.
