Try improving complex products on Skylake with _mm512_fmaddsub
#159
Comments
I made the change for Skylake. Since there is no f16 version of fmaddsub, I skipped those kernels. https://github.com/MarkReedZ/SimSIMD/tree/fmaddsub The tests pass, but the dot_f32c and dot_f64c kernels may have changed behavior, as the deltas are different. Need to review.
@MarkReedZ, the first error looks huge. Any chance it contains a mistake?
Any chance we need to negate the odd elements of the |
For testing purposes, I’d also recommend setting the number of dimensions to a small value, like 8, to see errors more clearly 🤗
For my testing I have a test.c where I plug in the new function and compare it against the serial version. Claude in Cursor does this for me without typos, which is 🤗. Haven't checked this yet though.
My reading of the fmaddsub is that it multiplies all entries then subtracts them in dst so we should have
I don't think fmaddsub makes sense to use. It's (a*b) plus or minus c. Your comment makes sense as a rewrite of the original code to do the negation once at the end. We don't need to flip a sign bit within the loop if we just FMA the entire vector and then flip the sign bit at the end, before accumulating. I made this change for Skylake and can do the same for Haswell if we think it is significant.
Benchmarked variants: XOR at end vs. original.
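The deferred sign flip described in this comment can be modeled in scalar C. This is only a sketch, not the SimSIMD kernel: the function names are hypothetical, inputs are assumed to be interleaved as [re0, im0, re1, im1, ...], and the two partial sums stand in for the even and odd lanes of a vector accumulator.

```c
#include <stddef.h>

/* Hypothetical scalar model: the imaginary product is negated on every
 * iteration, mirroring an XOR on the sign bit inside the loop. */
void dot_f32c_flip_in_loop(const float *a, const float *b, size_t pairs, float *out) {
    float real = 0.0f, imag = 0.0f;
    for (size_t i = 0; i != pairs; ++i) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        real += ar * br;
        real += ai * -bi; /* per-iteration negation */
        imag += ar * bi + ai * br;
    }
    out[0] = real;
    out[1] = imag;
}

/* Hypothetical scalar model: products are accumulated unsigned and the
 * sign is applied once, after the loop. */
void dot_f32c_flip_at_end(const float *a, const float *b, size_t pairs, float *out) {
    float even = 0.0f, odd = 0.0f, imag = 0.0f;
    for (size_t i = 0; i != pairs; ++i) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        even += ar * br; /* stands in for the even accumulator lanes */
        odd += ai * bi;  /* stands in for the odd lanes, sign not yet flipped */
        imag += ar * bi + ai * br;
    }
    out[0] = even - odd; /* single sign flip at the end */
    out[1] = imag;
}
```

Both variants compute the same sums in exact arithmetic. In floating point the order of additions differs, which is one possible reason for slightly different test deltas.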
@MarkReedZ, 5% is also a win 😄
Looks like a 2-10% improvement across the updated complex products. bf16c_genoa would see a 25% improvement, but the bf16 intrinsics multiply and pairwise accumulate into f32, so we can't move the XOR out of the loop. TODO: review neon, sve, and sapphire.
Intel Skylake and many newer CPU generations with AVX-512 support have _mm512_fmaddsub_* intrinsics, which perform a fused multiply-add with a different sign for elements at different positions. Current complex dot products perform { 2 FMA + XOR + PSHUFB } for every 32 pairs of scalars. It's not going to result in huge performance gains, but using this intrinsic we can remove one XOR and use one less register. This can affect:
simsimd_dot_f64c_skylake
simsimd_vdot_f64c_skylake
simsimd_dot_f32c_skylake
simsimd_vdot_f32c_skylake
simsimd_dot_bf16c_genoa
simsimd_vdot_bf16c_genoa
simsimd_dot_f16c_sapphire
simsimd_vdot_f16c_sapphire
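For reference, the lane pattern of the fmaddsub family can be sketched in scalar C. This is a hypothetical model, not code from the library: the name fmaddsub_model is made up for illustration, and the multiply-add here is not guaranteed to be fused, unlike the real intrinsic.

```c
#include <stddef.h>

/* Hypothetical scalar model of the _mm512_fmaddsub_ps lane pattern:
 * even-indexed lanes compute a*b - c, odd-indexed lanes compute a*b + c. */
void fmaddsub_model(const float *a, const float *b, const float *c, float *dst, size_t n) {
    for (size_t i = 0; i != n; ++i) {
        float product = a[i] * b[i];
        dst[i] = (i % 2 == 0) ? product - c[i] : product + c[i];
    }
}
```

Note that the alternating sign applies to the addend c, not to the product a*b, which matters when deciding whether the intrinsic can replace the XOR in a complex dot product.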