
Try improving complex products on Skylake with _mm512_fmaddsub #159

Closed
ashvardanian opened this issue Sep 4, 2024 · 8 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@ashvardanian
Owner

Intel Skylake and many newer CPU generations with AVX-512 support have _mm512_fmaddsub_* intrinsics that perform a fused multiply-add with different signs for elements at even and odd positions. The current complex dot products perform { 2 FMA + XOR + PSHUFB } for every 32 pairs of scalars. It won't result in huge performance gains, but using this intrinsic we could drop one XOR and use one less register.

This can affect:

  • simsimd_dot_f64c_skylake
  • simsimd_vdot_f64c_skylake
  • simsimd_dot_f32c_skylake
  • simsimd_vdot_f32c_skylake
  • simsimd_dot_bf16c_genoa
  • simsimd_vdot_bf16c_genoa
  • simsimd_dot_f16c_sapphire
  • simsimd_vdot_f16c_sapphire
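
For reference, a minimal sketch of the current real-part pattern the issue refers to, with hypothetical names (not the actual SimSIMD macro): the XOR flips the sign of the imaginary lanes of b on every iteration, so a single FMA accumulates { ar*br, -ai*bi, ... }; the imaginary part uses a second FMA with a shuffled copy of b instead of the XOR.

    #include <immintrin.h>

    // Hypothetical sketch of one 512-bit step of the f32 complex dot product's real part.
    // Lanes are interleaved { re, im, re, im, ... }; sign_flip_vec carries the sign bit
    // only in the odd (imaginary) lanes.
    static inline __m512 dot_f32c_real_step(__m512 a_vec, __m512 b_vec, __m512 ab_real_vec) {
        __m512i sign_flip_vec = _mm512_set1_epi64((long long)0x8000000000000000ull);
        __m512 b_flipped = _mm512_castsi512_ps(
            _mm512_xor_si512(_mm512_castps_si512(b_vec), sign_flip_vec)); // { br, -bi, ... }
        return _mm512_fmadd_ps(a_vec, b_flipped, ab_real_vec);            // += { ar*br, -ai*bi, ... }
    }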
@ashvardanian added the enhancement, help wanted, and good first issue labels on Sep 4, 2024
@MarkReedZ
Contributor

MarkReedZ commented Sep 9, 2024

I made the change for Skylake. As we don't have an f16 version of fmaddsub, I skipped those variants.

https://github.com/MarkReedZ/SimSIMD/tree/fmaddsub
https://godbolt.org/z/T1fhxsx7K

The tests pass, but the dot_f32c/f64c variants may have a change in behavior, as the error deltas differ. Needs review.

New:
dot_f64c_skylake_128d/min_time:10.000/threads:1        40.8 ns         40.7 ns    295875661 abs_delta=0.124342 bytes=50.2581G/s pairs=24.5401M/s relative_error=33.8808
vdot_f64c_skylake_128d/min_time:10.000/threads:1       46.0 ns         46.0 ns    318435604 abs_delta=18.9837a bytes=44.5111G/s pairs=21.7339M/s relative_error=-286.436a

Old:
dot_f64c_skylake_128d/min_time:10.000/threads:1        43.8 ns         43.8 ns    252979290 abs_delta=19.475a bytes=46.8086G/s pairs=22.8558M/s relative_error=336.66a
vdot_f64c_skylake_128d/min_time:10.000/threads:1       56.5 ns         56.5 ns    256321669 abs_delta=21.4469a bytes=36.2777G/s pairs=17.7137M/s relative_error=-185.992a

@ashvardanian
Owner Author

@MarkReedZ, the first error looks huge. Any chance it contains a mistake?

@ashvardanian
Owner Author

Any chance we need to negate the odd elements of the ab_real_vec after the main loop?

@ashvardanian
Owner Author

For testing purposes, I’d also recommend setting the number of dimensions to a small value, like 8, to see errors more clearly 🤗

@MarkReedZ
Contributor

MarkReedZ commented Sep 9, 2024

For my testing I have a test.c where I plug in the new function against the serial version. Claude in Cursor successfully does this for me without typos, which is 🤗. I haven't checked this yet, though.

Any chance we need to negate the odd elements of the ab_real_vec after the main loop?

My reading of fmaddsub is that it multiplies all the entries and then subtracts alternate ones in dst, so we should get ar*br - ai*bi. I'll add some prints and see what's up later, as this seems correct.

From
    ab_real_vec = _mm512_fmadd_ps(_mm512_castsi512_ps(_mm512_xor_si512(b_vec, sign_flip_vec)), a_vec, ab_real_vec);

To
    // ab_real += ar * br - ai * bi;
    // fmaddsub adds the odd entries and subtracts the even (real) ones
    ab_real_vec = _mm512_fmaddsub_ps(a_vec, b_vec, ab_real_vec);
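
For reference, a scalar model of what _mm512_fmaddsub_ps(a, b, c) computes per the Intel intrinsics guide; the alternating sign is applied to the accumulator c, not between the two products:

    // Scalar model of _mm512_fmaddsub_ps(a, b, c): even lanes get a*b - c, odd lanes get a*b + c.
    static void fmaddsub_ps_model(float const a[16], float const b[16], float const c[16], float dst[16]) {
        for (int j = 0; j < 16; ++j)
            dst[j] = (j % 2 == 0) ? a[j] * b[j] - c[j] : a[j] * b[j] + c[j];
    }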

@MarkReedZ
Contributor

MarkReedZ commented Sep 11, 2024

I don't think fmaddsub makes sense to use. It's (a*b) plus or minus c.

Your comment makes sense as a rewrite of the original code to do the negation once at the end. We don't need to flip sign bits within the loop if we just FMA the entire vector and then flip the sign of the odd lanes once at the end, before the final reduction.

I made this change for Skylake, and can do the same for Haswell if we think it is significant.

//   ar * br - ai * bi

a = { 5,1,  5,1 }
b = { 5,1,  5,1 }
fmaddsub (even lanes subtract the accumulator, odd lanes add it):
loop 1:  { 5*5-0, 1*1+0 }  == { 25, 1 }    // Wrong
loop 2:  { 5*5-25, 1*1+1 } == { 0, 2 }     // Wrong

Original fmadd with xor in loop: 
loop 1:  { 5*5, -1*1 }  == { 25,-1 }
loop 2:  { 5*5+25,  -1*1+ -1 } == { 50, -2 }

New fmadd with xor at the end 
loop 1:  { 5*5,  1*1 }  == { 25, 1 }
loop 2:  { 5*5+25,  1*1+ 1 } == { 50, 2 }
Xor:  { 50, -2 }
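
A minimal sketch of that hoisted-XOR variant for the f32 case, with hypothetical names and no tail handling (the real kernels use masked loads and a different structure), just to make the idea concrete:

    #include <immintrin.h>
    #include <stddef.h>

    // Hypothetical f32 complex dot product with the sign flip hoisted out of the loop.
    // Assumes n_pairs is a multiple of 8 (8 complex pairs, i.e. 16 floats, per register).
    static void dot_f32c_hoisted_xor(float const *a, float const *b, size_t n_pairs,
                                     float *real_out, float *imag_out) {
        __m512 ab_real_vec = _mm512_setzero_ps();
        __m512 ab_imag_vec = _mm512_setzero_ps();
        for (size_t i = 0; i != n_pairs; i += 8) {
            __m512 a_vec = _mm512_loadu_ps(a + 2 * i);                  // { ar0, ai0, ar1, ai1, ... }
            __m512 b_vec = _mm512_loadu_ps(b + 2 * i);                  // { br0, bi0, br1, bi1, ... }
            __m512 b_swap = _mm512_permute_ps(b_vec, 0xB1);             // { bi0, br0, bi1, br1, ... }
            ab_real_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_real_vec);   // += { ar*br, ai*bi, ... }
            ab_imag_vec = _mm512_fmadd_ps(a_vec, b_swap, ab_imag_vec);  // += { ar*bi, ai*br, ... }
        }
        // Flip the sign of the odd (ai*bi) lanes once, after the loop:
        __m512i sign_flip_vec = _mm512_set1_epi64((long long)0x8000000000000000ull);
        ab_real_vec = _mm512_castsi512_ps(
            _mm512_xor_si512(_mm512_castps_si512(ab_real_vec), sign_flip_vec));
        *real_out = _mm512_reduce_add_ps(ab_real_vec); // sum(ar*br - ai*bi)
        *imag_out = _mm512_reduce_add_ps(ab_imag_vec); // sum(ar*bi + ai*br)
    }

The loop body keeps the two FMAs and one shuffle, and the single XOR moves outside the loop.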

XOR at end:

dot_f32c_skylake_1536d/min_time:10.000/threads:1         491 ns          491 ns     29059719 abs_delta=3.49868n bytes=25.0227G/s pairs=2.03635M/s relative_error=-582.366n

Original:

dot_f32c_skylake_1536d/min_time:10.000/threads:1         519 ns          519 ns     22944179 abs_delta=4.07972n bytes=21.3485G/s pairs=1.73734M/s relative_error=-197.711n

@ashvardanian
Owner Author

@MarkReedZ, 5% is also a win 😄

@MarkReedZ
Contributor

MarkReedZ commented Sep 12, 2024

Looks like a 2-10% improvement across the updated complex products.

bf16c_genoa would see a 25% improvement, but the bf16 intrinsics multiply and pairwise-accumulate into f32, so we can't move the XOR out of the loop.
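
For context, a scalar model of one f32 accumulator lane of _mm512_dpbf16_ps: both products of an adjacent bf16 pair are summed into the same lane with a fixed '+', so the ar*br - ai*bi sign has to be baked into the inputs via the XOR inside the loop:

    // Scalar model of one f32 lane of _mm512_dpbf16_ps(acc, a, b),
    // with the bf16 inputs shown already widened to float:
    static inline float dpbf16_lane(float acc, float a_re, float a_im, float b_re, float b_im) {
        return acc + a_re * b_re + a_im * b_im; // the pairwise '+' happens inside the instruction
    }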

TODO: review NEON, SVE, and Sapphire.
