
Try improving complex products on Skylake with _mm512_fmaddsub #159

Closed
ashvardanian opened this issue Sep 4, 2024 · 8 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@ashvardanian
Owner

Intel Skylake and many newer CPU generations with AVX-512 support have _mm512_fmaddsub_* intrinsics that perform a fused multiply-add with different signs for elements at even and odd positions. The current complex dot products perform { 2 FMA + XOR + PSHUFB } for every 32 pairs of scalars. It won't result in huge performance gains, but using this intrinsic we could drop one XOR and use one less register.

This can affect:

  • simsimd_dot_f64c_skylake
  • simsimd_vdot_f64c_skylake
  • simsimd_dot_f32c_skylake
  • simsimd_vdot_f32c_skylake
  • simsimd_dot_bf16c_genoa
  • simsimd_vdot_bf16c_genoa
  • simsimd_dot_f16c_sapphire
  • simsimd_vdot_f16c_sapphire
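
For reference, a minimal sketch of the current real-part pattern the issue refers to, with hypothetical names (not the actual SimSIMD macro): the XOR flips the sign of the imaginary lanes of b on every iteration, so a single FMA accumulates { ar*br, -ai*bi, ... }; the imaginary part uses a second FMA with a shuffled copy of b instead of the XOR.

    #include <immintrin.h>

    // Hypothetical sketch of one 512-bit step of the f32 complex dot product's real part.
    // Lanes are interleaved { re, im, re, im, ... }; sign_flip_vec carries the sign bit
    // only in the odd (imaginary) lanes.
    static inline __m512 dot_f32c_real_step(__m512 a_vec, __m512 b_vec, __m512 ab_real_vec) {
        __m512i sign_flip_vec = _mm512_set1_epi64((long long)0x8000000000000000ull);
        __m512 b_flipped = _mm512_castsi512_ps(
            _mm512_xor_si512(_mm512_castps_si512(b_vec), sign_flip_vec)); // { br, -bi, ... }
        return _mm512_fmadd_ps(a_vec, b_flipped, ab_real_vec);            // += { ar*br, -ai*bi, ... }
    }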
@ashvardanian added the enhancement, help wanted, and good first issue labels on Sep 4, 2024
@MarkReedZ
Contributor

MarkReedZ commented Sep 9, 2024

I made the change for Skylake. As we don't have an f16 version of fmaddsub, I skipped those variants.

https://github.com/MarkReedZ/SimSIMD/tree/fmaddsub
https://godbolt.org/z/T1fhxsx7K

The tests pass, but the dot_f32c/f64c variants may have a change in behavior, as the error deltas differ. Needs review.

New:
dot_f64c_skylake_128d/min_time:10.000/threads:1        40.8 ns         40.7 ns    295875661 abs_delta=0.124342 bytes=50.2581G/s pairs=24.5401M/s relative_error=33.8808
vdot_f64c_skylake_128d/min_time:10.000/threads:1       46.0 ns         46.0 ns    318435604 abs_delta=18.9837a bytes=44.5111G/s pairs=21.7339M/s relative_error=-286.436a

Old:
dot_f64c_skylake_128d/min_time:10.000/threads:1        43.8 ns         43.8 ns    252979290 abs_delta=19.475a bytes=46.8086G/s pairs=22.8558M/s relative_error=336.66a
vdot_f64c_skylake_128d/min_time:10.000/threads:1       56.5 ns         56.5 ns    256321669 abs_delta=21.4469a bytes=36.2777G/s pairs=17.7137M/s relative_error=-185.992a

@ashvardanian
Owner Author

@MarkReedZ, the first error looks huge. Any chance it contains a mistake?

@ashvardanian
Owner Author

Any chance we need to negate the odd elements of the ab_real_vec after the main loop?

@ashvardanian
Owner Author

For testing purposes, I’d also recommend setting the number of dimensions to a small value, like 8, to see errors more clearly 🤗

@MarkReedZ
Contributor

MarkReedZ commented Sep 9, 2024

For my testing I have a test.c where I plug in the new function against the serial version. Claude in Cursor successfully does this for me without typos, which is 🤗. I haven't checked this yet, though.

Any chance we need to negate the odd elements of the ab_real_vec after the main loop?

My reading of fmaddsub is that it multiplies all the entries and then subtracts alternate ones in dst, so we should get ar*br - ai*bi. I'll add some prints and see what's up later, as this seems correct.

From
    ab_real_vec = _mm512_fmadd_ps(_mm512_castsi512_ps(_mm512_xor_si512(b_vec, sign_flip_vec)), a_vec, ab_real_vec);

To
    // ab_real += ar * br - ai * bi;
    // fmaddsub adds the odd entries and subtracts the even (real) ones
    ab_real_vec = _mm512_fmaddsub_ps(a_vec, b_vec, ab_real_vec);
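
For reference, a scalar model of what _mm512_fmaddsub_ps(a, b, c) computes per the Intel intrinsics guide; the alternating sign is applied to the accumulator c, not between the two products:

    // Scalar model of _mm512_fmaddsub_ps(a, b, c): even lanes get a*b - c, odd lanes get a*b + c.
    static void fmaddsub_ps_model(float const a[16], float const b[16], float const c[16], float dst[16]) {
        for (int j = 0; j < 16; ++j)
            dst[j] = (j % 2 == 0) ? a[j] * b[j] - c[j] : a[j] * b[j] + c[j];
    }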

@MarkReedZ
Contributor

MarkReedZ commented Sep 11, 2024

I don't think fmaddsub makes sense to use. It's (a*b) plus or minus c.

Your comment makes sense as a rewrite of the original code to do the negation once at the end. We don't need to flip sign bits within the loop if we just FMA the entire vector and then flip the sign of the odd lanes once at the end, before the final reduction.

I made this change for Skylake, and can do the same for Haswell if we think it is significant.

//   ar * br - ai * bi

a = { 5,1,  5,1 }
b = { 5,1,  5,1 }
fmaddsub (even lanes subtract the accumulator, odd lanes add it):
loop 1:  { 5*5-0, 1*1+0 }  == { 25, 1 }    // Wrong
loop 2:  { 5*5-25, 1*1+1 } == { 0, 2 }     // Wrong

Original fmadd with xor in loop: 
loop 1:  { 5*5, -1*1 }  == { 25,-1 }
loop 2:  { 5*5+25,  -1*1+ -1 } == { 50, -2 }

New fmadd with xor at the end 
loop 1:  { 5*5,  1*1 }  == { 25, 1 }
loop 2:  { 5*5+25,  1*1+ 1 } == { 50, 2 }
Xor:  { 50, -2 }
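
A minimal sketch of that hoisted-XOR variant for the f32 case, with hypothetical names and no tail handling (the real kernels use masked loads and a different structure), just to make the idea concrete:

    #include <immintrin.h>
    #include <stddef.h>

    // Hypothetical f32 complex dot product with the sign flip hoisted out of the loop.
    // Assumes n_pairs is a multiple of 8 (8 complex pairs, i.e. 16 floats, per register).
    static void dot_f32c_hoisted_xor(float const *a, float const *b, size_t n_pairs,
                                     float *real_out, float *imag_out) {
        __m512 ab_real_vec = _mm512_setzero_ps();
        __m512 ab_imag_vec = _mm512_setzero_ps();
        for (size_t i = 0; i != n_pairs; i += 8) {
            __m512 a_vec = _mm512_loadu_ps(a + 2 * i);                  // { ar0, ai0, ar1, ai1, ... }
            __m512 b_vec = _mm512_loadu_ps(b + 2 * i);                  // { br0, bi0, br1, bi1, ... }
            __m512 b_swap = _mm512_permute_ps(b_vec, 0xB1);             // { bi0, br0, bi1, br1, ... }
            ab_real_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_real_vec);   // += { ar*br, ai*bi, ... }
            ab_imag_vec = _mm512_fmadd_ps(a_vec, b_swap, ab_imag_vec);  // += { ar*bi, ai*br, ... }
        }
        // Flip the sign of the odd (ai*bi) lanes once, after the loop:
        __m512i sign_flip_vec = _mm512_set1_epi64((long long)0x8000000000000000ull);
        ab_real_vec = _mm512_castsi512_ps(
            _mm512_xor_si512(_mm512_castps_si512(ab_real_vec), sign_flip_vec));
        *real_out = _mm512_reduce_add_ps(ab_real_vec); // sum(ar*br - ai*bi)
        *imag_out = _mm512_reduce_add_ps(ab_imag_vec); // sum(ar*bi + ai*br)
    }

The loop body keeps the two FMAs and one shuffle, and the single XOR moves outside the loop.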

XOR at end:

dot_f32c_skylake_1536d/min_time:10.000/threads:1         491 ns          491 ns     29059719 abs_delta=3.49868n bytes=25.0227G/s pairs=2.03635M/s relative_error=-582.366n

Original:

dot_f32c_skylake_1536d/min_time:10.000/threads:1         519 ns          519 ns     22944179 abs_delta=4.07972n bytes=21.3485G/s pairs=1.73734M/s relative_error=-197.711n

@ashvardanian
Owner Author

@MarkReedZ, 5% is also a win 😄

@MarkReedZ
Contributor

MarkReedZ commented Sep 12, 2024

Looks like a 2-10% improvement across the updated complex products.

bf16c_genoa would see a 25% improvement, but the bf16 intrinsics multiply and pairwise-accumulate into f32, so we can't move the XOR out of the loop.
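
For context, a scalar model of one f32 accumulator lane of _mm512_dpbf16_ps: both products of an adjacent bf16 pair are summed into the same lane with a fixed '+', so the ar*br - ai*bi sign has to be baked into the inputs via the XOR inside the loop:

    // Scalar model of one f32 lane of _mm512_dpbf16_ps(acc, a, b),
    // with the bf16 inputs shown already widened to float:
    static inline float dpbf16_lane(float acc, float a_re, float a_im, float b_re, float b_im) {
        return acc + a_re * b_re + a_im * b_im; // the pairwise '+' happens inside the instruction
    }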

TODO: review NEON, SVE, and Sapphire.
