Optimizing substract_bf16x32_genoa #160

Closed
ashvardanian opened this issue Sep 5, 2024 · 3 comments · Fixed by #161
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@ashvardanian
Owner

Can this be reduced to 2x subtractions, 2x shuffles, & 1 blend?
image

@ashvardanian ashvardanian added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Sep 5, 2024
ashvardanian added a commit that referenced this issue Sep 5, 2024
@ashvardanian ashvardanian linked a pull request Sep 8, 2024 that will close this issue
@ashvardanian
Owner Author

@MarkReedZ, if you look into this, it's better to merge into the linked feature branch ;)

@MarkReedZ
Contributor

Was this what you were thinking? Several instructions shorter, but only 0.5% faster. I like the _mm512_permutex2var_epi16, but your original unpacking was readable.

Godbolt: https://godbolt.org/z/aPYf55s81

    //  The following code expands a packed bf16 __m512i into two f32 vectors for the subtraction,
    //  then packs the results back into bf16.
    __m512i zero = _mm512_setzero_si512();
    __m512i idx_bot = _mm512_set_epi8(
        31, 30,  0,  0, 29, 28,  0,  0, 27, 26,  0,  0, 25, 24,  0,  0,
        23, 22,  0,  0, 21, 20,  0,  0, 19, 18,  0,  0, 17, 16,  0,  0,
        15, 14,  0,  0, 13, 12,  0,  0, 11, 10,  0,  0,  9,  8,  0,  0,
         7,  6,  0,  0,  5,  4,  0,  0,  3,  2,  0,  0,  1,  0,  0,  0
    );
    __m512i idx_top = _mm512_set_epi8(
        63, 62, 0, 0, 61, 60, 0, 0, 59, 58, 0, 0, 57, 56, 0, 0,
        55, 54, 0, 0, 53, 52, 0, 0, 51, 50, 0, 0, 49, 48, 0, 0,
        47, 46, 0, 0, 45, 44, 0, 0, 43, 42, 0, 0, 41, 40, 0, 0,
        39, 38, 0, 0, 37, 36, 0, 0, 35, 34, 0, 0, 33, 32, 0, 0
    );

    __m512i a_top = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8 (idx_top, a_i16));
    __m512i a_bot = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8 (idx_bot, a_i16));

    __m512i b_top = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8 (idx_top, b_i16));
    __m512i b_bot = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8 (idx_bot, b_i16));
                                
    __m512 d_top = _mm512_sub_ps( _mm512_castsi512_ps(a_top), _mm512_castsi512_ps(b_top) );
    __m512 d_bot = _mm512_sub_ps( _mm512_castsi512_ps(a_bot), _mm512_castsi512_ps(b_bot) );
                   
    __m512i indices2 = _mm512_set_epi16(
        31, 29, 27, 25, 23, 21, 19, 17,
        15, 13, 11, 9, 7, 5, 3, 1,
        63, 61, 59, 57, 55, 53, 51, 49,
        47, 45, 43, 41, 39, 37, 35, 33
    );
    return _mm512_permutex2var_epi16( _mm512_castps_si512(d_top), indices2, _mm512_castps_si512(d_bot) );
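
For checking a vectorized version like the one above, a scalar model of the same arithmetic can be handy. The sketch below is not from the repository: the names `bf16_to_f32`, `f32_to_bf16_truncate`, and `subtract_bf16x32_scalar` are hypothetical. It assumes, as the snippet appears to do, that each bf16 value is widened by placing its 16 bits in the upper half of an f32, and that the result is narrowed by keeping only the upper 16 bits (truncation rather than round-to-nearest).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen one bf16 value (raw bits) to f32 by placing it in the upper 16 bits. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow f32 to bf16 by keeping the upper 16 bits (truncation, no rounding). */
static uint16_t f32_to_bf16_truncate(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical scalar reference for a 32-lane bf16 subtraction. */
static void subtract_bf16x32_scalar(const uint16_t a[32], const uint16_t b[32],
                                    uint16_t out[32]) {
    for (int i = 0; i < 32; ++i)
        out[i] = f32_to_bf16_truncate(bf16_to_f32(a[i]) - bf16_to_f32(b[i]));
}
```

Comparing the 32 output words of the SIMD routine against this loop on random inputs would catch lane-ordering mistakes in the permutation indices. If the production kernel rounds instead of truncating, the two will differ in the last bit for some inputs.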

@ashvardanian
Owner Author

I was thinking about something different, @MarkReedZ. I'll try today.
