Completes SSE and adds some MMX intrinsics #247

gnzlbg · 2017-12-22T12:55:49Z

MMX:

_mm_cmpgt_pi{8,16,32}
_mm_unpack{hi,lo}_pi{8,16,32}

SSE (is now complete):

_mm_cvtp{i,u}{8,16}_ps
add test for _m_pmulhuw

alexcrichton · 2017-12-22T16:15:36Z

How come the instruction limit was increased? And how come some of the mmx intrinsics use the sse feature?

gnzlbg · 2017-12-22T17:00:47Z

The instruction limit was increased because for 32 bit targets LLVM generates many instructions. Why this is the case? I don’t know but the resulting code contains the intrinsic and works. I don’t know why SSE is required, without it llvm fails/failed to generate the code. This was already the case for the already implemented MMX intrinsics (might be that MMX wasn’t whitelisted before and that i never got to retry once it became whitelisted; there is a FIXME for this, but somebody has to investigate it).

…

On Fri 22. Dec 2017 at 17:15, Alex Crichton ***@***.***> wrote: How come the instruction limit was increased? And how come some of the mmx intrinsics use the sse feature? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#247 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA3NphfiXmH1i9Mw7psxs5Nu5NTprBg9ks5tC9WpgaJpZM4RLAph> .

alexcrichton · 2017-12-22T20:24:37Z

Can you gist the code that LLVM generates? Are there any clues why it's generating so much code?

Could we try removing the +sse and see what happens?

gnzlbg · 2018-01-03T12:41:51Z

So the +sse is no longer required. It probably was required initially because mmx wasn't white-listed.

gnzlbg · 2018-01-03T13:05:03Z

These are the dissasemblies:

__mm_cvtpi8_ps_cvtpi2ps

---- x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpi8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x18,%esp 
	 3: call 28eeb <_ZN8coresimd3x864i6863sse13_mm_cvtpi8_ps17h8142cdf4acba5c00E+0xb> 
	 4: pop %eax 
	 5: add $0x169dbd,%eax 
	 6: pand -0x84f18(%eax),%xmm0 
	 7: 
	 8: packuswb %xmm0,%xmm0 
	 9: movq %xmm0,-0x8(%ebp) 
	10: pxor %xmm0,%xmm0 
	11: movq -0x8(%ebp),%mm0 
	12: movq %xmm0,-0x10(%ebp) 
	13: movq -0x10(%ebp),%mm1 
	14: movq %xmm0,-0x18(%ebp) 
	15: movq -0x18(%ebp),%mm3 
	16: pcmpgtb %mm0,%mm1 
	17: punpcklbw %mm1,%mm0 
	18: pcmpgtw %mm0,%mm3 
	19: movq %mm0,%mm2 
	20: punpckhwd %mm3,%mm2 
	21: punpcklwd %mm3,%mm0 
	22: cvtpi2ps %mm2,%xmm0 
	23: movlhps %xmm0,%xmm0 
	24: cvtpi2ps %mm0,%xmm0 
	25: add $0x18,%esp 
	26: pop %ebp 
	27: ret 
	28: xchg %ax,%ax

__mm_cvtpu8_ps_cvtpi2ps

---- x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpu8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x18,%esp 
	 3: call 28f8b <_ZN8coresimd3x864i6863sse13_mm_cvtpu8_ps17h02bd6822e8032715E+0xb> 
	 4: pop %eax 
	 5: add $0x169d1d,%eax 
	 6: pand -0x84f18(%eax),%xmm0 
	 7: 
	 8: packuswb %xmm0,%xmm0 
	 9: movq %xmm0,-0x8(%ebp) 
	10: pxor %xmm0,%xmm0 
	11: movq -0x8(%ebp),%mm0 
	12: movq %xmm0,-0x10(%ebp) 
	13: punpcklbw -0x10(%ebp),%mm0 
	14: movq %xmm0,-0x18(%ebp) 
	15: movq -0x18(%ebp),%mm1 
	16: movq %mm0,%mm2 
	17: pcmpgtw %mm0,%mm1 
	18: punpckhwd %mm1,%mm2 
	19: punpcklwd %mm1,%mm0 
	20: cvtpi2ps %mm2,%xmm0 
	21: movlhps %xmm0,%xmm0 
	22: cvtpi2ps %mm0,%xmm0 
	23: add $0x18,%esp 
	24: pop %ebp 
	25: ret 
	26: xchg %ax,%ax 
	27: xchg %ax,%ax 
	28: xchg %ax,%ax 
	29: xchg %ax,%ax 
	30: nop

alexcrichton · 2018-01-03T15:49:49Z

Those disassemblies indidcates bugs to me rather then a suggestion we should raise the instruction limit? In the first one the call to _mm_cvtpi8_ps isn't inlined and in the second the call to _mm_cvtpu8_ps isn't inlined. In both it seems like they're probably not corresponding to whatever Intel intended?

gnzlbg · 2018-01-03T16:13:11Z

In the first one the call to _mm_cvtpi8_ps isn't inlined and in the second the call to _mm_cvtpu8_ps isn't inlined.

I've added a check for failing inlining to see if these are the only places where this happens (blame says that this limit was 20 before, but was raised to 30 here: f3f5a9c#diff-4e26d84d43aa89efbb3b16f299e18ec7R274).

Inlining is also broken in the following places:

`i586-unknown-linux-gnu`

__mm256_alignr_epi8
__mm256_i32gather_pd
__mm256_i32gather_ps
__mm256_i64gather_pd
__mm256_i64gather_ps
__mm_i32gather_pd
__mm_i32gather_ps
__mm_i64gather_pd
__mm_i64gather_ps

`i686-unknown-linux-gnu`

__mm256_alignr_epi8
__mm256_i32gather_pd
__mm256_i32gather_ps
__mm256_i64gather_pd
__mm256_i64gather_ps
__mm_i32gather_pd
__mm_i32gather_ps
__mm_i64gather_pd
__mm_i64gather_ps
__mm_cmpgt_pi8
__mm_unpackhi_pi8
__mm_unpacklo_pi8
__m_maskmovq
__m_pavgb
__m_pmaxub
__m_pminub
__m_psadbw
__mm_avg_pu8
__mm_cvtpi8_ps
__mm_cvtpu8_ps
__mm_maskmove_si64
__mm_max_pu8
__mm_min_pu8
__mm_sad_pu8
__mm_abs_pi16
__mm_abs_pi8
__mm_alignr_pi8
__mm_hadd_pi16
__mm_hadds_pi16
__mm_hsub_pi16
__mm_hsubs_pi16
__mm_maddubs_pi16
__mm_mulhrs_pi16
__mm_shuffle_pi8
__mm_sign_pi16
__mm_sign_pi8

alexcrichton · 2018-01-03T16:15:30Z

Oh dear that sounds bad! Want some help in fixing those?

gnzlbg · 2018-01-03T16:21:50Z

Oh dear that sounds bad! Want some help in fixing those?

I could move these intrinsics to the x86_64 module and be done with it... but... the functions are marked as #[inline(always)] yet they are not getting inlined.

It is not obvious to me why this is the case, are we hitting rust-lang/rust#44367 ?

alexcrichton · 2018-01-03T22:30:43Z

I've fixed a number of assertions about call instructions in #261, but some of them were actually false positives in that they're just PIC related business (like the two you gisted above).

I think in general though on x86 the 64-bit vector types that aren't __m64 should be avoided. For LLVM to work they need to be the underlying x86_mmx type and only __m64 is defined as that. The other types in theory should be defined in such a way as x86_mmx, but for now they aren't.

MMX: - `_mm_cmpgt_pi{8,16,32}` - `_mm_unpack{hi,lo}_pi{8,16,32}` SSE (is now complete): - `_mm_cvtp{i,u}{8,16}_ps` - add test for `_m_pmulhuw`

gnzlbg · 2018-01-04T14:10:01Z

4 intrinsics are still failing because they require 23, 22 (2x), and 21 instructions.

alexcrichton · 2018-01-04T14:33:04Z

Are they mnx things that aren't taking __m64? If so thats probably the fix

gnzlbg · 2018-01-04T15:06:01Z

Are they mnx things that aren't taking __m64?

All mmx things are taking __m64.

The SSE test failing are (with intel intrinsics guide definitions):

_mm_cvtpi16_ps: __m128 _mm_cvtpi16_ps (__m64 a)
_mm_cvtpi8_ps: __m128 _mm_cvtpi8_ps (__m64 a)
_mm_cvtpu16_ps: __m128 _mm_cvtpu16_ps (__m64 a)
_mm_cvtpu8_ps: __m128 _mm_cvtpu8_ps (__m64 a)

Their code is:

/// Converts the lower 4 8-bit values of `a` into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpi8_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_cmpgt_pi8(b, a);
    let b = mmx::_mm_unpacklo_pi8(a, b);
    _mm_cvtpi16_ps(b)
}

/// Converts the lower 4 8-bit values of `a` into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpu8_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_unpacklo_pi8(a, b);
    _mm_cvtpi16_ps(b)
}

/// Converts a 64-bit vector of `i16`s into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpi16_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_cmpgt_pi16(b, a);
    let c = mmx::_mm_unpackhi_pi16(a, b);
    let r = i586::_mm_setzero_ps();
    let r = cvtpi2ps(r, c);
    let r = i586::_mm_movelh_ps(r, r);
    let c = mmx::_mm_unpacklo_pi16(a, b);
    cvtpi2ps(r, c)
}

/// Converts a 64-bit vector of `i16`s into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpu16_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let c = mmx::_mm_unpackhi_pi16(a, b);
    let r = i586::_mm_setzero_ps();
    let r = cvtpi2ps(r, c);
    let r = i586::_mm_movelh_ps(r, r);
    let c = mmx::_mm_unpacklo_pi16(a, b);
    cvtpi2ps(r, c)
}

It might just be that these functions are longer than 20 instructions on 32-bit. The disassembly doesn't show any call instruction:

---- x86::i686::sse::assert__mm_cvtpi16_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpi16_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: movq %mm0,%mm2 
	 6: xorps %xmm0,%xmm0 
	 7: movq -0x8(%ebp),%mm1 
	 8: pcmpgtw %mm0,%mm1 
	 9: punpckhwd %mm1,%mm2 
	10: punpcklwd %mm1,%mm0 
	11: cvtpi2ps %mm2,%xmm0 
	12: movlhps %xmm0,%xmm0 
	13: cvtpi2ps %mm0,%xmm0 
	14: add $0x8,%esp 
	15: pop %ebp 
	16: ret 
	17: xchg %ax,%ax 
	18: xchg %ax,%ax 
	19: xchg %ax,%ax 
	20: xchg %ax,%ax 
	21: xchg %ax,%ax 
	22: nop 
thread 'x86::i686::sse::assert__mm_cvtpi16_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 23 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
---- x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpi8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: xorps %xmm0,%xmm0 
	 6: movq -0x8(%ebp),%mm1 
	 7: movq %mm1,%mm2 
	 8: pcmpgtb %mm0,%mm2 
	 9: punpcklbw %mm2,%mm0 
	10: pcmpgtw %mm0,%mm1 
	11: movq %mm0,%mm2 
	12: punpckhwd %mm1,%mm2 
	13: punpcklwd %mm1,%mm0 
	14: cvtpi2ps %mm2,%xmm0 
	15: movlhps %xmm0,%xmm0 
	16: cvtpi2ps %mm0,%xmm0 
	17: add $0x8,%esp 
	18: pop %ebp 
	19: ret 
	20: xchg %ax,%ax 
thread 'x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 21 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
---- x86::i686::sse::assert__mm_cvtpu16_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpu16_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: movq %mm0,%mm2 
	 6: xorps %xmm0,%xmm0 
	 7: movq -0x8(%ebp),%mm1 
	 8: punpckhwd %mm1,%mm2 
	 9: punpcklwd %mm1,%mm0 
	10: cvtpi2ps %mm2,%xmm0 
	11: movlhps %xmm0,%xmm0 
	12: cvtpi2ps %mm0,%xmm0 
	13: add $0x8,%esp 
	14: pop %ebp 
	15: ret 
	16: xchg %ax,%ax 
	17: xchg %ax,%ax 
	18: xchg %ax,%ax 
	19: xchg %ax,%ax 
	20: xchg %ax,%ax 
	21: xchg %ax,%ax 
	22: xchg %ax,%ax 
thread 'x86::i686::sse::assert__mm_cvtpu16_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 23 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
---- x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpu8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: xorps %xmm0,%xmm0 
	 6: movq -0x8(%ebp),%mm1 
	 7: punpcklbw %mm1,%mm0 
	 8: pcmpgtw %mm0,%mm1 
	 9: movq %mm0,%mm2 
	10: punpckhwd %mm1,%mm2 
	11: punpcklwd %mm1,%mm0 
	12: cvtpi2ps %mm2,%xmm0 
	13: movlhps %xmm0,%xmm0 
	14: cvtpi2ps %mm0,%xmm0 
	15: add $0x8,%esp 
	16: pop %ebp 
	17: ret 
	18: xchg %ax,%ax 
	19: xchg %ax,%ax 
	20: xchg %ax,%ax 
	21: xchg %ax,%ax 
thread 'x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 22 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9

alexcrichton · 2018-01-04T15:15:23Z

bummer :(

Want to add exceptions for them in stdsimd-test/src/lib.rs?

MaloJaffre · 2018-01-04T15:23:28Z

I noticed suspicious nop and xchg %ax,%ax (effectively no-ops) at the end of those fonctions
(after ret) and I think they must not be part of the instructions count.

alexcrichton · 2018-01-04T16:15:43Z

@MaloJaffre yeah I think that's mostly just the output of objdump, but adding a speific exception for these also seems ok

gnzlbg requested a review from alexcrichton December 22, 2017 15:08

gnzlbg force-pushed the finish_sse branch 2 times, most recently from 1c011af to fb28c3d Compare January 3, 2018 10:06

gnzlbg mentioned this pull request Jan 3, 2018

Minor license issues #259

Open

gnzlbg force-pushed the finish_sse branch from fb28c3d to a9d10f7 Compare January 3, 2018 11:19

gnzlbg force-pushed the finish_sse branch from a9d10f7 to c5b3483 Compare January 3, 2018 12:50

gnzlbg force-pushed the finish_sse branch from a812b4a to 5640258 Compare January 3, 2018 16:14

gnzlbg force-pushed the finish_sse branch 2 times, most recently from db38067 to 875c926 Compare January 4, 2018 09:28

gnzlbg added 2 commits January 4, 2018 12:19

Completes SSE and adds some MMX intrinsics

b36c7e8

MMX: - `_mm_cmpgt_pi{8,16,32}` - `_mm_unpack{hi,lo}_pi{8,16,32}` SSE (is now complete): - `_mm_cvtp{i,u}{8,16}_ps` - add test for `_m_pmulhuw`

fmt and clippy

275071e

gnzlbg force-pushed the finish_sse branch from 875c926 to 275071e Compare January 4, 2018 11:19

add an exception for intrinsics using cvtpi2ps

4b4ea61

alexcrichton merged commit bf6d801 into rust-lang:master Jan 4, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Completes SSE and adds some MMX intrinsics #247

Completes SSE and adds some MMX intrinsics #247

gnzlbg commented Dec 22, 2017

alexcrichton commented Dec 22, 2017

gnzlbg commented Dec 22, 2017 via email

alexcrichton commented Dec 22, 2017

gnzlbg commented Jan 3, 2018

gnzlbg commented Jan 3, 2018

alexcrichton commented Jan 3, 2018

gnzlbg commented Jan 3, 2018

alexcrichton commented Jan 3, 2018

gnzlbg commented Jan 3, 2018

alexcrichton commented Jan 3, 2018

gnzlbg commented Jan 4, 2018

alexcrichton commented Jan 4, 2018

gnzlbg commented Jan 4, 2018

alexcrichton commented Jan 4, 2018

MaloJaffre commented Jan 4, 2018 •

edited

Loading

alexcrichton commented Jan 4, 2018

Completes SSE and adds some MMX intrinsics #247

Completes SSE and adds some MMX intrinsics #247

Conversation

gnzlbg commented Dec 22, 2017

alexcrichton commented Dec 22, 2017

gnzlbg commented Dec 22, 2017 via email

alexcrichton commented Dec 22, 2017

gnzlbg commented Jan 3, 2018

gnzlbg commented Jan 3, 2018

__mm_cvtpi8_ps_cvtpi2ps

__mm_cvtpu8_ps_cvtpi2ps

alexcrichton commented Jan 3, 2018

gnzlbg commented Jan 3, 2018

i586-unknown-linux-gnu

i686-unknown-linux-gnu

alexcrichton commented Jan 3, 2018

gnzlbg commented Jan 3, 2018

alexcrichton commented Jan 3, 2018

gnzlbg commented Jan 4, 2018

alexcrichton commented Jan 4, 2018

gnzlbg commented Jan 4, 2018

alexcrichton commented Jan 4, 2018

MaloJaffre commented Jan 4, 2018 • edited Loading

alexcrichton commented Jan 4, 2018

`i586-unknown-linux-gnu`

`i686-unknown-linux-gnu`

MaloJaffre commented Jan 4, 2018 •

edited

Loading