Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completes SSE and adds some MMX intrinsics #247

Merged
merged 3 commits into from
Jan 4, 2018

Conversation

gnzlbg
Copy link
Contributor

@gnzlbg gnzlbg commented Dec 22, 2017

MMX:

  • _mm_cmpgt_pi{8,16,32}
  • _mm_unpack{hi,lo}_pi{8,16,32}

SSE (is now complete):

  • _mm_cvtp{i,u}{8,16}_ps
  • add test for _m_pmulhuw

@gnzlbg gnzlbg requested a review from alexcrichton December 22, 2017 15:08
@alexcrichton
Copy link
Member

How come the instruction limit was increased? And how come some of the mmx intrinsics use the sse feature?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Dec 22, 2017 via email

@alexcrichton
Copy link
Member

Can you gist the code that LLVM generates? Are there any clues why it's generating so much code?

Could we try removing the +sse and see what happens?

@gnzlbg gnzlbg force-pushed the finish_sse branch 2 times, most recently from 1c011af to fb28c3d Compare January 3, 2018 10:06
@gnzlbg gnzlbg mentioned this pull request Jan 3, 2018
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jan 3, 2018

So the +sse is no longer required. It probably was required initially because mmx wasn't white-listed.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jan 3, 2018

These are the dissasemblies:

__mm_cvtpi8_ps_cvtpi2ps

---- x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpi8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x18,%esp 
	 3: call 28eeb <_ZN8coresimd3x864i6863sse13_mm_cvtpi8_ps17h8142cdf4acba5c00E+0xb> 
	 4: pop %eax 
	 5: add $0x169dbd,%eax 
	 6: pand -0x84f18(%eax),%xmm0 
	 7: 
	 8: packuswb %xmm0,%xmm0 
	 9: movq %xmm0,-0x8(%ebp) 
	10: pxor %xmm0,%xmm0 
	11: movq -0x8(%ebp),%mm0 
	12: movq %xmm0,-0x10(%ebp) 
	13: movq -0x10(%ebp),%mm1 
	14: movq %xmm0,-0x18(%ebp) 
	15: movq -0x18(%ebp),%mm3 
	16: pcmpgtb %mm0,%mm1 
	17: punpcklbw %mm1,%mm0 
	18: pcmpgtw %mm0,%mm3 
	19: movq %mm0,%mm2 
	20: punpckhwd %mm3,%mm2 
	21: punpcklwd %mm3,%mm0 
	22: cvtpi2ps %mm2,%xmm0 
	23: movlhps %xmm0,%xmm0 
	24: cvtpi2ps %mm0,%xmm0 
	25: add $0x18,%esp 
	26: pop %ebp 
	27: ret 
	28: xchg %ax,%ax 

__mm_cvtpu8_ps_cvtpi2ps

---- x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpu8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x18,%esp 
	 3: call 28f8b <_ZN8coresimd3x864i6863sse13_mm_cvtpu8_ps17h02bd6822e8032715E+0xb> 
	 4: pop %eax 
	 5: add $0x169d1d,%eax 
	 6: pand -0x84f18(%eax),%xmm0 
	 7: 
	 8: packuswb %xmm0,%xmm0 
	 9: movq %xmm0,-0x8(%ebp) 
	10: pxor %xmm0,%xmm0 
	11: movq -0x8(%ebp),%mm0 
	12: movq %xmm0,-0x10(%ebp) 
	13: punpcklbw -0x10(%ebp),%mm0 
	14: movq %xmm0,-0x18(%ebp) 
	15: movq -0x18(%ebp),%mm1 
	16: movq %mm0,%mm2 
	17: pcmpgtw %mm0,%mm1 
	18: punpckhwd %mm1,%mm2 
	19: punpcklwd %mm1,%mm0 
	20: cvtpi2ps %mm2,%xmm0 
	21: movlhps %xmm0,%xmm0 
	22: cvtpi2ps %mm0,%xmm0 
	23: add $0x18,%esp 
	24: pop %ebp 
	25: ret 
	26: xchg %ax,%ax 
	27: xchg %ax,%ax 
	28: xchg %ax,%ax 
	29: xchg %ax,%ax 
	30: nop 

@alexcrichton
Copy link
Member

Those disassemblies indidcates bugs to me rather then a suggestion we should raise the instruction limit? In the first one the call to _mm_cvtpi8_ps isn't inlined and in the second the call to _mm_cvtpu8_ps isn't inlined. In both it seems like they're probably not corresponding to whatever Intel intended?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jan 3, 2018

In the first one the call to _mm_cvtpi8_ps isn't inlined and in the second the call to _mm_cvtpu8_ps isn't inlined.

I've added a check for failing inlining to see if these are the only places where this happens (blame says that this limit was 20 before, but was raised to 30 here: f3f5a9c#diff-4e26d84d43aa89efbb3b16f299e18ec7R274).

Inlining is also broken in the following places:

i586-unknown-linux-gnu

  • __mm256_alignr_epi8
  • __mm256_i32gather_pd
  • __mm256_i32gather_ps
  • __mm256_i64gather_pd
  • __mm256_i64gather_ps
  • __mm_i32gather_pd
  • __mm_i32gather_ps
  • __mm_i64gather_pd
  • __mm_i64gather_ps

i686-unknown-linux-gnu

  • __mm256_alignr_epi8
  • __mm256_i32gather_pd
  • __mm256_i32gather_ps
  • __mm256_i64gather_pd
  • __mm256_i64gather_ps
  • __mm_i32gather_pd
  • __mm_i32gather_ps
  • __mm_i64gather_pd
  • __mm_i64gather_ps
  • __mm_cmpgt_pi8
  • __mm_unpackhi_pi8
  • __mm_unpacklo_pi8
  • __m_maskmovq
  • __m_pavgb
  • __m_pmaxub
  • __m_pminub
  • __m_psadbw
  • __mm_avg_pu8
  • __mm_cvtpi8_ps
  • __mm_cvtpu8_ps
  • __mm_maskmove_si64
  • __mm_max_pu8
  • __mm_min_pu8
  • __mm_sad_pu8
  • __mm_abs_pi16
  • __mm_abs_pi8
  • __mm_alignr_pi8
  • __mm_hadd_pi16
  • __mm_hadds_pi16
  • __mm_hsub_pi16
  • __mm_hsubs_pi16
  • __mm_maddubs_pi16
  • __mm_mulhrs_pi16
  • __mm_shuffle_pi8
  • __mm_sign_pi16
  • __mm_sign_pi8

@alexcrichton
Copy link
Member

Oh dear that sounds bad! Want some help in fixing those?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jan 3, 2018

Oh dear that sounds bad! Want some help in fixing those?

I could move these intrinsics to the x86_64 module and be done with it... but... the functions are marked as #[inline(always)] yet they are not getting inlined.

It is not obvious to me why this is the case, are we hitting rust-lang/rust#44367 ?

@alexcrichton
Copy link
Member

I've fixed a number of assertions about call instructions in #261, but some of them were actually false positives in that they're just PIC related business (like the two you gisted above).

I think in general though on x86 the 64-bit vector types that aren't __m64 should be avoided. For LLVM to work they need to be the underlying x86_mmx type and only __m64 is defined as that. The other types in theory should be defined in such a way as x86_mmx, but for now they aren't.

@gnzlbg gnzlbg force-pushed the finish_sse branch 2 times, most recently from db38067 to 875c926 Compare January 4, 2018 09:28
gnzlbg added 2 commits January 4, 2018 12:19
MMX:

- `_mm_cmpgt_pi{8,16,32}`
- `_mm_unpack{hi,lo}_pi{8,16,32}`

SSE (is now complete):

- `_mm_cvtp{i,u}{8,16}_ps`
- add test for `_m_pmulhuw`
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jan 4, 2018

4 intrinsics are still failing because they require 23, 22 (2x), and 21 instructions.

@alexcrichton
Copy link
Member

Are they mnx things that aren't taking __m64? If so thats probably the fix

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jan 4, 2018

Are they mnx things that aren't taking __m64?

All mmx things are taking __m64.

The SSE test failing are (with intel intrinsics guide definitions):

  • _mm_cvtpi16_ps: __m128 _mm_cvtpi16_ps (__m64 a)
  • _mm_cvtpi8_ps: __m128 _mm_cvtpi8_ps (__m64 a)
  • _mm_cvtpu16_ps: __m128 _mm_cvtpu16_ps (__m64 a)
  • _mm_cvtpu8_ps: __m128 _mm_cvtpu8_ps (__m64 a)

Their code is:

/// Converts the lower 4 8-bit values of `a` into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpi8_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_cmpgt_pi8(b, a);
    let b = mmx::_mm_unpacklo_pi8(a, b);
    _mm_cvtpi16_ps(b)
}

/// Converts the lower 4 8-bit values of `a` into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpu8_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_unpacklo_pi8(a, b);
    _mm_cvtpi16_ps(b)
}

/// Converts a 64-bit vector of `i16`s into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpi16_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_cmpgt_pi16(b, a);
    let c = mmx::_mm_unpackhi_pi16(a, b);
    let r = i586::_mm_setzero_ps();
    let r = cvtpi2ps(r, c);
    let r = i586::_mm_movelh_ps(r, r);
    let c = mmx::_mm_unpacklo_pi16(a, b);
    cvtpi2ps(r, c)
}

/// Converts a 64-bit vector of `i16`s into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpu16_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let c = mmx::_mm_unpackhi_pi16(a, b);
    let r = i586::_mm_setzero_ps();
    let r = cvtpi2ps(r, c);
    let r = i586::_mm_movelh_ps(r, r);
    let c = mmx::_mm_unpacklo_pi16(a, b);
    cvtpi2ps(r, c)
}

It might just be that these functions are longer than 20 instructions on 32-bit. The disassembly doesn't show any call instruction:

---- x86::i686::sse::assert__mm_cvtpi16_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpi16_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: movq %mm0,%mm2 
	 6: xorps %xmm0,%xmm0 
	 7: movq -0x8(%ebp),%mm1 
	 8: pcmpgtw %mm0,%mm1 
	 9: punpckhwd %mm1,%mm2 
	10: punpcklwd %mm1,%mm0 
	11: cvtpi2ps %mm2,%xmm0 
	12: movlhps %xmm0,%xmm0 
	13: cvtpi2ps %mm0,%xmm0 
	14: add $0x8,%esp 
	15: pop %ebp 
	16: ret 
	17: xchg %ax,%ax 
	18: xchg %ax,%ax 
	19: xchg %ax,%ax 
	20: xchg %ax,%ax 
	21: xchg %ax,%ax 
	22: nop 
thread 'x86::i686::sse::assert__mm_cvtpi16_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 23 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
---- x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpi8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: xorps %xmm0,%xmm0 
	 6: movq -0x8(%ebp),%mm1 
	 7: movq %mm1,%mm2 
	 8: pcmpgtb %mm0,%mm2 
	 9: punpcklbw %mm2,%mm0 
	10: pcmpgtw %mm0,%mm1 
	11: movq %mm0,%mm2 
	12: punpckhwd %mm1,%mm2 
	13: punpcklwd %mm1,%mm0 
	14: cvtpi2ps %mm2,%xmm0 
	15: movlhps %xmm0,%xmm0 
	16: cvtpi2ps %mm0,%xmm0 
	17: add $0x8,%esp 
	18: pop %ebp 
	19: ret 
	20: xchg %ax,%ax 
thread 'x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 21 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
---- x86::i686::sse::assert__mm_cvtpu16_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpu16_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: movq %mm0,%mm2 
	 6: xorps %xmm0,%xmm0 
	 7: movq -0x8(%ebp),%mm1 
	 8: punpckhwd %mm1,%mm2 
	 9: punpcklwd %mm1,%mm0 
	10: cvtpi2ps %mm2,%xmm0 
	11: movlhps %xmm0,%xmm0 
	12: cvtpi2ps %mm0,%xmm0 
	13: add $0x8,%esp 
	14: pop %ebp 
	15: ret 
	16: xchg %ax,%ax 
	17: xchg %ax,%ax 
	18: xchg %ax,%ax 
	19: xchg %ax,%ax 
	20: xchg %ax,%ax 
	21: xchg %ax,%ax 
	22: xchg %ax,%ax 
thread 'x86::i686::sse::assert__mm_cvtpu16_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 23 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
---- x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps stdout ----
	disassembly for coresimd::x86::i686::sse::_mm_cvtpu8_ps: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: sub $0x8,%esp 
	 3: movl $0x0,-0x4(%ebp) 
	 4: movl $0x0,-0x8(%ebp) 
	 5: xorps %xmm0,%xmm0 
	 6: movq -0x8(%ebp),%mm1 
	 7: punpcklbw %mm1,%mm0 
	 8: pcmpgtw %mm0,%mm1 
	 9: movq %mm0,%mm2 
	10: punpckhwd %mm1,%mm2 
	11: punpcklwd %mm1,%mm0 
	12: cvtpi2ps %mm2,%xmm0 
	13: movlhps %xmm0,%xmm0 
	14: cvtpi2ps %mm0,%xmm0 
	15: add $0x8,%esp 
	16: pop %ebp 
	17: ret 
	18: xchg %ax,%ax 
	19: xchg %ax,%ax 
	20: xchg %ax,%ax 
	21: xchg %ax,%ax 
thread 'x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 22 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9

@alexcrichton
Copy link
Member

bummer :(

Want to add exceptions for them in stdsimd-test/src/lib.rs?

@MaloJaffre
Copy link
Contributor

MaloJaffre commented Jan 4, 2018

I noticed suspicious nop and xchg %ax,%ax (effectively no-ops) at the end of those fonctions
(after ret) and I think they must not be part of the instructions count.

@alexcrichton alexcrichton merged commit bf6d801 into rust-lang:master Jan 4, 2018
@alexcrichton
Copy link
Member

@MaloJaffre yeah I think that's mostly just the output of objdump, but adding a speific exception for these also seems ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants