Completes SSE and adds some MMX intrinsics #247
How come the instruction limit was increased? And how come some of the mmx intrinsics use the sse feature? |
The instruction limit was increased because LLVM generates many
instructions for 32-bit targets. Why this is the case, I don't know, but the
resulting code contains the intrinsic and works.
I don't know why SSE is required; without it LLVM fails (or failed) to generate
the code. This was already the case for the previously implemented MMX
intrinsics (it might be that MMX wasn't whitelisted before and that I never
got to retry once it became whitelisted; there is a FIXME for this, but
somebody has to investigate it).
…On Fri 22. Dec 2017 at 17:15, Alex Crichton ***@***.***> wrote:
How come the instruction limit was increased? And how come some of the mmx
intrinsics use the sse feature?
|
Can you gist the code that LLVM generates? Are there any clues why it's generating so much code? Could we try removing the |
1c011af
to
fb28c3d
Compare
So the |
These are the disassemblies:
__mm_cvtpi8_ps_cvtpi2ps
---- x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps stdout ----
disassembly for coresimd::x86::i686::sse::_mm_cvtpi8_ps:
0: push %ebp
1: mov %esp,%ebp
2: sub $0x18,%esp
3: call 28eeb <_ZN8coresimd3x864i6863sse13_mm_cvtpi8_ps17h8142cdf4acba5c00E+0xb>
4: pop %eax
5: add $0x169dbd,%eax
6: pand -0x84f18(%eax),%xmm0
7:
8: packuswb %xmm0,%xmm0
9: movq %xmm0,-0x8(%ebp)
10: pxor %xmm0,%xmm0
11: movq -0x8(%ebp),%mm0
12: movq %xmm0,-0x10(%ebp)
13: movq -0x10(%ebp),%mm1
14: movq %xmm0,-0x18(%ebp)
15: movq -0x18(%ebp),%mm3
16: pcmpgtb %mm0,%mm1
17: punpcklbw %mm1,%mm0
18: pcmpgtw %mm0,%mm3
19: movq %mm0,%mm2
20: punpckhwd %mm3,%mm2
21: punpcklwd %mm3,%mm0
22: cvtpi2ps %mm2,%xmm0
23: movlhps %xmm0,%xmm0
24: cvtpi2ps %mm0,%xmm0
25: add $0x18,%esp
26: pop %ebp
27: ret
28: xchg %ax,%ax
__mm_cvtpu8_ps_cvtpi2ps
---- x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps stdout ----
disassembly for coresimd::x86::i686::sse::_mm_cvtpu8_ps:
0: push %ebp
1: mov %esp,%ebp
2: sub $0x18,%esp
3: call 28f8b <_ZN8coresimd3x864i6863sse13_mm_cvtpu8_ps17h02bd6822e8032715E+0xb>
4: pop %eax
5: add $0x169d1d,%eax
6: pand -0x84f18(%eax),%xmm0
7:
8: packuswb %xmm0,%xmm0
9: movq %xmm0,-0x8(%ebp)
10: pxor %xmm0,%xmm0
11: movq -0x8(%ebp),%mm0
12: movq %xmm0,-0x10(%ebp)
13: punpcklbw -0x10(%ebp),%mm0
14: movq %xmm0,-0x18(%ebp)
15: movq -0x18(%ebp),%mm1
16: movq %mm0,%mm2
17: pcmpgtw %mm0,%mm1
18: punpckhwd %mm1,%mm2
19: punpcklwd %mm1,%mm0
20: cvtpi2ps %mm2,%xmm0
21: movlhps %xmm0,%xmm0
22: cvtpi2ps %mm0,%xmm0
23: add $0x18,%esp
24: pop %ebp
25: ret
26: xchg %ax,%ax
27: xchg %ax,%ax
28: xchg %ax,%ax
29: xchg %ax,%ax
30: nop |
Those disassemblies indicate bugs to me, rather than suggesting that we should raise the instruction limit. In the first one the call to |
I've added a check for failing inlining to see if these are the only places where this happens (blame says that this limit was 20 before, but was raised to 30 here: f3f5a9c#diff-4e26d84d43aa89efbb3b16f299e18ec7R274). Inlining is also broken in the following places:
|
Oh dear that sounds bad! Want some help in fixing those? |
I could move these intrinsics to the It is not obvious to me why this is the case, are we hitting rust-lang/rust#44367 ? |
I've fixed a number of assertions about I think in general though on x86 the 64-bit vector types that aren't |
db38067
to
875c926
Compare
MMX:
- `_mm_cmpgt_pi{8,16,32}`
- `_mm_unpack{hi,lo}_pi{8,16,32}`
SSE (is now complete):
- `_mm_cvtp{i,u}{8,16}_ps`
- add test for `_m_pmulhuw`
4 intrinsics are still failing because they require 23, 22 (2x), and 21 instructions. |
Are they MMX things that aren't taking __m64? If so, that's probably the fix |
All The failing SSE tests are (with Intel Intrinsics Guide definitions):
Their code is:
/// Converts the lower 4 signed 8-bit values of `a` into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpi8_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_cmpgt_pi8(b, a);
    let b = mmx::_mm_unpacklo_pi8(a, b);
    _mm_cvtpi16_ps(b)
}
/// Converts the lower 4 unsigned 8-bit values of `a` into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpu8_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_unpacklo_pi8(a, b);
    _mm_cvtpi16_ps(b)
}
/// Converts a 64-bit vector of `i16`s into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpi16_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let b = mmx::_mm_cmpgt_pi16(b, a);
    let c = mmx::_mm_unpackhi_pi16(a, b);
    let r = i586::_mm_setzero_ps();
    let r = cvtpi2ps(r, c);
    let r = i586::_mm_movelh_ps(r, r);
    let c = mmx::_mm_unpacklo_pi16(a, b);
    cvtpi2ps(r, c)
}
/// Converts a 64-bit vector of `u16`s into a 128-bit vector of 4 `f32`s.
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(cvtpi2ps))]
pub unsafe fn _mm_cvtpu16_ps(a: __m64) -> f32x4 {
    let b = mmx::_mm_setzero_si64();
    let c = mmx::_mm_unpackhi_pi16(a, b);
    let r = i586::_mm_setzero_ps();
    let r = cvtpi2ps(r, c);
    let r = i586::_mm_movelh_ps(r, r);
    let c = mmx::_mm_unpacklo_pi16(a, b);
    cvtpi2ps(r, c)
}
It might just be that these functions are longer than 20 instructions on 32-bit. The disassembly doesn't show any
---- x86::i686::sse::assert__mm_cvtpi16_ps_cvtpi2ps stdout ----
disassembly for coresimd::x86::i686::sse::_mm_cvtpi16_ps:
0: push %ebp
1: mov %esp,%ebp
2: sub $0x8,%esp
3: movl $0x0,-0x4(%ebp)
4: movl $0x0,-0x8(%ebp)
5: movq %mm0,%mm2
6: xorps %xmm0,%xmm0
7: movq -0x8(%ebp),%mm1
8: pcmpgtw %mm0,%mm1
9: punpckhwd %mm1,%mm2
10: punpcklwd %mm1,%mm0
11: cvtpi2ps %mm2,%xmm0
12: movlhps %xmm0,%xmm0
13: cvtpi2ps %mm0,%xmm0
14: add $0x8,%esp
15: pop %ebp
16: ret
17: xchg %ax,%ax
18: xchg %ax,%ax
19: xchg %ax,%ax
20: xchg %ax,%ax
21: xchg %ax,%ax
22: nop
thread 'x86::i686::sse::assert__mm_cvtpi16_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 23 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
---- x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps stdout ----
disassembly for coresimd::x86::i686::sse::_mm_cvtpi8_ps:
0: push %ebp
1: mov %esp,%ebp
2: sub $0x8,%esp
3: movl $0x0,-0x4(%ebp)
4: movl $0x0,-0x8(%ebp)
5: xorps %xmm0,%xmm0
6: movq -0x8(%ebp),%mm1
7: movq %mm1,%mm2
8: pcmpgtb %mm0,%mm2
9: punpcklbw %mm2,%mm0
10: pcmpgtw %mm0,%mm1
11: movq %mm0,%mm2
12: punpckhwd %mm1,%mm2
13: punpcklwd %mm1,%mm0
14: cvtpi2ps %mm2,%xmm0
15: movlhps %xmm0,%xmm0
16: cvtpi2ps %mm0,%xmm0
17: add $0x8,%esp
18: pop %ebp
19: ret
20: xchg %ax,%ax
thread 'x86::i686::sse::assert__mm_cvtpi8_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 21 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
---- x86::i686::sse::assert__mm_cvtpu16_ps_cvtpi2ps stdout ----
disassembly for coresimd::x86::i686::sse::_mm_cvtpu16_ps:
0: push %ebp
1: mov %esp,%ebp
2: sub $0x8,%esp
3: movl $0x0,-0x4(%ebp)
4: movl $0x0,-0x8(%ebp)
5: movq %mm0,%mm2
6: xorps %xmm0,%xmm0
7: movq -0x8(%ebp),%mm1
8: punpckhwd %mm1,%mm2
9: punpcklwd %mm1,%mm0
10: cvtpi2ps %mm2,%xmm0
11: movlhps %xmm0,%xmm0
12: cvtpi2ps %mm0,%xmm0
13: add $0x8,%esp
14: pop %ebp
15: ret
16: xchg %ax,%ax
17: xchg %ax,%ax
18: xchg %ax,%ax
19: xchg %ax,%ax
20: xchg %ax,%ax
21: xchg %ax,%ax
22: xchg %ax,%ax
thread 'x86::i686::sse::assert__mm_cvtpu16_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 23 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9
---- x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps stdout ----
disassembly for coresimd::x86::i686::sse::_mm_cvtpu8_ps:
0: push %ebp
1: mov %esp,%ebp
2: sub $0x8,%esp
3: movl $0x0,-0x4(%ebp)
4: movl $0x0,-0x8(%ebp)
5: xorps %xmm0,%xmm0
6: movq -0x8(%ebp),%mm1
7: punpcklbw %mm1,%mm0
8: pcmpgtw %mm0,%mm1
9: movq %mm0,%mm2
10: punpckhwd %mm1,%mm2
11: punpcklwd %mm1,%mm0
12: cvtpi2ps %mm2,%xmm0
13: movlhps %xmm0,%xmm0
14: cvtpi2ps %mm0,%xmm0
15: add $0x8,%esp
16: pop %ebp
17: ret
18: xchg %ax,%ax
19: xchg %ax,%ax
20: xchg %ax,%ax
21: xchg %ax,%ax
thread 'x86::i686::sse::assert__mm_cvtpu8_ps_cvtpi2ps' panicked at 'instruction found, but the disassembly contains too many instructions: #instructions = 22 >= 20 (limit)', stdsimd-test/src/lib.rs:367:9 |
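For readers following the `pcmpgtb` / `punpcklbw` pairs in the listings above: the widening step in `_mm_cvtpi8_ps` and `_mm_cvtpu8_ps` can be modeled in plain scalar Rust. This is only an illustrative sketch, not code from the PR. The helper names (`cmpgt_pi8`, `unpacklo_pi8`, `sign_extend_lo`, `zero_extend_lo`) are made up for this example, and arrays stand in for `__m64`. Comparing zero against `a` gives an all-ones byte for every negative lane, so interleaving `a` with that mask sign-extends each byte to 16 bits, while interleaving with zero gives the unsigned widening.

```rust
/// Scalar model of `pcmpgtb`: a lane is -1 (all bits set) when `a > b`, else 0.
/// Hypothetical helper, not the coresimd implementation.
fn cmpgt_pi8(a: [i8; 8], b: [i8; 8]) -> [i8; 8] {
    let mut r = [0i8; 8];
    for i in 0..8 {
        r[i] = if a[i] > b[i] { -1 } else { 0 };
    }
    r
}

/// Scalar model of `punpcklbw`: interleave the low four bytes of `a` and `b`.
/// Each 16-bit result lane has `a[i]` as its low byte and `b[i]` as its high byte.
fn unpacklo_pi8(a: [i8; 8], b: [i8; 8]) -> [i16; 4] {
    let mut r = [0i16; 4];
    for i in 0..4 {
        let lo = a[i] as u8 as u16;
        let hi = b[i] as u8 as u16;
        r[i] = ((hi << 8) | lo) as i16;
    }
    r
}

/// Widening used by the signed variant: the mask is all-ones where `a[i] < 0`.
fn sign_extend_lo(a: [i8; 8]) -> [i16; 4] {
    let mask = cmpgt_pi8([0; 8], a);
    unpacklo_pi8(a, mask)
}

/// Widening used by the unsigned variant: the high byte is always zero.
fn zero_extend_lo(a: [i8; 8]) -> [i16; 4] {
    unpacklo_pi8(a, [0; 8])
}

fn main() {
    let a = [-1i8, 2, -128, 127, 0, 0, 0, 0];
    println!("signed:   {:?}", sign_extend_lo(a));
    println!("unsigned: {:?}", zero_extend_lo(a));
}
```

The 16-bit variants repeat the same trick with `pcmpgtw` and `punpck{l,h}wd` before the two `cvtpi2ps` conversions, which accounts for most of the instructions in each listing.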
bummer :( Want to add exceptions for them in stdsimd-test/src/lib.rs? |
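A rough sketch of what such per-intrinsic exceptions could look like, for discussion only. This is not the actual code in stdsimd-test/src/lib.rs: the function names, the exception list, and the raised limit of 25 are all assumptions made for this example. The default of 20 and the `>=` comparison are taken from the panic messages above. Note that the count in those logs includes trailing padding lines (the 23-line `_mm_cvtpi16_ps` listing ends in `xchg %ax,%ax` and `nop`), so the sketch counts every non-empty line too.

```rust
/// Default limit, as reported in the panic messages above.
const DEFAULT_LIMIT: usize = 20;

/// Hypothetical exception table: the four intrinsics that fail on i686.
/// The value 25 is an assumption chosen only so that the 21 to 23
/// instructions seen in the logs pass.
fn instruction_limit(intrinsic: &str) -> usize {
    match intrinsic {
        "_mm_cvtpi16_ps" | "_mm_cvtpu16_ps" | "_mm_cvtpi8_ps" | "_mm_cvtpu8_ps" => 25,
        _ => DEFAULT_LIMIT,
    }
}

/// Counts every non-empty disassembly line, padding included.
fn count_instructions(disassembly: &str) -> usize {
    disassembly.lines().filter(|l| !l.trim().is_empty()).count()
}

/// Checks that `expected` appears as a mnemonic and that the function
/// stays under its limit. Lines are assumed to be "mnemonic operands".
fn check(intrinsic: &str, expected: &str, disassembly: &str) -> Result<(), String> {
    let found = disassembly
        .lines()
        .any(|l| l.split_whitespace().next() == Some(expected));
    if !found {
        return Err(format!("instruction `{}` not found", expected));
    }
    let n = count_instructions(disassembly);
    let limit = instruction_limit(intrinsic);
    if n >= limit {
        return Err(format!("too many instructions: {} >= {} (limit)", n, limit));
    }
    Ok(())
}

fn main() {
    // Synthetic 23-line disassembly: one real instruction plus padding.
    let mut lines = vec!["cvtpi2ps %mm0,%xmm0".to_string()];
    lines.extend(std::iter::repeat("nop".to_string()).take(22));
    let dis = lines.join("\n");
    println!("{:?}", check("_mm_cvtpu16_ps", "cvtpi2ps", &dis));
    println!("{:?}", check("_mm_add_ps", "cvtpi2ps", &dis));
}
```

An explicit table keeps the default strict for every other intrinsic, so the inlining failures mentioned earlier in the thread would still be caught.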
I noticed suspicious |
@MaloJaffre yeah I think that's mostly just the output of |
MMX:
_mm_cmpgt_pi{8,16,32}
_mm_unpack{hi,lo}_pi{8,16,32}
SSE (is now complete):
_mm_cvtp{i,u}{8,16}_ps
_m_pmulhuw