Sub-optimal codegen for f32 tuples #32045
The reason for this is that we pass composite types less than a word in size as integers. In the vast majority of cases this results in much better code generation, so this is an unfortunate edge case where that isn't true. The only real thing I can see us doing here is passing two-field structs as two arguments, as long as both fit into registers, something we should probably be doing anyway.
Would it make sense to introduce a special case for composite types made up only of floating point types? They could be passed inside SIMD registers.
Yes, this should get passed as a 2xf32 vector IIRC. Same thing that the C
Hey, the situation here has changed slightly in the meantime. Without inlining the compiler (nightly on playpen) still crams the two
I would like to take this issue and see if I can fix it. My proposal is to just make an exception for aggregate types containing (only) floating point types when casting to an integer type. I am, however, not sure what to do with heterogeneous aggregate types (containing both
I have now looked at this issue for some time. I think "Rust" functions could be adjusted to follow the "C" ABI (code), which would fix the sub-optimal codegen. This, however, has some side effects which I do not have the experience to fix. cc @eddyb
Use ty::layout for ABI computation instead of LLVM types. This is the first step in creating a backend-agnostic library for computing call ABI details from signatures. I wanted to open the PR *before* attempting to move `cabi_*` from trans to avoid rebase churn in #39999. **EDIT**: As I suspected, #39999 needs this PR to fully work (see #39999 (comment)).

The first 3 commits add more APIs to `ty::layout` and replace non-ABI uses of `sizing_type_of`. These APIs are probably usable by other backends, and miri too (cc @stoklund @solson). The last commit rewrites `rustc_trans::cabi_*` to use `ty::layout` and new `rustc_trans::abi` APIs.

Also, during the process, a couple of trivial bugs were identified and fixed:
* `msp430`, `nvptx`, `nvptx64`: type sizes *in bytes* were compared with `32` and `64`
* `x86` (`fastcall`): `f64` was incorrectly not treated the same way as `f32`

Although not urgent, this PR also uses the more general "homogeneous aggregate" logic to fix #32045.
Similar to the behavior observed in #32031, tuples of `f32` and `f64` seem to be passed to functions in GPRs.

The `f32` tuple takes an especially large hit, since the two `f32` are passed inside a single 64-bit GPR and have to be extracted and compressed via `shift` and `or` instructions. Even with inlining turned on, this does not go away.

The `f64` tuple is not as bad as the `f32` tuple. Without inlining it does some `move`s to and from the SIMD registers, and with inlining turned on, the tuple is kept in a SIMD register and the loop is vectorized and unrolled.

EDIT: Forgot to link to the code example on playpen.