x64: Add lea-based lowering for iadd #5986
Conversation
Oh I should also mention that on the …
r=me with the below addressed
```
;; Higher-priority cases than the previous two where a load can be sunk into
;; the add instruction itself. Note that both operands are tested for
;; sink-ability since addition is commutative.
(rule -4 (lower (has_type (fits_in_64 ty) …
```
This says "higher-priority" but the actual priority given is less than the other cases. Something doesn't add up here.
With a negative priority, though, I think this is higher than the prior two?
(I can switch to giving them all positive priorities too.)
Before I merge this, one thing I'd like to explore is trying a strategy of "always use `lea`" …
This commit replaces the previous computation of `Amode` with a different set of rules that are intended to achieve the same purpose but are structured differently. The motivation for this commit will become more relevant in the next commit, where `lea` may be used for the `iadd` instruction on x64.

Doing so caused a stack overflow in the test suite during the compilation phase of a wasm module, namely as part of the `amode_add` function. This function is recursively defined in terms of itself and recurses as deep as the deepest `iadd`-chain in a program. A particular test in our test suite has a 10k-long chain of `iadd`, which ended up causing a stack overflow in debug mode. This stack overflow is caused by the `amode_add` helper in ISLE unconditionally peeling away all the `iadd` nodes and looking at all of them, even if most end up in intermediate registers along the way. Given that structure I couldn't find a way to easily abort the recursion.

The new `to_amode` helper is structured in a similar fashion but attempts to recurse only far enough to fold items into the final `Amode`, instead of recursing through items which themselves don't end up in the `Amode`. Put another way, previously the `amode_add` helper might emit `x64_add` instructions, but it no longer does that.

The goal of this commit is to preserve all the original `Amode` optimizations, however. For some parts, though, it relies more on egraph optimizations having run: if an `iadd` is 10k deep it doesn't try to find a constant buried 9k levels inside to fold into the `Amode`. The hope is that, with egraphs having already run, constants have been shuffled to the right most of the time and anything foldable has already been folded together.
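For illustration, here is a rough Rust sketch of the structural difference; `Node`, `Amode`, and `put_in_reg` are hypothetical stand-ins (the real logic lives in the backend's ISLE rules), not the actual implementation:

```rust
// Toy model: `Node` stands in for clif values and `put_in_reg` for
// "lower this sub-expression via the normal path (possibly as ordinary
// `add`s) and hand back its register".
enum Node {
    Reg(u32),
    Const(i32),
    Add(Box<Node>, Box<Node>),
}

struct Amode {
    base: Option<u32>,
    index: Option<u32>,
    offset: i32,
}

fn put_in_reg(_n: &Node) -> u32 {
    0 // placeholder register
}

// In the spirit of the new `to_amode`: match a few fixed, shallow shapes
// and put everything else in a register. The old `amode_add`-style
// helper instead recursed through nested adds unconditionally, so a
// 10k-long `iadd` chain meant ~10k stack frames.
fn to_amode(n: &Node) -> Amode {
    match n {
        Node::Add(a, b) => match (&**a, &**b) {
            // x + $imm (either side): fold the constant into the offset
            (x, Node::Const(c)) | (Node::Const(c), x) => Amode {
                base: Some(put_in_reg(x)),
                index: None,
                offset: *c,
            },
            // x + y: both halves become plain registers; no deeper
            // peeling, so a constant buried further down is left for the
            // egraph pass to have surfaced already
            (x, y) => Amode {
                base: Some(put_in_reg(x)),
                index: Some(put_in_reg(y)),
                offset: 0,
            },
        },
        // anything that isn't an add is just a base register
        other => Amode {
            base: Some(put_in_reg(other)),
            index: None,
            offset: 0,
        },
    }
}
```

The key property is that recursion stops at these fixed patterns rather than following the `iadd` chain all the way down, which bounds the stack depth regardless of how the input program is shaped.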
This commit adds a rule for the lowering of `iadd` to use `lea` for 32- and 64-bit addition. The theoretical benefit of `lea` over the `add` instruction is that the `lea` variant can emulate a 3-operand instruction which doesn't destructively modify one of its operands. Additionally the `lea` operation can fold in other components such as constant additions and shifts.

In practice, however, if `lea` is unconditionally used when lowering `iadd` it ends up losing 10% performance on a local `meshoptimizer` benchmark. My best guess as to what's going on here is that my CPU's dedicated units for address computation are all overloaded while the ALUs are basically idle in a memory-intensive loop. Previously, when the ALU was used for `add` and the address units for stores/loads, things in theory pipelined better (most of this is me shooting in the dark).

To prevent the performance loss here I've updated the lowering of `iadd` to conditionally use either `lea` or `add` depending on how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm` continue to use `add` (plus the subsequent hypothetical extra `mov` needed for the result). More complicated ones like `a + b + $imm` or `a + b << c + $imm` use `lea`, as it can remove the need for extra instructions. Locally at least this fixes the performance loss relative to unconditionally using `lea`.

One note is that this adds an `OperandSize` argument to the `MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit `lea` in addition to the preexisting 64-bit encoding.
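As a sketch of that heuristic (the shape names loosely mirror the backend's `Amode` variants, but this is illustrative, not the actual lowering code):

```rust
// Hypothetical split between "plain add is fine" and "lea pays for
// itself", keyed off how many components the address expression has.
enum AmodeShape {
    /// base + simm32, i.e. `a` or `a + $imm`
    ImmReg { simm32: i32 },
    /// base + (index << shift) + simm32, i.e. anything from `a + b`
    /// up through `a + b << c + $imm`
    ImmRegRegShift { simm32: i32, shift: u8 },
}

fn prefer_lea(shape: &AmodeShape) -> bool {
    match shape {
        // `a` or `a + $imm`: a two-operand `add` (plus at worst one
        // `mov`) is just as short and keeps the address units free
        AmodeShape::ImmReg { .. } => false,
        // bare `a + b`: same story, the plain `add` suffices
        AmodeShape::ImmRegRegShift { simm32: 0, shift: 0 } => false,
        // `a + b + $imm`, `a + (b << c) + $imm`, ...: one `lea` replaces
        // a shift and/or extra adds, so the three-operand form wins
        AmodeShape::ImmRegRegShift { .. } => true,
    }
}
```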
Ok I've pushed up a different way of doing this, namely deciding after regalloc whether to use `lea` or `add`.
… with all others as "no difference". Given that, I'm tempted to go with this strategy instead and leave it open to adding, in the future, a form of regalloc constraint or heuristic that attempts to keep the src/dst here in the same register.
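A minimal sketch of that post-regalloc decision, with `Reg`, `Inst`, and `OperandSize` as hypothetical stand-ins rather than the backend's real types (the real choice happens at emission time):

```rust
#[derive(Clone, Copy, PartialEq)]
struct Reg(u8);

#[derive(Clone, Copy)]
enum OperandSize {
    Size32,
    Size64,
}

enum Inst {
    // dst += src (destructive two-operand form)
    Add { size: OperandSize, dst: Reg, src: Reg },
    // dst = a + b (non-destructive three-operand form)
    Lea { size: OperandSize, dst: Reg, a: Reg, b: Reg },
}

// Once regalloc has assigned concrete registers, pick the cheaper form.
fn emit_iadd(size: OperandSize, dst: Reg, a: Reg, b: Reg) -> Inst {
    if dst == a {
        // regalloc already placed one source in the destination, so the
        // plain two-operand `add` costs nothing extra
        Inst::Add { size, dst, src: b }
    } else if dst == b {
        // addition is commutative, so the mirrored assignment works too
        Inst::Add { size, dst, src: a }
    } else {
        // three distinct registers: `lea dst, [a + b]` avoids the `mov`
        // that a destructive `add` would require
        Inst::Lea { size, dst, a, b }
    }
}
```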
Nice! Can't argue with those benchmark results for such a relatively simple change.
Additionally this PR has a prior commit which is a "no functional changes intended" update to the `Amode` computation in the x64 backend to rely less on recursion and avoid blowing the stack at compile time for very long chains of the `iadd` instruction.