
x64: Add lea-based lowering for iadd #5986

Merged 3 commits into bytecodealliance:main on Mar 15, 2023
Conversation

@alexcrichton (Member):

This commit adds a rule for the lowering of `iadd` to use `lea` for 32-
and 64-bit addition. The theoretical benefit of `lea` over the `add`
instruction is that the `lea` variant can emulate a 3-operand
instruction which doesn't destructively modify one of its operands.
Additionally the `lea` operation can fold in other components such as
constant additions and shifts.

In practice, however, if `lea` is unconditionally used when lowering `iadd`
it ends up losing 10% performance on a local `meshoptimizer` benchmark.
My best guess as to what's going on here is that my CPU's dedicated
units for address computation are all overloaded while the ALUs are
basically idle in a memory-intensive loop. Previously, when the ALU was
used for `add` and the address units for stores/loads, things in theory
pipelined better (most of this is me shooting in the dark). To
prevent the performance loss here I've updated the lowering of `iadd` to
conditionally use either `lea` or `add` depending on
how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm`
continue to use `add` (along with the hypothetical extra `mov` into the
result that it may require). More complicated ones like `a + b + $imm` or
`a + (b << c) + $imm` use `lea` as it can remove the need for extra
instructions. Locally at least this fixes the performance loss relative
to unconditionally using `lea`.
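To make the split concrete, here is a minimal Rust sketch of the heuristic described above; the `Amode` shapes and the `prefer_lea` function are illustrative stand-ins, not the actual Cranelift definitions:

```rust
// Illustrative addressing-mode shapes; the real x64 backend's `Amode` differs.
enum Amode {
    RegReg { base: u8, index: u8 },                           // a + b
    RegImm { base: u8, imm: i32 },                            // a + $imm
    RegRegImm { base: u8, index: u8, imm: i32 },              // a + b + $imm
    RegShiftImm { base: u8, index: u8, shift: u8, imm: i32 }, // a + (b << c) + $imm
}

// Pick `lea` only when the Amode folds work that `add` would need extra
// instructions for; simple shapes keep using the plain ALU `add`.
fn prefer_lea(amode: &Amode) -> bool {
    match amode {
        Amode::RegReg { .. } | Amode::RegImm { .. } => false,
        Amode::RegRegImm { .. } | Amode::RegShiftImm { .. } => true,
    }
}
```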

One note is that this adds an `OperandSize` argument to the
`MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit
`lea` in addition to the preexisting 64-bit encoding.
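As a rough sketch of the shape of that change (stand-in types below; the actual Cranelift definitions differ):

```rust
// Stand-in types for illustration only.
enum OperandSize { Size32, Size64 }
struct Amode;       // x64 addressing mode: base + index*scale + offset
struct WritableGpr; // destination general-purpose register

enum Inst {
    // The variant now carries an operand size, selecting between the new
    // 32-bit and the preexisting 64-bit `lea` encoding at emission time.
    LoadEffectiveAddress {
        size: OperandSize,
        addr: Amode,
        dst: WritableGpr,
    },
}
```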

Additionally, this PR has a prior commit which is a "no functional changes intended" update to the `Amode` computation in the x64 backend, restructuring it to rely less on recursion and to avoid blowing the stack at compile time on very long chains of the `iadd` instruction.

@alexcrichton (Member, Author):

Oh, I should also mention that on the `meshoptimizer` program this improves performance by 5% for me on one of the benchmarks.

@github-actions bot added the cranelift, cranelift:area:x64, and isle labels on Mar 10, 2023

@fitzgen (Member) left a comment:
r=me with the below addressed

Review thread on cranelift/codegen/src/isa/x64/lower.isle (outdated; resolved):
;; Higher-priority cases than the previous two where a load can be sunk into
;; the add instruction itself. Note that both operands are tested for
;; sink-ability since addition is commutative
(rule -4 (lower (has_type (fits_in_64 ty)
@fitzgen (Member):

This says "higher-priority" but the actual priority given is less than the other cases. Something doesn't add up here.

@alexcrichton (Member, Author):

With a negative priority, though, I think this is higher than the prior two?

(I can switch to giving all positive priority too)
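(As a toy model of the priority semantics under discussion, in plain Rust rather than ISLE: overlapping rules are tried from numerically highest priority down, so -4 outranks -5 and -6 even though all are negative. Only the -4 comes from the snippet above; the rule names and the other priorities here are made up.)

```rust
fn main() {
    // (rule name, priority); only -4 is taken from the real code above.
    let mut rules = vec![("plain-add", -6), ("imm-add", -5), ("sinkable-load-add", -4)];
    // Try rules from numerically highest priority to lowest.
    rules.sort_by_key(|&(_, prio)| std::cmp::Reverse(prio));
    assert_eq!(rules[0].0, "sinkable-load-add"); // -4 wins over -5 and -6
}
```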

@alexcrichton (Member, Author):

Before I merge, one thing I'd like to explore is a strategy of "always use `lea`" but then, during emission, switching to `add` if the src/dst registers are the same. Something like an `AddOrLea` instruction which, depending on the results of regalloc, emits one of the two instructions. I'm interested to compare its performance to this PR's, and it more easily leaves the door open to a future tweak in regalloc's constraints where something could be specified like "please try to make these two the same".
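A minimal sketch of that idea in Rust (the names here are hypothetical, not Cranelift's): the choice is deferred until after register allocation, when the concrete registers are known.

```rust
type Reg = &'static str; // stand-in for an allocated physical register

// Emit either `add` or `lea` for dst = src1 + src2, depending on what
// regalloc picked; "emission" is textual to keep the sketch self-contained.
fn emit_add_or_lea(dst: Reg, src1: Reg, src2: Reg, out: &mut Vec<String>) {
    if dst == src1 {
        out.push(format!("add {dst}, {src2}")); // two-operand form suffices
    } else if dst == src2 {
        out.push(format!("add {dst}, {src1}")); // addition is commutative
    } else {
        out.push(format!("lea {dst}, [{src1} + {src2}]")); // avoids an extra mov
    }
}
```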

This commit replaces the previous computation of `Amode` with a
different set of rules that are intended to achieve the same purpose but
are structured differently. The motivation for this commit becomes
relevant in the next commit, where `lea` will sometimes be used for the
`iadd` instruction on x64. Doing so caused a stack
overflow in the test suite during the compilation phase of a wasm
module, namely as part of the `amode_add` function. This function is
recursively defined in terms of itself and recurses as deep as the
deepest `iadd`-chain in a program. A particular test in our test suite
has a 10k-long chain of `iadd` which ended up causing a stack overflow
in debug mode.

This stack overflow is caused because the `amode_add` helper in ISLE
unconditionally peels all the `iadd` nodes away and looks at all of
them, even if most end up in intermediate registers along the way. Given
that structure I couldn't find a way to easily abort the recursion. The
new `to_amode` helper is structured in a similar fashion but attempts to
instead only recurse far enough to fold items into the final `Amode`
instead of recursing through items which themselves don't end up in the
`Amode`. Put another way previously the `amode_add` helper might emit
`x64_add` instructions, but it no longer does that.

The goal of this commit is to preserve all the original `Amode`
optimizations, however. For some parts, though, it relies more on egraph
optimizations having run, since if an `iadd` chain is 10k deep it doesn't
try to find a constant buried 9k levels inside to fold into the `Amode`.
The hope is that with egraphs having already run, constants have mostly
been shuffled to the right and folded together wherever possible.
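A simplified model of the hazard and the fix described above, in plain Rust rather than the ISLE helpers (names are illustrative): a recursive walk recurses once per `iadd` link, so a 10k-long chain means 10k nested calls, while an iterative fold that stops at the first link that won't land in the `Amode` uses constant stack space.

```rust
// Toy IR: an `iadd` chain is a tree of `Add` nodes.
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Const(i64),
    Other, // anything that simply ends up in a register
}

// Iterative folding in the spirit of the new `to_amode`: peel `x + $const`
// links into a running offset and stop at the first link that wouldn't end
// up in the Amode, so chain depth never becomes call-stack depth.
fn fold_offset(mut e: &Expr) -> (&Expr, i64) {
    let mut offset = 0i64;
    while let Expr::Add(lhs, rhs) = e {
        match **rhs {
            Expr::Const(c) => {
                offset = offset.wrapping_add(c);
                e = &**lhs; // keep walking down the chain
            }
            // Don't recurse through links that land in registers anyway;
            // egraph optimization should have shuffled constants rightward.
            _ => break,
        }
    }
    (e, offset)
}
```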
@alexcrichton (Member, Author):

Ok I've pushed up a different way of doing this, namely deciding after regalloc whether to use `add` or `lea`. On the local `meshoptimizer` benchmark, the number that previously regressed when `lea` was always used shows no loss, and the number that previously improved by using `lea` still improves. Sightglass locally says:

compilation :: cycles :: benchmarks/bz2/benchmark.wasm
  lea.so is 1.05x to 1.06x faster than main.so!
execution :: cycles :: benchmarks/bz2/benchmark.wasm
  lea.so is 1.02x to 1.03x faster than main.so!
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  lea.so is 1.01x to 1.02x faster than main.so!
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  lea.so is 1.00x to 1.01x faster than main.so!

with all others as "no difference"

Given that, I'm tempted to go with this strategy instead and leave it open to, in the future, adding a form of regalloc constraint or heuristic that attempts to keep the src/dst here in the same register.

@alexcrichton alexcrichton requested a review from fitzgen March 15, 2023 02:10
@fitzgen (Member) left a comment:

Nice! Can't argue with those benchmark results for such a relatively simple change.

@fitzgen fitzgen added this pull request to the merge queue Mar 15, 2023
Merged via the queue into bytecodealliance:main with commit fcddb9c Mar 15, 2023
@alexcrichton alexcrichton deleted the lea branch March 15, 2023 17:52