
x64: Add lea-based lowering for iadd #5986

Merged 3 commits into bytecodealliance:main on Mar 15, 2023
Conversation

@alexcrichton (Member):

This commit adds a rule for the lowering of `iadd` to use `lea` for 32-
and 64-bit addition. The theoretical benefit of `lea` over the `add`
instruction is that the `lea` variant can emulate a 3-operand
instruction which doesn't destructively modify one of its operands.
Additionally the `lea` operation can fold in other components such as
constant additions and shifts.

In practice, however, if `lea` is unconditionally used when lowering `iadd`
it ends up losing 10% performance on a local `meshoptimizer` benchmark.
My best guess as to what's going on here is that my CPU's dedicated
units for address computation are all overloaded while the ALUs are
basically idle in a memory-intensive loop. Previously, when the ALU was
used for `add` and the address units for stores/loads, things in theory
pipelined better (most of this is me shooting in the dark). To
prevent the performance loss here I've updated the lowering of `iadd` to
conditionally use either `lea` or `add` depending on
how "complicated" the `Amode` is. Simple ones like `a + b` or `a + $imm`
continue to use `add` (along with the hypothetical extra `mov` into the
result that it may require). More complicated ones like `a + b + $imm` or
`a + (b << c) + $imm` use `lea` as it can remove the need for extra
instructions. Locally at least this fixes the performance loss relative
to unconditionally using `lea`.
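To make the split concrete, here is a minimal Rust sketch of the heuristic described above; the `Amode` shapes and the `prefer_lea` function are illustrative stand-ins, not the actual Cranelift definitions:

```rust
// Illustrative addressing-mode shapes; the real x64 backend's `Amode` differs.
enum Amode {
    RegReg { base: u8, index: u8 },                           // a + b
    RegImm { base: u8, imm: i32 },                            // a + $imm
    RegRegImm { base: u8, index: u8, imm: i32 },              // a + b + $imm
    RegShiftImm { base: u8, index: u8, shift: u8, imm: i32 }, // a + (b << c) + $imm
}

// Pick `lea` only when the Amode folds work that `add` would need extra
// instructions for; simple shapes keep using the plain ALU `add`.
fn prefer_lea(amode: &Amode) -> bool {
    match amode {
        Amode::RegReg { .. } | Amode::RegImm { .. } => false,
        Amode::RegRegImm { .. } | Amode::RegShiftImm { .. } => true,
    }
}
```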

One note is that this adds an `OperandSize` argument to the
`MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit
`lea` in addition to the preexisting 64-bit encoding.
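As a rough sketch of the shape of that change (stand-in types below; the actual Cranelift definitions differ):

```rust
// Stand-in types for illustration only.
enum OperandSize { Size32, Size64 }
struct Amode;       // x64 addressing mode: base + index*scale + offset
struct WritableGpr; // destination general-purpose register

enum Inst {
    // The variant now carries an operand size, selecting between the new
    // 32-bit and the preexisting 64-bit `lea` encoding at emission time.
    LoadEffectiveAddress {
        size: OperandSize,
        addr: Amode,
        dst: WritableGpr,
    },
}
```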

Additionally, this PR has a prior commit which is a "no functional changes intended" update to the `Amode` computation in the x64 backend, restructuring it to rely less on recursion and to avoid blowing the stack at compile time on very long chains of the `iadd` instruction.

@alexcrichton (Member, Author):

Oh, I should also mention that on the `meshoptimizer` program this improves performance by 5% for me on one of the benchmarks.

@github-actions bot added the cranelift, cranelift:area:x64, and isle labels on Mar 10, 2023

@fitzgen (Member) left a comment:
r=me with the below addressed

Review thread on cranelift/codegen/src/isa/x64/lower.isle (outdated; resolved):
;; Higher-priority cases than the previous two where a load can be sunk into
;; the add instruction itself. Note that both operands are tested for
;; sink-ability since addition is commutative
(rule -4 (lower (has_type (fits_in_64 ty)
@fitzgen (Member):

This says "higher-priority" but the actual priority given is less than the other cases. Something doesn't add up here.

@alexcrichton (Member, Author):

With a negative priority, though, I think this is higher than the prior two?

(I can switch to giving all positive priority too)
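(As a toy model of the priority semantics under discussion, in plain Rust rather than ISLE: overlapping rules are tried from numerically highest priority down, so -4 outranks -5 and -6 even though all are negative. Only the -4 comes from the snippet above; the rule names and the other priorities here are made up.)

```rust
fn main() {
    // (rule name, priority); only -4 is taken from the real code above.
    let mut rules = vec![("plain-add", -6), ("imm-add", -5), ("sinkable-load-add", -4)];
    // Try rules from numerically highest priority to lowest.
    rules.sort_by_key(|&(_, prio)| std::cmp::Reverse(prio));
    assert_eq!(rules[0].0, "sinkable-load-add"); // -4 wins over -5 and -6
}
```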

@alexcrichton (Member, Author):

Before I merge, one thing I'd like to explore is a strategy of "always use `lea`" but then, during emission, switching to `add` if the src/dst registers are the same. Something like an `AddOrLea` instruction which, depending on the results of regalloc, emits one of the two instructions. I'm interested to compare its performance to this PR's, and it more easily leaves the door open to a future tweak in regalloc's constraints where something could be specified like "please try to make these two the same".
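A minimal sketch of that idea in Rust (the names here are hypothetical, not Cranelift's): the choice is deferred until after register allocation, when the concrete registers are known.

```rust
type Reg = &'static str; // stand-in for an allocated physical register

// Emit either `add` or `lea` for dst = src1 + src2, depending on what
// regalloc picked; "emission" is textual to keep the sketch self-contained.
fn emit_add_or_lea(dst: Reg, src1: Reg, src2: Reg, out: &mut Vec<String>) {
    if dst == src1 {
        out.push(format!("add {dst}, {src2}")); // two-operand form suffices
    } else if dst == src2 {
        out.push(format!("add {dst}, {src1}")); // addition is commutative
    } else {
        out.push(format!("lea {dst}, [{src1} + {src2}]")); // avoids an extra mov
    }
}
```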

This commit replaces the previous computation of `Amode` with a
different set of rules that are intended to achieve the same purpose but
are structured differently. The motivation for this commit becomes
relevant in the next commit, where `lea` will sometimes be used for the
`iadd` instruction on x64. Doing so caused a stack
overflow in the test suite during the compilation phase of a wasm
module, namely as part of the `amode_add` function. This function is
recursively defined in terms of itself and recurses as deep as the
deepest `iadd`-chain in a program. A particular test in our test suite
has a 10k-long chain of `iadd` which ended up causing a stack overflow
in debug mode.

This stack overflow is caused because the `amode_add` helper in ISLE
unconditionally peels all the `iadd` nodes away and looks at all of
them, even if most end up in intermediate registers along the way. Given
that structure I couldn't find a way to easily abort the recursion. The
new `to_amode` helper is structured in a similar fashion but attempts to
instead only recurse far enough to fold items into the final `Amode`
instead of recursing through items which themselves don't end up in the
`Amode`. Put another way previously the `amode_add` helper might emit
`x64_add` instructions, but it no longer does that.

The goal of this commit is to preserve all the original `Amode`
optimizations, however. For some parts, though, it relies more on egraph
optimizations having run, since if an `iadd` chain is 10k deep it doesn't
try to find a constant buried 9k levels inside to fold into the `Amode`.
The hope is that with egraphs having already run, constants have mostly
been shuffled to the right and folded together wherever possible.
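A simplified model of the hazard and the fix described above, in plain Rust rather than the ISLE helpers (names are illustrative): a recursive walk recurses once per `iadd` link, so a 10k-long chain means 10k nested calls, while an iterative fold that stops at the first link that won't land in the `Amode` uses constant stack space.

```rust
// Toy IR: an `iadd` chain is a tree of `Add` nodes.
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Const(i64),
    Other, // anything that simply ends up in a register
}

// Iterative folding in the spirit of the new `to_amode`: peel `x + $const`
// links into a running offset and stop at the first link that wouldn't end
// up in the Amode, so chain depth never becomes call-stack depth.
fn fold_offset(mut e: &Expr) -> (&Expr, i64) {
    let mut offset = 0i64;
    while let Expr::Add(lhs, rhs) = e {
        match **rhs {
            Expr::Const(c) => {
                offset = offset.wrapping_add(c);
                e = &**lhs; // keep walking down the chain
            }
            // Don't recurse through links that land in registers anyway;
            // egraph optimization should have shuffled constants rightward.
            _ => break,
        }
    }
    (e, offset)
}
```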
@alexcrichton (Member, Author):

Ok I've pushed up a different way of doing this, namely deciding after regalloc whether to use `add` or `lea`. On the local `meshoptimizer` benchmark, the number that previously regressed when `lea` was always used shows no loss, and the number that previously improved by using `lea` still improves. Sightglass locally says:

compilation :: cycles :: benchmarks/bz2/benchmark.wasm
  lea.so is 1.05x to 1.06x faster than main.so!
execution :: cycles :: benchmarks/bz2/benchmark.wasm
  lea.so is 1.02x to 1.03x faster than main.so!
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  lea.so is 1.01x to 1.02x faster than main.so!
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  lea.so is 1.00x to 1.01x faster than main.so!

with all others as "no difference"

Given that, I'm tempted to go with this strategy instead and leave it open to, in the future, adding a form of regalloc constraint or heuristic that attempts to keep the src/dst here in the same register.

@alexcrichton alexcrichton requested a review from fitzgen March 15, 2023 02:10
@fitzgen (Member) left a comment:

Nice! Can't argue with those benchmark results for such a relatively simple change.

@fitzgen fitzgen added this pull request to the merge queue Mar 15, 2023
Merged via the queue into bytecodealliance:main with commit fcddb9c Mar 15, 2023
@alexcrichton alexcrichton deleted the lea branch March 15, 2023 17:52