Cranelift aarch64 backend: use constant pool to defer constant emission to the end, unless out of range #1549

cfallin · 2020-04-17T20:13:16Z

Currently, to avoid issues with (i) the PC-relative addressing range of LDR, and (ii) the fact that the constant pool needs to be relocatable, the aarch64 backend emits constants inline with code and branches around them. This is suboptimal: it lowers code density and adds a branch for every inline constant.

To handle the first issue, we should use the ConstantPool to collect constants, as the old backend does, and then modify our emission logic to defer constant emission to the end unless we're about to go out of range of pending constant references, in which case we emit a "constant island".

To handle the second issue, we need to emit constant-pool relocations.

It's still unclear how we can handle the intersection of these two, i.e., when constant references become out-of-range because the client relocates the constant pool further away. One known use case of the relocatable constant pool is for SpiderMonkey to insert its own epilogue into Wasm functions; there, at least, we can bound how much further away the constant pool will be, so perhaps we can just have a slightly-conservative limit for when we must emit a constant island.
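The deferral rule described above can be sketched roughly as follows. This is only an illustration, not the MachBuffer implementation: `Emitter`, `PendingConst`, `island_needed` and the `slack` margin are hypothetical names, and the only hard fact relied on is that an AArch64 literal load reaches about 1 MiB in either direction. The `slack` field stands in for the slightly conservative limit that would cover code a client inserts before the pool.

```rust
/// Reach of an AArch64 `LDR (literal)`: a signed 19-bit word offset, about +/-1 MiB.
const LDR_LITERAL_RANGE: u32 = 1 << 20;

/// A constant waiting to be emitted, plus the code offset of the load that uses it.
struct PendingConst {
    data: Vec<u8>,
    first_use: u32,
}

/// Hypothetical emitter that defers constants until they are about to go out of range.
struct Emitter {
    code: Vec<u8>,
    pending: Vec<PendingConst>,
    /// Assumed upper bound on bytes a client may insert before the pool.
    slack: u32,
}

impl Emitter {
    fn new(slack: u32) -> Self {
        Emitter { code: Vec::new(), pending: Vec::new(), slack }
    }

    fn offset(&self) -> u32 {
        self.code.len() as u32
    }

    /// Record that the instruction at the current offset loads `data` from the pool.
    fn use_constant(&mut self, data: &[u8]) {
        let first_use = self.offset();
        self.pending.push(PendingConst { data: data.to_vec(), first_use });
    }

    /// Conservative check: would emitting `next_len` more bytes of code, then the
    /// whole pool, put the end of the pool out of reach of the oldest reference?
    fn island_needed(&self, next_len: u32) -> bool {
        let pool_size: u32 = self.pending.iter().map(|c| c.data.len() as u32).sum();
        match self.pending.iter().map(|c| c.first_use).min() {
            Some(first) => {
                self.offset() + next_len + pool_size + self.slack - first >= LDR_LITERAL_RANGE
            }
            None => false,
        }
    }

    /// Flush all pending constants at the current offset. A real emitter would
    /// also branch over the island when it lands in the middle of a function.
    fn emit_island(&mut self) {
        for c in self.pending.drain(..) {
            self.code.extend_from_slice(&c.data);
        }
    }

    /// Emit one instruction; returns true if an island had to be emitted first.
    fn emit_inst(&mut self, bytes: &[u8]) -> bool {
        let island = self.island_needed(bytes.len() as u32);
        if island {
            self.emit_island();
        }
        self.code.extend_from_slice(bytes);
        island
    }
}
```

With this shape, a function shorter than the literal-load range never gets an island and all of its constants end up after the last instruction; only very long functions pay for the extra branch.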


abrown · 2020-04-17T20:26:12Z

For reference, there are other things to think about with constant pools; see #1385.

akirilov-arm · 2020-11-04T23:59:39Z

There is a set of constants that might be particularly problematic - those generated for Inst::LoadExtName. The complication is that we need to emit relocations that modify the value in the constant pool itself. @cfallin, has a scheme similar to a Global Offset Table (GOT) been considered?

There is also another problem that is quite similar - lack of deferred emission of instructions, in particular traps. Consider integer division; currently the AArch64 backend generates:

sdiv rd, rn, rm
cbnz rm, #8
udf
cmn rm, #1
ccmp rn, #1, #nzcv, eq
b.vc #8
udf

The natural expectation is that most of the time the UDF instructions are not executed, so this sequence has the same deficiencies as inline constants, i.e. lower code density and more complicated control flow. Furthermore, it contains forward conditional branches, which may very well be predicted as not taken - exactly the opposite of the common case.

I think that both issues (the one in the previous paragraph and constant pools) could be solved simultaneously.

BTW the limited range of literal loads could be extended by emitting a combination of ADRP + LDR instead of just a literal load, which gives a range of approximately 8 GB (4 GB in either direction).
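To make the range arithmetic concrete, here is a small sketch of how an ADRP + LDR pair splits an address. The helper name `adrp_ldr_parts` is made up for illustration. It assumes the usual encoding: ADRP carries a signed 21-bit page delta (4 KiB pages), and the following LDR supplies the low 12 bits as an unsigned offset, which must be aligned to the access size.

```rust
/// Split a reference from `pc` to `target` into the ADRP page delta and the
/// low-12-bit offset used by the following LDR. Returns `None` when the target
/// page is outside ADRP's signed 21-bit page immediate (about +/-4 GiB).
fn adrp_ldr_parts(pc: u64, target: u64) -> Option<(i64, u16)> {
    // ADRP works on 4 KiB pages: it ignores the low 12 bits of both addresses.
    let page_delta = ((target >> 12) as i64) - ((pc >> 12) as i64);
    if page_delta < -(1 << 20) || page_delta >= (1 << 20) {
        return None;
    }
    // The LDR adds the offset of the target within its page.
    Some((page_delta, (target & 0xfff) as u16))
}
```

The cost is one extra instruction per load compared with a plain literal load, so this trades code size for reach.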

cfallin · 2020-11-05T02:40:27Z

> There is a set of constants that might be particularly problematic - those generated for Inst::LoadExtName. The complication is that we need to emit relocations that modify the value in the constant pool itself. @cfallin, has a scheme similar to a Global Offset Table (GOT) been considered?

@akirilov-arm no, we haven't considered anything like a GOT, though at some point we do need to fill out support for relocation types like this. Certainly the early/bringup work for the aarch64 backend was the JIT use-case, where GOTs are not really used (or at least, e.g. SpiderMonkey has a table of wasm functions and IIRC globals but it generates custom code for accesses); so we've gotten away with the relatively simple set of relocation types so far.

Longer-term, I think we ought to just emit the proper relocations for e.g. GOT references; we'll have to wire through the relocation-type support to the relevant backends, including cranelift-object (if not already there) and wasmtime's relocation handling; and we'll have to have a notion of "memory model" which is sort of a superset of our current RelocDistance, indicating when a symbol is external and the compilation mode calls for a GOT.

> deferred emission of instructions, in particular traps.

This is an interesting one: my first instinct is that it might be simpler to handle this case directly during lowering, by keeping a list of deferred trap-paths (combination of MachLabel and TrapCode) and emitting and binding to labels a sequence of trap instructions at the tail of the function. There are some complications there for SpiderMonkey (it expects a fallthrough return and stitches its own epilogue on after Cranelift returns) but otherwise I think it would work. If we need a more general mechanism, we could defer emission via a mechanism on the MachBuffer, but I'm hesitant to add that complexity unless we really do need it. Open to other ideas here, of course :-)
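A rough sketch of the "list of deferred trap-paths" idea, to show the shape of it. Everything here is a stand-in: `Label` plays the role of MachLabel, the `TrapCode` enum has only two illustrative variants, and instructions are plain strings rather than real MachInsts. The branch conditions are inverted relative to the inline sequence so that the common path falls through.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum TrapCode {
    IntegerDivisionByZero,
    IntegerOverflow,
}

/// Hypothetical stand-in for a MachLabel.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Label(usize);

struct Lowering {
    insts: Vec<String>,
    /// Trap paths to emit at the tail of the function.
    deferred: Vec<(Label, TrapCode)>,
}

impl Lowering {
    fn new() -> Self {
        Lowering { insts: Vec::new(), deferred: Vec::new() }
    }

    /// Allocate a label for an out-of-line trap and remember its trap code.
    fn trap_label(&mut self, code: TrapCode) -> Label {
        let label = Label(self.deferred.len());
        self.deferred.push((label, code));
        label
    }

    /// Signed division: the checks branch to the tail only in the rare case.
    fn lower_sdiv(&mut self) {
        let zero = self.trap_label(TrapCode::IntegerDivisionByZero);
        let ovf = self.trap_label(TrapCode::IntegerOverflow);
        self.insts.push("sdiv rd, rn, rm".to_string());
        self.insts.push(format!("cbz rm, L{}", zero.0));
        self.insts.push("cmn rm, #1".to_string());
        self.insts.push("ccmp rn, #1, #nzcv, eq".to_string());
        self.insts.push(format!("b.vs L{}", ovf.0));
    }

    /// Bind every deferred label to a trap instruction after the function body.
    fn finish(mut self) -> Vec<String> {
        let deferred = std::mem::take(&mut self.deferred);
        for (label, code) in deferred {
            self.insts.push(format!("L{}: udf ; {:?}", label.0, code));
        }
        self.insts
    }
}
```

Lowering one division followed by a `ret` with this sketch yields the five-instruction hot path, the return, and then two `udf` instructions at the tail. It does not address the SpiderMonkey fallthrough-return complication mentioned above.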

bjorn3 · 2021-02-03T20:50:32Z

This is done, right?

cfallin · 2021-02-03T20:58:17Z

We actually currently synthesize up to 64-bit values inline with MOVZ/MOVK/MOVN instructions, and use this for f32/f64 as well (integer constant then move to float reg). 128-bit constants are included inline and then branched around. @abrown added the constant-pool functionality but we haven't made use of this in the aarch64 backend yet. It is still worthwhile, I think, at least to evaluate; let's leave it open to track.
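As an illustration of the inline synthesis, the sketch below plans a MOVZ/MOVK sequence for a 64-bit value by splitting it into 16-bit chunks and skipping the zero ones. The function name `movz_movk_plan` is hypothetical, and the sketch deliberately ignores the MOVN variant (which helps for values that are mostly ones) and logical-immediate encodings, so it is an upper bound rather than what the backend actually emits.

```rust
/// Plan a MOVZ/MOVK sequence for `value`. Each entry is (shift, imm16); the
/// first entry would be a MOVZ and the rest MOVKs. Zero chunks are skipped.
fn movz_movk_plan(value: u64) -> Vec<(u32, u16)> {
    let mut plan: Vec<(u32, u16)> = (0..4u32)
        .map(|i| (i * 16, (value >> (i * 16)) as u16))
        .filter(|&(_, imm)| imm != 0)
        .collect();
    if plan.is_empty() {
        // Zero still needs a single `movz rd, #0`.
        plan.push((0, 0));
    }
    plan
}
```

For example, the bit pattern of the f64 value 1.0 has a single nonzero 16-bit chunk and so needs one instruction before the move to a float register, while an arbitrary 64-bit pattern can need four.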

akirilov-arm · 2021-02-04T00:33:46Z

Some 64-bit floating-point constants are also loaded from an inline literal (these cases may be reduced further - refer to the comments). Note that based on my microbenchmark data in #2296, it definitely makes sense to consider using constant pools.

cfallin · 2021-02-04T00:43:10Z

@akirilov-arm: yes, sorry, that update somehow slipped off my radar -- thanks very much for the detailed evaluation! I agree that based on that update, we should probably at least replace the inline-constant cases with constant pool usages; the branch-around seems to have a significant impact.

akirilov-arm · 2021-02-04T00:53:58Z

@cfallin This is a side question, but do we have a mechanism to share a constant between several users instead of rematerializing it each time? This is particularly relevant to the bitmask extraction operations, in which case the constants are always the same. Of course, the increased register pressure is a trade-off to consider.

Also, some of the proposals to the Wasm SIMD specification (see WebAssembly/simd#395 for example) are made with such a capability in mind.

cfallin · 2021-02-04T01:17:42Z

@akirilov-arm we can share within a single function at least: @abrown's work in #2328 added deduplication of VCodeConstants and the plumbing to use the MachBuffer's constant-island support to emit them. (We also dedup statically, i.e. at the codegen crate level: the "well-known constant" option takes a &'static [u8] to avoid carrying duplicates of constant data by value in the VCode.)
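For readers unfamiliar with the idea, per-function deduplication can be sketched as a map from constant bytes to a handle. This is a simplified illustration with made-up names (`ConstPool`, `ConstHandle`), not the actual VCodeConstants code from #2328.

```rust
use std::collections::HashMap;

/// Hypothetical handle that instructions would use to refer to a pooled constant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ConstHandle(usize);

/// Per-function constant pool that stores each distinct byte string once.
#[derive(Default)]
struct ConstPool {
    data: Vec<Vec<u8>>,
    index: HashMap<Vec<u8>, ConstHandle>,
}

impl ConstPool {
    /// Return the existing handle for `bytes`, or add a new pool entry.
    fn insert(&mut self, bytes: &[u8]) -> ConstHandle {
        if let Some(&handle) = self.index.get(bytes) {
            return handle;
        }
        let handle = ConstHandle(self.data.len());
        self.data.push(bytes.to_vec());
        self.index.insert(bytes.to_vec(), handle);
        handle
    }
}
```

Two uses of the same 16-byte mask then share one pool entry, although each use still performs its own load.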

We can't dedup across function bodies; that might be nice to have, but it would require a bit more plumbing due to our separate function compilation (which enables other things like parallelization). In principle we could have an "rodata object" that we aggregate from all MachCompileResults after-the-fact, I suppose. Do you think we need that level of deduplication?

abrown · 2021-02-04T18:02:23Z

I think #2468 may be somewhat applicable to this thread; I wanted an ergonomic way to add constants during lowering (LowerCtx::emit_constant) and that PR is a few steps down that path. I abandoned it since it is not strictly necessary (more of a "nice to have") but I can finish it if we figure some of this stuff out. E.g., it would be sort of nice to have a global rodata.

akirilov-arm · 2021-02-04T18:48:12Z

@cfallin

> Do you think we need that level of deduplication?

No, I don't think so. As for the existing deduplication - if I understand correctly, it avoids keeping several copies of the same data inside the pool, but each use of the data still requires a literal load. I was thinking more in the direction of keeping the constant data live in a register and avoiding memory operations; this approach might be relevant even for more complicated integer constants, say those that need more than 2 instructions to materialize (arbitrary limit - I don't have hard data to justify it).

cfallin · 2021-02-04T21:27:28Z

Ah, OK. In theory, at least for constant duplication at the CLIF level, this may be covered by CSE. Constants that arise during lowering will not be CSE'd, so either post-lowering opts or constant rematerialization at the regalloc level could help. We might get to either or both of these someday. A constant pool with dedup is still better than literal loads from separate constant locations if we're going to do multiple literal loads (perhaps because opts are disabled or we haven't implemented remat yet); so these are complementary I think.

cfallin added the enhancement and cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend) labels on Apr 17, 2020

cfallin added the cranelift:area:aarch64 (Issues related to AArch64 backend) label on Apr 18, 2020

akirilov-arm mentioned this issue on Oct 19, 2020 in "Cranelift AArch64: Support the SIMD bitmask extraction operations" #2296 (closed)

akirilov-arm added the cranelift (Issues related to the Cranelift code generator) label on Oct 1, 2021
