# Summary

There is a need to improve the mechanisms for describing instruction lowering logic in Cranelift backends.

The Cranelift compiler backend design was recently (in 2020) overhauled to simplify its operation and remove the difficult-to-understand legalization/encoding/recipe-based system. The new design instead allows for straightforward lowering: the backend does a `match` over opcodes, and emits some machine-specific instructions. This has worked well enough, and has allowed for the proliferation of multiple new backends (new x64, aarch64, s390x, partial arm32) and a wide range of supported Wasm features (e.g., SIMD), all in a more theoretically flexible instruction-selector framework (e.g., allowing backends to merge loads, complex address generation, and extends/shifts into other operations).

However, as the new backends increase in complexity, there is a crossover point at which the handwritten pattern matching itself becomes unwieldy, and a DSL of *some* sort would likely reduce repetition, avoid errors, and allow for more flexibility in refactoring.

This pre-RFC explores the design space for "instruction selector DSLs", and suggests a list of requirements that should be met by any system that would take over the new codegen backend logic.

The intent of this pre-RFC is to achieve consensus on a set of requirements (possibly those presented here, or others gathered from contributors) so that we can design an appropriate DSL, and to pose discussion questions that will start to direct us toward the appropriate design. It is, as much as possible, unopinionated: we will try to summarize pros and cons where design tradeoffs are discussed. The one exception is that it is slightly biased in the direction of "we need a DSL" -- otherwise, why discuss this? -- but this too is a question that is certainly open for discussion.

# Background and Motivation

## Approaches to Instruction Selection

Instruction selection -- the problem of translating a program in some intermediate representation (IR) into instructions for some CPU architecture -- is a well-structured problem that lends itself to approaches that use generated code (sometimes known as the "meta-compiler" or "compiler-compiler" approach). This is because, at some basic level, it is a *pattern-matching* problem: the code that performs the translation matches patterns in the input and generates certain combinations of instructions in the output for each pattern.

There are competing approaches to implementing such a pattern-matching system. On the one hand, any framework or system that abstracts the common logic of pattern matching away from the patterns themselves will add to complexity. In the initial design of Cranelift's backend framework (and its first x86-specific lowering rules), this complexity became overwhelming, and presented problems for engineers wishing to improve existing backends or contribute new ones [1]. Such a system, if mismatched to the characteristics of the problem at hand, can also contribute to inefficiency: for example, if it works by applying rules incrementally until no more rules apply, there may be unnecessary overhead in dynamically scheduling rules, and in constructing intermediate states in memory, when the final translation for a particular input pattern could be known statically.

On the other hand, one can implement a compiler backend that translates the input to output instructions in one step with hand-written code. This is the answer that we arrived at in Cranelift's new backend design. By going to the other extreme and avoiding any significant pattern-matching-specific framework, it achieves simplicity, approachability, and efficiency (in the ideal case).

[1]: https://github.com/bytecodealliance/wasmtime/issues/1141
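To make this contrast concrete, here is a minimal sketch of the two styles -- all types and functions are invented for illustration, and none of this is Cranelift code:

```rust
// Hypothetical types, for illustration only.
struct Expr;
struct Rule;

impl Rule {
    /// Try to apply this rewrite rule; `Some` means a new intermediate
    /// form of the expression was produced.
    fn apply(&self, _e: &Expr) -> Option<Expr> {
        None
    }
}

/// Iterative style: apply rules until a fixpoint is reached, paying for
/// dynamic rule scheduling and for every intermediate `Expr` constructed
/// in memory along the way.
fn rewrite_to_fixpoint(mut e: Expr, rules: &[Rule]) -> Expr {
    loop {
        let mut changed = false;
        for rule in rules {
            if let Some(next) = rule.apply(&e) {
                e = next;
                changed = true;
            }
        }
        if !changed {
            return e;
        }
    }
}

/// Direct style: when the final translation for a pattern is known
/// statically, a single pass can emit it with no intermediate states.
fn lower_directly(_e: &Expr) -> Vec<&'static str> {
    vec!["add x0, x1, x2"] // placeholder machine instruction
}
```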
This technique is extremely approachable when backends are simple. For example, the top-level `match` on opcode, and the arm for an `add` instruction, might appear as follows, if the backend always lowers an "add" operation to a single add instruction with no other operation-combining or optimizations (slightly simplified but representative):

```rust
fn lower_inst(ctx: &mut Ctx, insn: Insn, /* ... */) {
    let op = ctx.data(insn).opcode();
    match op {
        // ...

        Opcode::Iadd => {
            let rn = put_input_in_reg(ctx, insn, 0);
            let rm = put_input_in_reg(ctx, insn, 1);
            let rd = get_output_reg(ctx, insn, 0);

            ctx.emit(Inst::AluRRR {
                op: ALUOp::Add64,
                rn,
                rm,
                rd,
            });
        }
    }
}
```

Adding support for a new IR-level opcode then has relatively little overhead: it requires only a new match arm and (if needed for the lowering) new machine instructions in the `Inst` enum.

## Matching Combinations of Operators

The next natural step in this hand-written style is to perform deeper pattern matching and generate better code when possible, perhaps matching on, and combining, more than one operation in the input program into one instruction in the output. Deeper pattern matching is usually written via ordinary conditionals, forming a decision tree (or, actually, DAG) of sorts. Decisions are made depending on further properties of a portion of the input: an opcode, an immediate, a branch condition, etc. Sometimes the decision is an N-way choice: match on an input type and generate one of N possible instructions, for example. In other cases, the decision is binary: e.g., if an immediate is small enough to fit in a certain field, then generate an immediate-form instruction; otherwise, fall back to the general case.

For a simple example of this deeper matching, consider the addition of `MADD` (multiply-add) support to the above add-op lowering code:

```rust
    match op {
        // ...

        Opcode::Iadd => {
            if let Some((mul_insn, 0)) = maybe_input_insn(ctx, 0, Opcode::Imul) {
                let rn = put_input_in_reg(ctx, mul_insn, 0);
                let rm = put_input_in_reg(ctx, mul_insn, 1);
                let ra = put_input_in_reg(ctx, insn, 1);
                ctx.emit(Inst::AluRRRR { op: ALUOp::MAdd64, /* ... */ });
            } else if let Some((mul_insn, 1)) = maybe_input_insn(ctx, 1, Opcode::Imul) {
                // ...
            } else {
                // Normal add, as above.
            }
        }
    }
```

This is "optimal" in the sense that (i) it checks further up the tree of operands only if we have matched the pattern so far, and (ii) it shares work to find the "prefix" of the match (the Add at the root). With care, more complex patterns can also be combined in this way, and this has been done successfully in the backends so far.
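The binary immediate-form decision described above can be sketched in the same simplified style. (The `input_as_imm12` helper and the `AluRRImm12` instruction form are hypothetical stand-ins here, in the spirit of the examples above.)

```rust
    match op {
        // ...

        Opcode::Iadd => {
            let rd = get_output_reg(ctx, insn, 0);
            let rn = put_input_in_reg(ctx, insn, 0);
            // Binary decision: if the second operand is a constant that
            // fits in a 12-bit immediate field, emit the immediate form;
            // otherwise fall back to the general register-register form.
            if let Some(imm) = input_as_imm12(ctx, insn, 1) {
                ctx.emit(Inst::AluRRImm12 { op: ALUOp::Add64, rd, rn, imm });
            } else {
                let rm = put_input_in_reg(ctx, insn, 1);
                ctx.emit(Inst::AluRRR { op: ALUOp::Add64, rd, rn, rm });
            }
        }
    }
```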
## Downsides as Complexity Increases

The major downsides of this approach, however, have become more prominent as the backends have grown more complex. The hand-written simplicity that allowed us to achieve efficient lowering (direct translation rather than iterative application of in-place edits), and that made the backends approachable for contributors (to great success!), has become bogged down in layers of nested, manual pattern matching.

There are three major disadvantages. The first is that it is simply tedious, and this discourages work on more optimal code generation in specific cases where better lowerings exist. In contrast, a "pattern language" that could express a new lowering rule in a few lines would allow for much quicker, and more thorough, optimization of the backends by domain experts. This is especially relevant for SIMD and other arithmetic opcodes, where complex sequences must be emitted to obtain the correct semantics in various cases.

The second disadvantage is that hand-written code is difficult to refactor or migrate across API or backend design changes. For example, as part of the regalloc2 effort, an attempt was made at first to migrate enough of the Cranelift backend(s) to (i) generate SSA code (i.e., not reuse virtual registers) and (ii) use slightly different regalloc APIs. This diff grew to 4k lines at roughly one-third complete, and involved essentially re-authoring all of the delicate lowering patterns, with a high chance of introducing new bugs. In contrast, code that is generated can be adapted to use new APIs or frameworks by adjusting the code generator itself, which is usually much easier (the changes are made in only one place). More importantly still, the lowering patterns themselves are not modified, so the semantics (and correctness) remain exactly the same. Especially as we look toward further optimizations in our compiler pipeline, this flexibility will be very important.

The final disadvantage of hand-written pattern matching is that it becomes increasingly error-prone as more matching rules and optimizations interact. Two example bugs suffice to illustrate the problem. First, the security bug CVE-2021-32629 arose out of a case in which we recognized and elided an unnecessary zero-extend, when the proper (extended) type was not exposed to the register allocator. A system that reasons in a principled way about operators with typed inputs and outputs might not have allowed us to express this bug. Second, the bug fixed by PR #2576 [2] was caused by an over-aggressive merge of a load into a compare op that was generated more than once; again, a system that enforces invariants in the way that instructions are matched and merged would likely have caught this.

[2]: https://github.com/bytecodealliance/wasmtime/pull/2576

## The Renewed Need for a Principled Approach

For all of the above reasons, as our machine backends evolve to become more feature-complete and optimized, and as we have now settled on a reasonable single-pass lowering design, it is again time to consider developing a principled approach by which we can express instruction lowering patterns in some sort of DSL.

Why now, and how will it be different from the legalization/encoding/recipe system? There are (at least) two reasons:
1. Intrinsic complexity is increasing. We now support more instructions (SIMD), and we have more fine-tuned instruction lowerings (the old backend could never, e.g., combine address-generation expressions into complex addressing modes, or make use of the built-in extend/shift capabilities of instructions).

2. We understand the required design better now. The earlier legalization-based system was tied to an edit-in-place approach, and also to the idea that the machine lowering can simply be represented as an overlay of "encodings" on the IR; as discussed in [3], we now understand that a two-level IR fits this problem space better, and that a streaming, one-pass approach to lowering is more efficient. It took a descent into simple hand-written code to reach this point, and to allow us to bring up three new machine backends. But now that the design has settled, it makes sense to re-abstract into a mechanized approach.

[3]: https://cfallin.org/blog/2020/09/18/cranelift-isel-1/

# Requirements for a DSL

Now that we have established the need for some sort of DSL, let's examine what requirements are imposed by the problem at hand.

## Requirement 1: First-Class Destructuring/Matching

The most important task of an instruction-selection DSL is matching on, and destructuring (binding variables to components of), the input IR.

Thus, the DSL must make these operations as natural and easy to express as possible. Even a little bit of friction -- say, manually calling a method, or manually extracting matched components, for each level of a pattern match -- will discourage efforts to refactor and optimize lowering rules.

We should strive for the simplicity of other pattern-matching DSLs that have been used to specify instruction-selection patterns. For example, the Go compiler uses a DSL to generate lowering code from S-expression patterns such as [4]:

```
// Lowering arithmetic
(Add(64|32|16|8) ...) => (ADD(Q|L|L|L) ...)

// combine add/shift into LEAQ/LEAL
(ADD(L|Q) x (SHL(L|Q)const [3] y)) => (LEA(L|Q)8 x y)

// Merge load and op
((ADD|SUB|AND|OR|XOR)Q x l:(MOVQload [off] {sym} ptr mem)) &&
  canMergeLoadClobber(v, l, x) && clobber(l) =>
((ADD|SUB|AND|OR|XOR)Qload x [off] {sym} ptr mem)
```

[4]: https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/gen/AMD64.rules

Likewise, in LLVM's TableGen-based machine specifications, one can find expression-tree matchers such as [5]:

```
// A read-modify-write negate: matches a (store (ineg (loadi64 ...))) tree.
def NEG64m : RI<0xF7, MRM3m, (outs), (ins i64mem:$dst), "neg{q}\t$dst",
                [(store (ineg (loadi64 addr:$dst)), addr:$dst),
                 (implicit EFLAGS)]>,
             Requires<[In64BitMode]>;
```

[5]: https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86InstrArithmetic.td

Similarly, in our own tree, Peepmatic can express instruction rewrites with a straightforward rule syntax that pairs a pattern (the left-hand side) with a replacement (the right-hand side):

```
(=> (bor $x (bor $x $y))
    (bor $x $y))
```

which could in principle be extended to machine-specific instruction opcodes.

Whatever DSL we adopt or design should allow one- or two-line patterns to be written similarly to these whenever a new lowering for a pattern of IR instructions must be added.

## Requirement 2: Transition from Hand-Written Code

A key fact to consider is that *our machine backends already exist*, and they must continue working as we implement and make use of whatever DSL we define. Furthermore, we should do as much as we can to avoid large atomic switchover PRs. It is much easier to (i) make progress and (ii) verify correctness as we go if we can migrate small bits of backend logic at a time.

There are two ways in which DSL-written and hand-written code can interact:

- "Horizontally", between different instructions. For example, we might start to use a DSL to define arithmetic/SIMD instructions, and optimize SIMD lowerings, while retaining the existing (working, verified, hand-optimized) lowering for all other CLIF instructions. Or we might want to prove out the DSL by fully implementing all of the existing pattern-matching cases for just one instruction. It should be possible to do this while deferring to existing handwritten code for all unhandled cases.

  This is the simplest form of co-existence, and would allow a migration strategy by which we implement the entire lowering logic for one or more Cranelift opcodes at a time, eventually covering the whole CLIF opcode space.

- "Vertically", within the logic to lower a single CLIF instruction. For example, we might use the DSL to define lowering patterns for arithmetic instructions and combinations of arithmetic instructions, but defer to existing handwritten code that looks for loads from memory that can be merged into the arithmetic instructions and that computes the best addressing mode with more complex logic. Or, going the other way, we might try to systematize the addressing-mode lowering into a set of rules, and use it from existing handwritten code that lowers loads/stores.

  This implies a sort of "FFI" by which the DSL code can call out to Rust, and perhaps the Rust code can call back into (sub)rules in the DSL (see the sketch after this list).

  This is a more challenging form of interoperation between old and new code (though it may fall more naturally out of some approaches than others). If we can achieve it, it would allow us to (i) continue to make use of shared code, such as code that computes addressing modes, various forms of immediates, etc., when writing new lowerings in the DSL, and (ii) have more flexibility with certain complex lowering logic. The main advantage can thus be described as "flexibility".

  The main downside of "vertical" interop is that, by providing more flexibility, it can make analyzability more difficult. For example, if we aim to verify the lowerings, we need to annotate what these calls to outside code are doing for the sake of the verification tool.

  This is reminiscent of "unsafe" code in Rust: it allows one to build axiomatic building blocks with flexibility, but it requires one to carefully define *what* the building blocks do.
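For concreteness, a minimal sketch of what the "vertical" seam might look like on the generated-Rust side. All names and the rule syntax here are invented for illustration; the point is only that the generated matcher calls a handwritten helper that the DSL declares but does not define:

```rust
// Handwritten Rust, shared with the existing backend; the DSL would
// declare `compute_amode` as an external constructor.
fn compute_amode(ctx: &mut Ctx, insn: Insn, input: usize) -> Amode {
    // ...complex addressing-mode matching, shared with the handwritten
    // lowerings for loads and stores...
    unimplemented!()
}

// Code the DSL compiler might generate for a hypothetical rule such as
// `(load $addr) => (mach_load (amode $addr))`:
fn lower_load_generated(ctx: &mut Ctx, insn: Insn) {
    let rd = get_output_reg(ctx, insn, 0);
    let mem = compute_amode(ctx, insn, 0); // the call out across the "FFI"
    ctx.emit(Inst::ULoad64 { rd, mem });
}
```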
## Requirement 3: Generate "Optimal" Matching Code

Despite a preference for simplicity, the DSL should also work to achieve *performant* pattern matching, even when this means that the generated code works differently than a naïve interpretation of the semantics of the pattern matching might suggest. As a simple example, when a decision rests on an opcode or other parameter and there are N valid options, we should always do an N-way dispatch with a `match` rather than a case-by-case linear-time search.

In general, performant lowering should avoid redundant work. This likely implies factoring out and sharing the results of subtree matching when multiple instruction-lowering cases would use them, rather than re-processing the subtree at each query. The system could solve this either by careful construction and ordering of matching code (statically), or by memoization (dynamically), or by some combination thereof.

## Requirement 4: Flexible / Minimal Coupling to Cranelift Design

Though not necessary for the initial use-case, a DSL that we develop *should* be as generic as possible with respect to Cranelift-specific details.

At the most basic level, one could define a common set of abstractions that the instruction-lowering system expects to find at its boundaries -- a set of traits to implement for the input and output IRs, for example.

An even more generalized design point would allow the concepts of "instruction" and "operand" to be defined entirely within the DSL itself. This flexibility comes with an inherent tradeoff: the less the DSL compiler knows about the domain, the fewer domain-specific optimizations it can apply. However, it might be the case that all relevant optimizations are actually tied only to lower-level concepts in the metalanguage, like pattern matching.

Minimal coupling also allows for use of the lowering rules in different contexts: for example, it would likely help with verification efforts (below), and it makes testing/fuzzing much more practical with a smaller testing harness.
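As a sketch of what the most basic level might look like (all names invented for illustration), the boundary could be a pair of traits that any input IR and output instruction set implement:

```rust
/// What the generated matcher needs from the input IR: a way to inspect
/// instructions, their opcodes, and their operands.
trait InputIr {
    type Insn: Copy;
    type Opcode: Copy + Eq;
    type Value: Copy;

    fn opcode(&self, insn: Self::Insn) -> Self::Opcode;
    fn num_inputs(&self, insn: Self::Insn) -> usize;
    fn input(&self, insn: Self::Insn, idx: usize) -> Self::Value;
    /// The instruction producing a value, if any (used to match further
    /// up the operand tree).
    fn def(&self, value: Self::Value) -> Option<Self::Insn>;
}

/// What the generated lowering needs from the output side: a sink for
/// emitted machine instructions.
trait MachSink {
    type MachInst;
    fn emit(&mut self, inst: Self::MachInst);
}
```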
## Requirement 5: Ability to Encode Lowering Invariants

When lowering code, there are a number of rules that one must carefully observe in order to maintain correctness. A simple program composed only of pure arithmetic operators poses little worry: one can always duplicate an operator, or move it elsewhere in the program, as long as the dataflow remains intact. In contrast, (i) loads and stores, (ii) operators that can trap (such as divides), and (iii) control flow impose hard limits on which IR-level operators can be merged together, and on which can be hoisted or sunk in the program.

For an example of a somewhat subtle condition, consider the rules that govern whether a load can be merged into another operator [6]:

```
/// ...
/// [This function returns:]
/// ...
///
/// - An instruction, given that it is effect-free or able to sink its
///   effect to the current instruction being lowered, and given it has only
///   one output, and if effect-ful, given that this is the only use;
///
/// ...
///
/// If the backend merges the effect of a side-effecting instruction, it
/// must call `sink_inst()`. When this is called, it indicates that the
/// effect has been sunk to the current scan location. The sunk
/// instruction's result(s) must have *no* uses remaining, because it will
/// not be codegen'd (it has been integrated into the current instruction).
```

[6]: https://github.com/bytecodealliance/wasmtime/blob/9419d635c68f79d12471353e28b60cc8da010dad/cranelift/codegen/src/machinst/lower.rs#L110-L125

The above-cited compare-load merge bug (PR #2576) is a good example of a bug that can result from a non-obvious violation of these conditions. There are other cases where we discovered a potential issue during code review. We have taken care to try to design the `LowerCtx` API to avoid as many invariant violations as possible (e.g., the `Writable` type that encodes whether one can write to a given virtual register), but nevertheless, it remains an area in which we have to be vigilant.

If we eventually express lowering logic in a DSL, then, we should ensure that the DSL allows us to encode such restrictions, and that it enforces them. For example, we might want the DSL to provide two different ways to refer to an instruction input: one when we are the only user, and one when we are not. Then the former might be required in order to sink a side effect. Ideally, such restrictions could be expressed in some sort of DSL-internal type system, so that we do not have to hardcode lowering rules into the DSL design itself.
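A minimal sketch (all types hypothetical) of how the only-user restriction could be encoded in such a type system: only the handle that proves sole use of an input offers `sink`, and `sink` consumes the handle, so the effect can be merged at most once:

```rust
struct Insn(u32);

/// An input whose producing instruction has other uses: it may be read,
/// but its side effect may not be sunk into the current lowering.
struct SharedUse(Insn);

/// An input whose producing instruction is used only here; only this
/// handle offers `sink`.
struct UniqueUse(Insn);

impl UniqueUse {
    /// Consumes the handle: after sinking, no further uses of the sunk
    /// instruction's results can be created.
    fn sink(self) -> Insn {
        self.0
    }
}
```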
## Requirement 6: Helps to Advance Verification Efforts

A top-level goal of the Cranelift project is to actively encourage verification efforts. As two examples, (i) [VeriWasm](https://github.com/plsyssec/veriwasm/) statically proves heap-isolation invariants in Cranelift-compiled WebAssembly code, and (ii) the [regalloc checker](https://github.com/bytecodealliance/regalloc.rs/blob/main/lib/src/checker.rs) statically proves the correctness of a particular register-allocation run. More such efforts are expected in the future, with additional scope.

In order to facilitate verification of instruction selection, the DSL with which we express the instruction lowering logic should allow for formal reasoning about the lowering, and about the correspondence between input and output. For example, if we were to eventually have a formal semantics for CLIF, and for the machine code of some architecture, we should be able to translate the lowering rules in the DSL into equivalences that tie the two semantics together.

There are two levels of extraction that we might expect to achieve, with varying levels of difficulty. The first we could consider to be equivalent to the regalloc checker: the instruction selector generated from the DSL could produce metadata that would allow us to associate input and output, and prove *for a specific compilation* that the result is correct. This is likely much easier (it is very similar to debuginfo), and is an approach that has recently been explored in research [7]. It also provides an end-to-end guarantee, if done right: the metadata could tie Wasm bytecode-level semantics to machine-code-level semantics, covering all stages of the compilation pipeline. If we decide on such a verification approach, then propagation of this input-to-output correspondence is enough.

The second level of verification is a static "for all inputs" verification, and requires extraction of all lowering rules in a symbolic form. This sort of verification is much harder in general, especially when instruction selection involves code motion (hoisting or sinking), control flow, or other side effects; though for the case of straight-line, expression-oriented code, one can often prove equivalence with, e.g., an SMT problem formulation. Despite the present difficulties in the state of the art, we should not prematurely rule out more general backend verification using symbolic techniques, and as a result, we should ensure that there is a way to perform such extraction.

[7]: T. Kasampalis, D. Park, Z. Lin, V. S. Adve, and G. Roşu. Language-parametric compiler validation with application to LLVM. In ASPLOS 2021. https://dl.acm.org/doi/abs/10.1145/3445814.3446751
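As a toy illustration of what "equivalences that tie the two semantics together" means for the `MADD` example earlier (the semantics functions are invented here; a real system would extract them from formal specifications and discharge the check symbolically, e.g. as an SMT query, rather than by testing):

```rust
/// CLIF-level semantics of the matched pattern `iadd(imul(x, y), z)`.
fn clif_semantics(x: u64, y: u64, z: u64) -> u64 {
    x.wrapping_mul(y).wrapping_add(z)
}

/// Machine-level semantics of the emitted instruction: aarch64
/// `madd Xd, Xn, Xm, Xa` computes `Xn * Xm + Xa`.
fn madd_semantics(x: u64, y: u64, z: u64) -> u64 {
    x.wrapping_mul(y).wrapping_add(z)
}

/// A dynamic spot-check of the equivalence; static "for all inputs"
/// verification would prove it symbolically instead.
fn check_rule(samples: &[(u64, u64, u64)]) -> bool {
    samples
        .iter()
        .all(|&(x, y, z)| clif_semantics(x, y, z) == madd_semantics(x, y, z))
}
```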
# Open Discussion Questions

## Do we need a DSL to solve this problem?

It is a significant undertaking to (i) either develop or leverage a DSL, and (ii) migrate our backends to this DSL. The costs include implementation effort, learning overhead (for contributors who would use the system), and the risk of introducing bugs.

On the other hand, there are significant medium-to-long-term benefits: it would become significantly easier to make contributions, we would gain more confidence that the backends are correct, and there could be significant second-order effects (e.g., new contributors encouraged to join the project) because of these.

## Compatible with Existing Lowering API/Machinery?

Should our DSL generate code that is compatible with the existing MachInst lowering API, or should we hook into the compiler pipeline at a different level?

There are interesting implications of the first approach. Along with "gradual migration" (below), designing for compatibility with the MachInst lowering logic means that control is inverted relative to a more self-contained instruction-rewriting system. It also implies a single-pass lowering strategy, rather than an iterative application of rules.

The major advantage of plugging into the MachInst framework is pragmatic: it exists and it works, and we can continue to use it with both handwritten and DSL-generated backends. In other words, the DSL is just a tool that we use to author these backends. In contrast, moving to a system that comes with its own toplevel driver and framework would mean yet another backend-framework transition.

On the other hand, the advantage of the second approach is that it could mean we take another working system as a whole and attach it to our compiler pipeline. This grants us greater freedom to use another DSL, and to leverage existing work, if other compelling reasons lead us to do so.

## Gradual Migration?

Is the gradual-migration approach reasonable? How far should we go to allow for fine-grained, gradual replacement of backend code with DSL-generated code, vs. parallel implementations and a toplevel "try the DSL, then fall back to the existing backend" driver?

The major advantage of a gradual migration is that it allows us to get started much more easily, and to navigate tradeoffs in reimplementation vs. reuse on a case-by-case basis where complex logic is involved. For example, if we do not allow intermixing of hand-written and DSL code within a single instruction lowering, then before basic arithmetic ops can be supported on x86-64, we need to migrate the entire addressing-mode / load-op merging logic. (We could instead build the backend to at least recognize cases where this would happen, and fall back to the old backend; but this would result in duplicated matching and thus lower efficiency.) Similarly, on both x86-64 and aarch64, basic arithmetic ops require handling a complex set of "narrow value mode" logic that inserts extend operators where appropriate. While both of these examples should be expressible in the DSL, their relevance to even basic instructions implies that a "horizontal-only" mix of old and new code would require a very large initial step.

On the other hand, the major advantage of allowing for a more coarse-grained transition, and not prioritizing fine-grained interoperability of handwritten and DSL code, is that it grants more design-space freedom. If we allow ourselves to consider any instruction-selector design, as long as it fits the `LowerCtx` API, it is possible we could land at a better design point. This is not to say that such a final design point could not be reached in a more gradual manner, especially if a DSL is designed to minimize coupling to Cranelift specifics, of course.
## Term-Rewriting or Imperative Control Flow?

Is an expression-node / term-rewriting-system design appropriate, or a system with some imperative flow of lowering actions?

Many instruction selection frameworks are built around a fundamental notion of expression nodes. Every operator in the IR is a node that produces one or more values, the input is a tree or DAG of these nodes, and the rules specify equivalences between expressions.

On the other hand, instruction-selector languages can also be designed with an "imperative backbone": match on an input, but then perform some actions, such as allocating temps and emitting instructions. Typically this design choice comes from a principle of "generalized design" (the expression model may not be flexible enough) and/or from the design of an existing backend.

This design axis is often closely coupled with inversion of control. Expression-node or term-rewriting systems typically have an established core rewriting engine that is part of the system; one provides rules to this engine and invokes it on some input. On the other hand, instruction selectors built in a more imperative style typically interoperate more directly with the rest of the compiler backend, calling into it as needed, and are invoked by an external driver loop.

### Examples of Expression-Based Designs

There are many examples of expression-based designs. There are traditional tree-rewriting instruction selectors, which can be based on various matching algorithms that often make use of specific metacompilation-time optimizations (e.g., to build optimal lookup tables or FSMs). Peepmatic, in our tree, is a good example of this design point, as is the Go compiler DSL cited above.

Another recently studied option is the "e-graph" (equivalence graph) abstraction, which applies rewrite rules to a compact representation of all possible equivalent forms of an expression (see, e.g., the [egg](https://github.com/egraphs-good/egg) crate). An e-graph merges nodes to compress an exponential option space into a typically much more compact form (in close analogy to how BDDs compress boolean functions), and the "all equivalent expressions" abstraction is a powerful one.

### Examples of Imperative Control Flow

There are also instruction selectors based on a more imperative design. A handwritten instruction selector is sometimes the best choice for very fast (JIT-compatible) operation. Our new Cranelift backends operate in an imperative way: a lowering function is invoked on a particular instruction, and it can look "up the tree" as needed, doing its work in a single pass. This design was inspired by Valgrind's instruction selection, which also achieves good efficiency in a single pass.

General tools/languages exist to build pattern matching in a way that has explicit control flow: most prominently, the Prolog programming language provides destructuring and backtracking, and its goal-driven, top-down control flow is closely analogous to how our current handwritten backends work.

### Advantages of Explicit Control Flow

One might wonder why explicit control flow is advantageous, if we are only concerned with a small set of imperative actions that are related to producing an output expression. One answer is that it becomes much more useful when the lowering interface with the rest of the compiler backend becomes more complex: for example, when we need to track side-effect coloring and sink instructions' side effects to merge them.

It also simplifies the DSL compiler and makes it more predictable. For an example of what is avoided: a term-rewriting system can unroll a fixpoint loop that repeatedly applies rewrite rules in order to produce a single-pass lowering implementation, but this static unrolling is undecidable in the general case unless the terms are stratified (separated into layers) and rules are restricted to only descend these strata, avoiding cycles.

As another example, while term-rewriting engines can merge rules to share subsets of logic between different lowering paths, it is in the general case an expensive exponential search problem to "factor" the combinations of rules into shared code. A lower-level DSL with explicit control flow allows the author to directly factor the logic into reusable pieces, as in the sketch below. Optimizing expression-based rule applications into an imperative final form is generally a very challenging problem.
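A minimal sketch (hypothetical names) of such direct factoring: the shared addressing-mode matching lives in one helper, and both the plain-load path and the load-op-merging path call it explicitly:

```rust
/// The shared piece, written once: match base + offset (and similar)
/// expressions into an addressing mode.
fn lower_amode(ctx: &mut Ctx, insn: Insn, input: usize) -> Amode {
    // ...shared matching logic for address expressions...
    unimplemented!()
}

fn lower_plain_load(ctx: &mut Ctx, insn: Insn) {
    let amode = lower_amode(ctx, insn, 0);
    // ...emit a load using `amode`...
}

fn lower_add_with_merged_load(ctx: &mut Ctx, insn: Insn, load: Insn) {
    let amode = lower_amode(ctx, load, 0);
    // ...emit an add-from-memory using the same `amode`...
}
```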
Finally, explicit control flow and an imperative paradigm fit more naturally with the target language (Rust) generated by the metacompiler, and so allow the DSL user to more easily interface with the rest of the compiler backend. This is particularly relevant if we decide that we need fine-grained mixing of handwritten and DSL-generated code.

### Advantages of an Expression-Based System

The advantages of an expression-based system, on the other hand, derive mainly from the fact that it imposes more regularity on the rules. By limiting or disallowing "escape hatches" -- calling into handwritten helper code, or controlling the sequence of rule applications explicitly -- it provides a stronger specification of sorts that can be used for other purposes. For example, while it is likely that an instruction-selector verification effort could ultimately work with either approach, it is clearer at the moment that a pure expression-based system, with equivalence rules, can be directly verified given semantics of the input and output languages. The more restricted semantics can also be simpler to understand in some sense: one is just providing equivalences.

(Side note: there are additional questions about how backend verification would work that require more research, so it is somewhat hard to give a definitive answer as to which DSL features would be required for verification. There is an argument that a more end-to-end approach would be more appropriate, as many bugs occur in the gaps between layers rather than directly in the basic lowering rules. If that's the case, then verification might best be done probabilistically with a fuzzing-like approach, where we drive the compiler with inputs and the compiler produces a "witness proof", or correspondence between input and output. However, if we wish to *statically* verify *the lowering rules themselves*, this task is easier when they are expression equivalences.)

Thus the choice here comes down to flexibility, and possibly more immediate feasibility (in terms of interop with existing code, and the ability to translate existing backends), on the one hand, versus more regularity but possibly greater challenges in adapting to the existing system on the other.

## Existing DSL?

Is there an existing DSL or language that would be appropriate?

There are a few options to seriously consider here.

### Peepmatic

The most immediate option is Peepmatic: it is already in-tree, and is built to work on CLIF.
The major remaining work, if we were to adapt Peepmatic (as an implementation, not just a general approach), would likely be along several lines:

1. Modifying the engine to work with the new backend framework, which requires some careful thought about how to "invert control" and allow the lowering logic to drive the instruction selection.

2. Modifying the engine to operate in a single-pass way rather than iteratively applying rewrite rules. This requires us to work out how to "unroll" chains of rule applications statically, which presents some difficulties as discussed above, but could likely be done to a large degree.

3. Modifying the engine to (meta)compile the matcher down to Rust code, rather than interpreting match actions as it currently does.

4. Finding ways to fit much of the "sub-instruction" lowering logic that currently exists in the new backends into the expression-node paradigm, for example by defining special address-mode nodes and ensuring that these work efficiently (and are never reified into an intermediate version of the program).

5. Tuning Peepmatic to any differences in the sorts of rewrite rules that we will write. For example, Peepmatic is built to share prefixes and suffixes of match-op sequences between many rules, but more complex lowering rules may share the *middle* portion with specific prefixes and suffixes. (This connects back to the "efficient factoring" point above.)

The major upside of using Peepmatic is that it has been well designed, well built, and well tested, and is generally a very high-quality tool.

### Some Other Compiler's DSL

It is also possible that we could adapt an instruction-selector DSL from another compiler. For example, we could take more direct inspiration from LLVM's TableGen and write our own TableGen backend that generates Rust code. This is a "meta-option" to some degree, as the phrase "write our own TableGen backend" hides a lot of work: we would still need to decide on our matching and lowering algorithms.

### A Programming Language with Appropriate Semantics

We could also adopt a more general programming language, with some special fit for the task, as a basis in which to write an instruction selector. Prolog is a good example: it supports destructuring and backtracking, is very high-level, and is a natural fit for writing sets of patterns. We would need, however, to carefully work out the compilation strategy.

### Our Own DSL

Finally, we could decide on the necessary semantics and develop a new small DSL that meets our needs. The author is experimenting with one such design point (see the note below), but it is by no means settled; this is just a fact-finding / research effort to either prove out or disprove certain ideas.
# Next Steps

We should come to a consensus on the requirements above; once we agree on general requirements and design-space tradeoffs, we can discuss the DSL semantics in more detail.

As a fact-finding side-mission of sorts, I (@cfallin) am currently prototyping a DSL that is oriented around pattern-matching, destructuring, and backtracking, and that can be seen as a more general-purpose language in which one can define an instruction selector. It allows calling into existing Rust code and intermixing DSL-generated and handwritten code at a fine granularity. The open questions are (i) whether this approach allows for generating "optimal" instruction selectors as well as higher-level (e.g., instruction-pattern-based) approaches do, and (ii) how well it meets verification-related goals. This is by no means settled or decided upon; I am just as happy to abandon the effort if it turns out that a different approach meets our needs better.

# Acknowledgments

Thanks to @fitzgen, @sunfishcode, and @tschneidereit for extensive discussions and guidance that have allowed the thinking above to develop to this point.