diff --git a/src/SUMMARY.md b/src/SUMMARY.md index f8b1f1d6e0c1f..8e18969a1f6c5 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -5,16 +5,19 @@ - [Using the compiler testing framework](./running-tests.md) - [Walkthrough: a typical contribution](./walkthrough.md) - [High-level overview of the compiler source](./high-level-overview.md) +- [Queries: demand-driven compilation](./query.md) + - [Incremental compilation](./incremental-compilation.md) - [The parser](./the-parser.md) - [Macro expansion](./macro-expansion.md) - [Name resolution](./name-resolution.md) -- [HIR lowering](./hir-lowering.md) +- [The HIR (High-level IR)](./hir.md) - [The `ty` module: representing types](./ty.md) - [Type inference](./type-inference.md) - [Trait resolution](./trait-resolution.md) - [Type checking](./type-checking.md) -- [MIR construction](./mir-construction.md) -- [MIR borrowck](./mir-borrowck.md) -- [MIR optimizations](./mir-optimizations.md) +- [The MIR (Mid-level IR)](./mir.md) + - [MIR construction](./mir-construction.md) + - [MIR borrowck](./mir-borrowck.md) + - [MIR optimizations](./mir-optimizations.md) - [trans: generating LLVM IR](./trans.md) - [Glossary](./glossary.md) diff --git a/src/glossary.md b/src/glossary.md index b66e17ea3c18c..3202e5f4ce354 100644 --- a/src/glossary.md +++ b/src/glossary.md @@ -9,23 +9,24 @@ AST | the abstract syntax tree produced by the syntax crate codegen unit | when we produce LLVM IR, we group the Rust code into a number of codegen units. Each of these units is processed by LLVM independently from one another, enabling parallelism. They are also the unit of incremental re-use. cx | we tend to use "cx" as an abbrevation for context. See also `tcx`, `infcx`, etc. DefId | an index identifying a definition (see `librustc/hir/def_id.rs`). Uniquely identifies a `DefPath`. -HIR | the High-level IR, created by lowering and desugaring the AST. See `librustc/hir`. 
+HIR | the High-level IR, created by lowering and desugaring the AST ([see more](hir.html)) HirId | identifies a particular node in the HIR by combining a def-id with an "intra-definition offset". -'gcx | the lifetime of the global arena (see `librustc/ty`). +'gcx | the lifetime of the global arena ([see more](ty.html)) generics | the set of generic type parameters defined on a type or item ICE | internal compiler error. When the compiler crashes. infcx | the inference context (see `librustc/infer`) -MIR | the Mid-level IR that is created after type-checking for use by borrowck and trans. Defined in the `src/librustc/mir/` module, but much of the code that manipulates it is found in `src/librustc_mir`. -obligation | something that must be proven by the trait system; see `librustc/traits`. +MIR | the Mid-level IR that is created after type-checking for use by borrowck and trans ([see more](./mir.html)) +obligation | something that must be proven by the trait system ([see more](trait-resolution.html)) local crate | the crate currently being compiled. node-id or NodeId | an index identifying a particular node in the AST or HIR; gradually being phased out and replaced with `HirId`. -query | perhaps some sub-computation during compilation; see `librustc/maps`. -provider | the function that executes a query; see `librustc/maps`. +query | perhaps some sub-computation during compilation ([see more](query.html)) +provider | the function that executes a query ([see more](query.html)) sess | the compiler session, which stores global data used throughout compilation side tables | because the AST and HIR are immutable once created, we often carry extra information about them in the form of hashtables, indexed by the id of a particular node. span | a location in the user's source code, used for error reporting primarily. These are like a file-name/line-number/column tuple on steroids: they carry a start/end point, and also track macro expansions and compiler desugaring. 
All while being packed into a few bytes (really, it's an index into a table). See the Span datatype for more. substs | the substitutions for a given generic type or item (e.g., the `i32`, `u32` in `HashMap`) -tcx | the "typing context", main data structure of the compiler (see `librustc/ty`). +tcx | the "typing context", main data structure of the compiler ([see more](ty.html)) +'tcx | the lifetime of the currently active inference context ([see more](ty.html)) trans | the code to translate MIR into LLVM IR. -trait reference | a trait and values for its type parameters (see `librustc/ty`). -ty | the internal representation of a type (see `librustc/ty`). +trait reference | a trait and values for its type parameters ([see more](ty.html)). +ty | the internal representation of a type ([see more](ty.html)). diff --git a/src/hir-lowering.md b/src/hir.md similarity index 98% rename from src/hir-lowering.md rename to src/hir.md index e28bb4cd40ffc..5d5e273c4792a 100644 --- a/src/hir-lowering.md +++ b/src/hir.md @@ -1,4 +1,4 @@ -# HIR lowering +# The HIR The HIR -- "High-level IR" -- is the primary IR used in most of rustc. It is a desugared version of the "abstract syntax tree" (AST) @@ -116,4 +116,4 @@ associated with an **owner**, which is typically some kind of item (e.g., a `fn()` or `const`), but could also be a closure expression (e.g., `|x, y| x + y`). You can use the HIR map to find the body associated with a given def-id (`maybe_body_owned_by()`) or to find -the owner of a body (`body_owner_def_id()`). \ No newline at end of file +the owner of a body (`body_owner_def_id()`). diff --git a/src/incremental-compilation.md b/src/incremental-compilation.md new file mode 100644 index 0000000000000..23910c5b38917 --- /dev/null +++ b/src/incremental-compilation.md @@ -0,0 +1,139 @@ +# Incremental compilation + +The incremental compilation scheme is, in essence, a surprisingly +simple extension to the overall query system. 
We'll start by describing +a slightly simplified variant of the real thing, the "basic algorithm", and then describe +some possible improvements. + +## The basic algorithm + +The basic algorithm is +called the **red-green** algorithm[^salsa]. The high-level idea is +that, after each run of the compiler, we will save the results of all +the queries that we do, as well as the **query DAG**. The +**query DAG** is a [DAG] that indicates which queries executed which +other queries. So for example there would be an edge from a query Q1 +to another query Q2 if computing Q1 required computing Q2 (note that +because queries cannot depend on themselves, this results in a DAG and +not a general graph). + +[DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph + +On the next run of the compiler, then, we can sometimes reuse these +query results to avoid re-executing a query. We do this by assigning +every query a **color**: + +- If a query is colored **red**, that means that its result during + this compilation has **changed** from the previous compilation. +- If a query is colored **green**, that means that its result is + the **same** as the previous compilation. + +There are two key insights here: + +- First, if all the inputs to query Q are colored green, then the + query Q **must** result in the same value as last time and hence + need not be re-executed (or else the compiler is not deterministic). +- Second, even if some inputs to a query change, it may be that it + **still** produces the same result as the previous compilation. In + particular, the query may only use part of its input. + - Therefore, after executing a query, we always check whether it + produced the same result as the previous time. **If it did,** we + can still mark the query as green, and hence avoid re-executing + dependent queries. + +### The try-mark-green algorithm + +The core of incremental compilation is an algorithm called +"try-mark-green". 
It has the job of determining the color of a given +query Q (which must not yet have been executed). In cases where Q has +red inputs, determining Q's color may involve re-executing Q so that +we can compare its output; but if all of Q's inputs are green, then we +can determine that Q must be green without re-executing it or inspecting +its value whatsoever. In the compiler, this allows us to avoid +deserializing the result from disk when we don't need it, and -- in +fact -- enables us to sometimes skip *serializing* the result as well +(see the refinements section below). + +Try-mark-green works as follows: + +- First check if the query Q was executed during the previous + compilation. + - If not, we can just re-execute the query as normal, and assign it the + color of red. +- If yes, then we load the `reads(Q)` vector from the + query DAG. The "reads" is the set of queries that Q executed during + its execution. + - For each query R in `reads(Q)`, we recursively demand the color + of R using try-mark-green. + - Note: it is important that we visit each node in `reads(Q)` in the same order + as they occurred in the original compilation. See [the section on the query DAG below](#dag). + - If **any** of the nodes in `reads(Q)` wind up colored **red**, then Q is dirty. + - We re-execute Q and compare the hash of its result to the hash of the result + from the previous compilation. + - If the hash has not changed, we can mark Q as **green** and return; + otherwise Q is **red**. + - Otherwise, **all** of the nodes in `reads(Q)` must be **green**. In that case, + we can color Q as **green** and return. + + + +### The query DAG + +The query DAG code is stored in +[`src/librustc/dep_graph`][dep_graph]. Construction of the DAG is done +by instrumenting the query execution. 
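To make the try-mark-green steps above concrete, here is a minimal sketch in plain Rust. It is not the code in `src/librustc/dep_graph`: the names `Incremental`, `PrevNode`, `new_hashes`, and `executed` are all hypothetical, queries are identified by string names, and "executing" a query is simulated by looking up a precomputed hash. The sketch also assumes that changed inputs have already been colored red by the caller.

```rust
use std::collections::HashMap;

/// Color of a query in the current compilation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Color {
    Red,
    Green,
}

/// What the previous compilation saved for one query (hypothetical layout).
pub struct PrevNode {
    /// Queries read by this query, in the order they were read.
    pub reads: Vec<&'static str>,
    /// Hash of the result produced last time.
    pub result_hash: u64,
}

/// Toy incremental state; queries are identified by name.
pub struct Incremental {
    /// Saved data from the previous compilation.
    pub prev: HashMap<&'static str, PrevNode>,
    /// Colors known so far; changed inputs are pre-seeded as red.
    pub colors: HashMap<&'static str, Color>,
    /// Stand-in for real execution: the result hash each query produces now.
    pub new_hashes: HashMap<&'static str, u64>,
    /// Log of the queries that were actually re-executed.
    pub executed: Vec<&'static str>,
}

impl Incremental {
    /// Pretend to run the query and return the hash of its new result.
    /// Panics if no simulated hash was registered for `q`.
    fn execute(&mut self, q: &'static str) -> u64 {
        self.executed.push(q);
        self.new_hashes[q]
    }

    /// Determine the color of `q`, re-executing it only when needed.
    pub fn try_mark_green(&mut self, q: &'static str) -> Color {
        if let Some(&color) = self.colors.get(q) {
            return color;
        }
        // Copy out what we need so that no borrow of `self.prev` is held
        // across the recursive calls below.
        let saved = self.prev.get(q).map(|n| (n.reads.clone(), n.result_hash));
        let color = match saved {
            // Not executed last time: run it and color it red.
            None => {
                self.execute(q);
                Color::Red
            }
            Some((reads, prev_hash)) => {
                // Visit the reads in their original order and stop at the
                // first red one, since Q may take a different path after it.
                let mut saw_red = false;
                for r in reads {
                    if self.try_mark_green(r) == Color::Red {
                        saw_red = true;
                        break;
                    }
                }
                // Re-execute only if some read was red; an unchanged
                // result hash still lets us color Q green.
                if saw_red && self.execute(q) != prev_hash {
                    Color::Red
                } else {
                    Color::Green
                }
            }
        };
        self.colors.insert(q, color);
        color
    }
}
```

In this toy model, if a red input feeds a query whose recomputed hash matches the saved one, that query is re-executed once and colored green, and everything that only reads it is colored green without being executed at all.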
+ +One key point is that the query DAG also tracks ordering; that is, for +each query Q, we not only track the queries that Q reads, we also track the +**order** in which they were read. This allows try-mark-green to walk +those queries back in the same order. This is important because once a subquery comes back as red, +we can no longer be sure that Q will continue along the same path as before. +That is, imagine a query like this: + +```rust,ignore +fn main_query(tcx) { + if tcx.subquery1() { + tcx.subquery2() + } else { + tcx.subquery3() + } +} +``` + +Now imagine that in the first compilation, `main_query` starts by +executing `subquery1`, and this returns true. In that case, the next +query `main_query` executes will be `subquery2`, and `subquery3` will +not be executed at all. + +But now imagine that in the **next** compilation, the input has +changed such that `subquery1` returns **false**. In this case, `subquery2` would never +execute. If try-mark-green were to visit `reads(main_query)` out of order, +however, it might have visited `subquery2` before `subquery1`, and hence executed it. +This can lead to ICEs and other problems in the compiler. + +[dep_graph]: https://github.com/rust-lang/rust/tree/master/src/librustc/dep_graph + +## Improvements to the basic algorithm + +In the description of the basic algorithm, we said that at the end of +compilation we would save the results of all the queries that were +performed. In practice, this can be quite wasteful -- many of those +results are very cheap to recompute, and serializing and deserializing +them is not a particular win. In practice, what we would do is to save +**the hashes** of all the subqueries that we performed. Then, in select cases, +we **also** save the results. + +This is why the incremental algorithm separates computing the +**color** of a node, which often does not require its value, from +computing the **result** of a node. 
Computing the result is done via a simple algorithm +like so: + +- Check if a saved result for Q is available. If so, compute the color of Q. + If Q is green, deserialize and return the saved result. +- Otherwise, execute Q. + - We can then compare the hash of the result and color Q as green if + it did not change. + +# Footnotes + +[^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis diff --git a/src/mir.md b/src/mir.md new file mode 100644 index 0000000000000..eeba6847295d1 --- /dev/null +++ b/src/mir.md @@ -0,0 +1,6 @@ +# The MIR (Mid-level IR) + +TODO + +Defined in the `src/librustc/mir/` module, but much of the code that +manipulates it is found in `src/librustc_mir`. diff --git a/src/query.md b/src/query.md new file mode 100644 index 0000000000000..65d651307ac39 --- /dev/null +++ b/src/query.md @@ -0,0 +1,314 @@ +# Queries: demand-driven compilation + +As described in [the high-level overview of the compiler][hl], the +Rust compiler is currently transitioning from a traditional "pass-based" +setup to a "demand-driven" system. **The Compiler Query System is the +key to our new demand-driven organization.** The idea is pretty +simple. You have various queries that compute things about the input +-- for example, there is a query called `type_of(def_id)` that, given +the def-id of some item, will compute the type of that item and return +it to you. + +[hl]: high-level-overview.html + +Query execution is **memoized** -- so the first time you invoke a +query, it will go do the computation, but the next time, the result is +returned from a hashtable. Moreover, query execution fits nicely into +**incremental computation**; the idea is roughly that, when you do a +query, the result **may** be returned to you by loading stored data +from disk (but that's a separate topic we won't discuss further here). + +The overall vision is that, eventually, the entire compiler +control-flow will be query driven. 
There will effectively be one +top-level query ("compile") that will run compilation on a crate; this +will in turn demand information about that crate, starting from the +*end*. For example: + +- This "compile" query might demand to get a list of codegen-units + (i.e., modules that need to be compiled by LLVM). +- But computing the list of codegen-units would invoke some subquery + that returns the list of all modules defined in the Rust source. +- That query in turn would invoke something asking for the HIR. +- This keeps going further and further back until we wind up doing the + actual parsing. + +However, that vision is not yet fully realized. Still, big chunks of the +compiler (for example, generating MIR) work exactly like this. + +### Invoking queries + +Invoking a query is simple. The tcx ("typing context") offers a method +for each defined query. So, for example, to invoke the `type_of` +query, you would just do this: + +```rust +let ty = tcx.type_of(some_def_id); +``` + +### Cycles between queries + +Currently, cycles during query execution should always result in a +compilation error. Typically, they arise because of illegal programs +that contain cyclic references they shouldn't (though sometimes they +arise because of compiler bugs, in which case we need to factor our +queries in a more fine-grained fashion to avoid them). + +However, it is nonetheless often useful to *recover* from a cycle +(after reporting an error, say) and try to soldier on, so as to give a +better user experience. In order to recover from a cycle, you don't +get to use the nice method-call-style syntax. Instead, you invoke +using the `try_get` method, which looks roughly like this: + +```rust +use ty::maps::queries; +... +match queries::type_of::try_get(tcx, DUMMY_SP, self.did) { + Ok(result) => { + // no cycle occurred! You can use `result` + } + Err(err) => { + // A cycle occurred! 
The error value `err` is a `DiagnosticBuilder`, + // meaning essentially an "in-progress", not-yet-reported error message. + // See below for more details on what to do here. + } +} +``` + +So, if you get back an `Err` from `try_get`, then a cycle *did* occur. This means that +you must ensure that a compiler error message is reported. You can do that in two ways: + +The simplest is to invoke `err.emit()`. This will emit the cycle error to the user. + +However, often cycles happen because of an illegal program, and you +know at that point that an error either already has been reported or +will be reported due to this cycle by some other bit of code. In that +case, you can invoke `err.cancel()` to not emit any error. It is +traditional to then invoke: + +``` +tcx.sess.delay_span_bug(some_span, "some message") +``` + +`delay_span_bug()` is a helper that says: we expect a compilation +error to have happened or to happen in the future; so, if compilation +ultimately succeeds, make an ICE with the message `"some +message"`. This is basically just a precaution in case you are wrong. + +### How the compiler executes a query + +So you may be wondering what happens when you invoke a query +method. The answer is that, for each query, the compiler maintains a +cache -- if your query has already been executed, then the answer is +simple: we clone the return value out of the cache and return it +(therefore, you should try to ensure that the return types of queries +are cheaply cloneable; insert an `Rc` if necessary). + +#### Providers + +If, however, the query is *not* in the cache, then the compiler will +try to find a suitable **provider**. A provider is a function that has +been defined and linked into the compiler somewhere that contains the +code to compute the result of the query. + +**Providers are defined per-crate.** The compiler maintains, +internally, a table of providers for every crate, at least +conceptually. 
Right now, there are really two sets: the providers for +queries about the **local crate** (that is, the one being compiled) +and providers for queries about **external crates** (that is, +dependencies of the local crate). Note that what determines the crate +that a query is targeting is not the *kind* of query, but the *key*. +For example, when you invoke `tcx.type_of(def_id)`, that could be a +local query or an external query, depending on what crate the `def_id` +is referring to (see the `self::keys::Key` trait for more information +on how that works). + +Providers always have the same signature: + +```rust +fn provider<'cx, 'tcx>(tcx: TyCtxt<'cx, 'tcx, 'tcx>, + key: QUERY_KEY) + -> QUERY_RESULT +{ + ... +} +``` + +Providers take two arguments: the `tcx` and the query key. Note also +that they take the *global* tcx (i.e., they use the `'tcx` lifetime +twice), rather than taking a tcx with some active inference context. +They return the result of the query. + +#### How providers are set up + +When the tcx is created, it is given the providers by its creator using +the `Providers` struct. This struct is generated by the macros here, but it +is basically a big list of function pointers: + +```rust +struct Providers { + type_of: for<'cx, 'tcx> fn(TyCtxt<'cx, 'tcx, 'tcx>, DefId) -> Ty<'tcx>, + ... +} +``` + +At present, we have one copy of the struct for the local crate, and one +for external crates, though the plan is that we may eventually have +one per crate. + +These `Providers` structs are ultimately created and populated by +`librustc_driver`, but it does this by distributing the work +throughout the other `rustc_*` crates. This is done by invoking +various `provide` functions. These functions tend to look something +like this: + +```rust +pub fn provide(providers: &mut Providers) { + *providers = Providers { + type_of, + ..*providers + }; +} +``` + +That is, they take an `&mut Providers` and mutate it in place. 
Usually +we use the formulation above just because it looks nice, but you could +just as well write `providers.type_of = type_of`, which would be equivalent. +(Here, `type_of` would be a top-level function, defined as we saw +before.) So, if we want to add a provider for some other query, +let's call it `fubar`, to the crate above, we might modify the `provide()` +function like so: + +```rust +pub fn provide(providers: &mut Providers) { + *providers = Providers { + type_of, + fubar, + ..*providers + }; +} + +fn fubar<'cx, 'tcx>(tcx: TyCtxt<'cx, 'tcx, 'tcx>, key: DefId) -> Fubar<'tcx> { .. } +``` + +NB. Most of the `rustc_*` crates only provide **local +providers**. Almost all **extern providers** wind up going through the +[`rustc_metadata` crate][rustc_metadata], which loads the information from the crate +metadata. But in some cases there are crates that provide queries for +*both* local and external crates, in which case they define both a +`provide` and a `provide_extern` function that `rustc_driver` can +invoke. + +[rustc_metadata]: https://github.com/rust-lang/rust/tree/master/src/librustc_metadata + +### Adding a new kind of query + +So suppose you want to add a new kind of query, how do you do so? +Well, defining a query takes place in two steps: + +1. first, you have to specify the query name and arguments; and then, +2. you have to supply query providers where needed. + +To specify the query name and arguments, you simply add an entry to +the big macro invocation in +[`src/librustc/ty/maps/mod.rs`][maps-mod]. This will probably have +changed by the time you read this README, but at present it looks +something like: + +[maps-mod]: https://github.com/rust-lang/rust/blob/master/src/librustc/ty/maps/mod.rs + +``` +define_maps! { <'tcx> + /// Records the type of every item. + [] fn type_of: TypeOfItem(DefId) -> Ty<'tcx>, + + ... +} +``` + +Each line of the macro defines one query. 
The name is broken up like this: + +``` +[] fn type_of: TypeOfItem(DefId) -> Ty<'tcx>, +^^ ^^^^^^^ ^^^^^^^^^^ ^^^^^ ^^^^^^^^ +| | | | | +| | | | result type of query +| | | query key type +| | dep-node constructor +| name of query +query flags +``` + +Let's go over them one by one: + +- **Query flags:** these are largely unused right now, but the intention + is that we'll be able to customize various aspects of how the query is + processed. +- **Name of query:** the name of the query method + (`tcx.type_of(..)`). Also used as the name of a struct + (`ty::maps::queries::type_of`) that will be generated to represent + this query. +- **Dep-node constructor:** indicates the constructor function that + connects this query to incremental compilation. Typically, this is a + `DepNode` variant, which can be added by modifying the + `define_dep_nodes!` macro invocation in + [`librustc/dep_graph/dep_node.rs`][dep-node]. + - However, sometimes we use a custom function, in which case the + name will be in snake case and the function will be defined at the + bottom of the file. This is typically used when the query key is + not a def-id, or just not the type that the dep-node expects. +- **Query key type:** the type of the argument to this query. + This type must implement the `ty::maps::keys::Key` trait, which + defines (for example) how to map it to a crate, and so forth. +- **Result type of query:** the type produced by this query. This type + should (a) not use `RefCell` or other interior mutability and (b) be + cheaply cloneable. Interning or using `Rc` or `Arc` is recommended for + non-trivial data types. + - The one exception to those rules is the `ty::steal::Steal` type, + which is used to cheaply modify MIR in place. See the definition + of `Steal` for more details. New uses of `Steal` should **not** be + added without alerting `@rust-lang/compiler`. 
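The advice above that query results be cheaply cloneable can be illustrated with a toy memoizing cache. This is only a sketch under stated assumptions: `ToyCache`, `item_names`, and `provider` are hypothetical names, the key is a bare `u32` rather than a `DefId`, and nothing here reflects how `define_maps!` actually generates its caches. The point is just that wrapping a non-trivial result in `Rc` makes "clone the value out of the cache" a reference-count bump rather than a deep copy.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical provider: computes the (non-trivial) result for a key.
fn provider(key: u32) -> Vec<String> {
    (0..3).map(|i| format!("item{}_{}", key, i)).collect()
}

/// Toy cache for a single query whose result type is `Rc<Vec<String>>`.
pub struct ToyCache {
    results: HashMap<u32, Rc<Vec<String>>>,
    /// Counts how often the provider actually ran.
    pub provider_calls: usize,
}

impl ToyCache {
    pub fn new() -> Self {
        ToyCache {
            results: HashMap::new(),
            provider_calls: 0,
        }
    }

    /// Memoized query method: returns a cheap clone of the cached `Rc`,
    /// invoking the provider only on the first request for `key`.
    pub fn item_names(&mut self, key: u32) -> Rc<Vec<String>> {
        if let Some(hit) = self.results.get(&key) {
            return Rc::clone(hit);
        }
        self.provider_calls += 1;
        let value = Rc::new(provider(key));
        self.results.insert(key, Rc::clone(&value));
        value
    }
}
```

Calling `item_names` twice with the same key runs the provider once, and both returned handles point at the same allocation.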
+ +[dep-node]: https://github.com/rust-lang/rust/blob/master/src/librustc/dep_graph/dep_node.rs + +So, to add a query: + +- Add an entry to `define_maps!` using the format above. +- Possibly add a corresponding entry to the dep-node macro. +- Link the provider by modifying the appropriate `provide` method; + or add a new one if needed and ensure that `rustc_driver` is invoking it. + +#### Query structs and descriptions + +For each kind, the `define_maps` macro will generate a "query struct" +named after the query. This struct is a kind of a place-holder +describing the query. Each such struct implements the +`self::config::QueryConfig` trait, which has associated types for the +key/value of that particular query. Basically the code generated looks something +like this: + +```rust +// Dummy struct representing a particular kind of query: +pub struct type_of<'tcx> { phantom: PhantomData<&'tcx ()> } + +impl<'tcx> QueryConfig for type_of<'tcx> { + type Key = DefId; + type Value = Ty<'tcx>; +} +``` + +There is an additional trait that you may wish to implement called +`self::config::QueryDescription`. This trait is used during cycle +errors to give a "human readable" name for the query, so that we can +summarize what was happening when the cycle occurred. Implementing +this trait is optional if the query key is `DefId`, but if you *don't* +implement it, you get a pretty generic error ("processing `foo`..."). +You can put new impls into the `config` module. They look something like this: + +```rust +impl<'tcx> QueryDescription for queries::type_of<'tcx> { + fn describe(tcx: TyCtxt, key: DefId) -> String { + format!("computing the type of `{}`", tcx.item_path_str(key)) + } +} +``` +
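To show how the query struct, `QueryConfig`, and `QueryDescription` fit together, here is a self-contained sketch that compiles outside the compiler. The two traits below are simplified stand-ins written for this example, not the real definitions in `ty::maps::config`: the key is a `u32` instead of a `DefId`, the value is a `&'tcx str` instead of a `Ty<'tcx>`, and `describe` takes no `tcx`. The helper `cycle_note` is hypothetical and only illustrates how a description might be used when summarizing a cycle.

```rust
use std::marker::PhantomData;

/// Simplified stand-in for the real `QueryConfig` trait.
pub trait QueryConfig {
    type Key;
    type Value;
}

/// Simplified stand-in for the real `QueryDescription` trait.
pub trait QueryDescription: QueryConfig {
    fn describe(key: &Self::Key) -> String;
}

/// Dummy struct representing one kind of query.
#[allow(non_camel_case_types, dead_code)]
pub struct type_of<'tcx> {
    phantom: PhantomData<&'tcx ()>,
}

impl<'tcx> QueryConfig for type_of<'tcx> {
    type Key = u32; // assumption: stands in for `DefId`
    type Value = &'tcx str; // assumption: stands in for `Ty<'tcx>`
}

impl<'tcx> QueryDescription for type_of<'tcx> {
    fn describe(key: &u32) -> String {
        format!("computing the type of `item#{}`", key)
    }
}

/// Hypothetical helper: one human-readable line per query on the stack.
pub fn cycle_note<Q: QueryDescription>(stack: &[Q::Key]) -> Vec<String> {
    stack.iter().map(|key| Q::describe(key)).collect()
}
```

Because the description lives on the query struct rather than on the key type, two queries that share a key type can still produce different messages.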