diff --git a/doc/dev/rule-streaming.md b/doc/dev/rule-streaming.md new file mode 100644 index 00000000000..fec88f232fc --- /dev/null +++ b/doc/dev/rule-streaming.md @@ -0,0 +1,241 @@ +# Rule production + +This document describe a new design for the production of build rules +in Dune. The new design aims to be more natural, easier to reason +about and to make existing feature work well with newer ones such as +directory targets. + +## How does rule production works? + +### `Dune_engine.Load_rules` + +The production of rules is driven by the module `Load_rules` in the +`dune_engine` library. This library is the build system core of +Dune. It is meant as a general purpose library for writing build +systems, and the Dune software is built on top of it. In theory, +`dune_engine` shouldn't know about `dune` or `dune-project` +files. However, for historical reason this is not the case yet and +`dune_engine` still knows some things about them. + +As we work on Dune, we expect that `dune_engine` will become more and +more agnostic. Even though it is not completely agnostic, we have +successfully been using it to build Jane Street code base, using the +Jane Street rules on top of this core. So it's already more general +than Dune iself. + +For the purpose of this design doc, we will treat `dune_engine` as a +completly general library that doesn't know about `dune` files. + +The main feature of `Load_dir` is the `Load_dir.load_dir` function: + +```ocaml +val load_dir : dir:Path.Build.t -> Loaded.t Memo.Build.t +``` + +`Loaded.t` represents a "loaded" set of rules for a particular +directory. It can also be thought as a "compiled" set of rules. A +`Loaded.t` contains all the rules in existence that produce targets in +`dir`. For instance, given a `Loaded.t` we can figure out all the +files that would be produced in `dir` if we were building everything +that could be built. While `dir` is a build directory, this also +includes files present in the source tree. This is because +`Load_dir.load_dir` implicitely adds copy rules for all source files +present in the source directory that correspond to the build directory +`dir`. For instance, if `dir` is `_build/default/src`, Dune will +implicit add rules to copy files in `src` to `_build/default/src`. +Except for rules that have the special `promote` or `fallback` modes. + +This is in fact how Dune evaluates globs during the build. Indeed, +when writing `dune` files we work in an imaginary world where both the +source files and the generated files are present. So when we write +`(deps (glob_files *.txt))`, this `*.txt` denotes both `.txt` files +that are present on disk in the source tree but also as the ones that +can be generated by the build. + +In practice, to evaluate `(glob_files *.txt)` in directory `d`, Dune +calls `Load_dir.load_dir ~dir:d` and filter the list of files that can +be built. Similarly, when Dune needs to build a file +`_build/default/src/x`, it first calls `Load_dir.load_dir` with +`_build/default/src` and then looks up a rule that has `x` has +target in the returned `Loaded.t`. The `Load_dir.load_dir` is +memoised, so it can be called multiple times during the build without +guilt. + +While `Load_dir` is responsible for driving the production of rules, +it is part of `dune_engine` which doesn't know about `dune` files and +doesn't know about OCaml libraries or OCaml compilation in general. So +it is not responsible for actually producing the build rules that +allow to build Dune projects. Instead, `Load_dir` defers the actual +production of rules to a callback that it obtains via +`Build_config`. Inside Dune, this callback is implemented by the +`Gen_rules` module inside the `dune_rules` library. `dune_rules` is +the library that is responsible for parsing, interpreting and +compiling `dune` files down to low-level build rules. + +### `Dune_rules.Gen_rules` + +The entry of `Dune_rules.Gen_rules` is the `gen_rules` function. It's +API looks like: + +```ocaml +val gen_rules : dir:Path.Build.t -> Dune_engine.Rules.t Memo.Build.t +``` + +Where `Dune_rules.Rules.t` is a "raw" set of rules. Raw in the sense +that there is no overlap checks or any other checks. During the rule +production phase, we merely accumulate a set of rules that is later +processed. The API of `gen_rules` is in fact a bit more complex, but +the above definition is enough for the purpose of this document. + +The first thing `gen_rules` does is analyse the directory it is +given. If the directory corresponds to a source directory with a `dune` +file, `gen_rules` will dispatch the call to the part of `dune_rules` +that parses and interprets the `dune` file. This is the simplest case, +but even in this case there are some things worth mentioning. + +For instance, when compiling an OCaml library dune stores the +artifacts for the library in generated dot-directories. For instance, +the cmi files for library `foo` living in source directory `src` will +end up in `_build/default/src/.foo.objs/byte`. We could produce these +rules when `gen_rules` is called with direcotry +`_build/default/src/.foo.objs/byte`, however that would spread out the +logic for interpreting `library` stanzas. It is much simpler to +produce all the build rules corresponding to a `library` stanza in one +go. This is what is hapenning at the moment: when called with +directory `_build/default/src`, `gen_rules` will not only produce +rules for this directory but will also produce rules for +`_build/default/src/.foo.objs/byte` and various other directories. + +`Load_rules` doesn't know anything about this. And in particular, it +doesn't know that it is the `gen_rules` call for directory +`_build/default/src/` that will produce the rules for the dot +subdirectories. When `Load_dir` loads the rules for the +`.../.foo.objs/byte` sub-directory, it simply calls `gen_rules` with +this directory. It is `gen_rules` that "redirects" the call to the +`_build/default/src` directory by calling +`Load_rules.load_dir_and_produce_its_rules`. This function simply +calls `Load_rules.load_dir` and re-emits all the raw rules that were +returned by the corresponding `gen_rules` call. + +This works because `Load_rules.load_dir` accepts the facts that +`gen_rules` produces rules for many directory at once. It simply +filters out the result. But for things to behave well, the unwritten +following invariant must hold: `gen_rules ~dir:d` is allowed to +generate rules for directory `d'` iff `gen_rules ~dir:d'` emits a call +to `Load_rules.load_dir ~dir:d`. + +This schenario happens in a number of cases. All these cases share a +common pattern: the redirections are always to an ancestor +directory. At the moment, there is one exception to this pattern in +the odoc rules, however it is easy to remove. + +Finally, the `copy_files` stanza creates another form of dependency +between directory. In order to calculate the targets produced by +`copy_files`, which needs to be known at rule production time, we need +to evaluate the glob given to `copy_files`. Which requires doing a +call to `Load_rules.load_dir` as previously described. Contrary to the +other form of dependency we just describe, this ones can go from any +directory to any other directory. For instance, the following stanza +in `src/dune`: + +``` +(copy_files foo/*.txt) +``` + +would create a dependency from `_build/default/src` to + `_build_default/src/foo`. + +So in the end, if we were looking at the internal computation graph of +Dune and narrowing it to just the calls to `Load_rules.load_dir`, we +would see a graph with many edges going from a directory to one of its +ancestor. These would mostly be between generated dot-subdirectories +and their first ancestor that has a corresponding directory in the +source tree. Plus a few other arbitrary ones for each `copy_files` +stanza. + +## Directory targets + +Before directory targets, answering the question "what rules produces +file X?" was easy. Dune would just call `Load_rules.load_dir` and +lookup `X` in the result. With directory targets, things are a bit +more complicated. Indeed, `X` might also be produced by a directory +target in an ancestor directory. This means that `Load_rules.load_dir` +now need to look in parent directories as well, which introduce more +dependencies from directories to their parents and can create cycles +because of `copy_files` stanza that create dependencies in the other +direction. + +At a result, some combinations of `copy_files` and directory targets +don't produce the expected result. This is documented in the test +suite. + +## Problem + +The above description exposes a concrete problem with directory +targets, but there is also a general sense of messiness in the way +things work. Generating rules for multiple directories at once is +natural, but the current encoding is odd. + +## Proposal + +The proposal is to add the following rule: `gen_rules ~dir` is allowed +to produced rules in `dir` or any of its descendant only. It is not +allowed to produce rules anywhere else. + +`Load_rules.load_dir ~dir` will then always call itself recursively on +the parent of `dir` and take the union of the rules produced by +`gen_rules` for `dir` and the ones produced by the recursive +call. `gen_rules` will no longer have to redirect a call via +`Load_rules.load_dir_and_produce_its_rules`, which we would simply +remove. + +This introduces a cycle with all `copy_files` stanza that copy files +from a sub-directory. We propose the break this cycle by introducing +laziness in the rule production code. + +### Generating rules with a mask + +The idea is that when we produce rules, we will produce rules under +a current active "mask" that tells us where we are allowed to generate +files or directories. Trying to produce a rule with targets not +matched by this mask will be a runtime error. + +When entering `gen_rules ~dir`, the initial mask will be: "any files +and directories that is a descendant of directory `dir`". + +We can then narrow the mask to a sub-mask: + +```ocaml +val narrow : Target_mask.t -> unit Memo.Build.t -> unit Memo.Build.t +``` + +With `narrow mask m`, `m` would only be allowed to produce rules whose +target are matched by the intersection of `mask` and the current +mask. `m` wouldn't be evaluated eagerly. Instead, `gen_rules` would +now return a set of direct rules as well as a list of +`(Target_mask.t * unit Memo.Build.t)`. Let's call such a pair a +suspension. A suspension can be forced by evaluation its second +component. Doing so will yield a list of rules matched by the mask and +a new list of suspension. + +### Staged rules loading + +The next step is to stage `Load_rules.load_dir`. In addition to taking +a directory, `load_dir` will now also take a mask and will return the +set of rules for this mask. To do that, it might need to force a bunch of +suspensions recursively. + + +### How does that help? + +We will put `copy_rules` under a `narrow `. In order to determine if a directory is part of a directory +target in an ancestor directory, we wouldn't need to force this +suspension. + +### Difficulties + +Interpreting a `library` stanza requires knowing the set of `.ml` +files in the current directrory. Knowing this requires interpreting +`copy_files` in the current directory. So the interpretation of +`library` stanzas will need to go under a `narrow` as well.