Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add rule streaming proposal #7195

Merged
merged 1 commit into from
Feb 28, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
241 changes: 241 additions & 0 deletions doc/dev/rule-streaming.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,241 @@
# Rule production

This document describe a new design for the production of build rules
in Dune. The new design aims to be more natural, easier to reason
about and to make existing feature work well with newer ones such as
directory targets.

## How does rule production works?

### `Dune_engine.Load_rules`

The production of rules is driven by the module `Load_rules` in the
`dune_engine` library. This library is the build system core of
Dune. It is meant as a general purpose library for writing build
systems, and the Dune software is built on top of it. In theory,
`dune_engine` shouldn't know about `dune` or `dune-project`
files. However, for historical reason this is not the case yet and
`dune_engine` still knows some things about them.

As we work on Dune, we expect that `dune_engine` will become more and
more agnostic. Even though it is not completely agnostic, we have
successfully been using it to build Jane Street code base, using the
Jane Street rules on top of this core. So it's already more general
than Dune iself.

For the purpose of this design doc, we will treat `dune_engine` as a
completly general library that doesn't know about `dune` files.

The main feature of `Load_dir` is the `Load_dir.load_dir` function:

```ocaml
val load_dir : dir:Path.Build.t -> Loaded.t Memo.Build.t
```

`Loaded.t` represents a "loaded" set of rules for a particular
directory. It can also be thought as a "compiled" set of rules. A
`Loaded.t` contains all the rules in existence that produce targets in
`dir`. For instance, given a `Loaded.t` we can figure out all the
files that would be produced in `dir` if we were building everything
that could be built. While `dir` is a build directory, this also
includes files present in the source tree. This is because
`Load_dir.load_dir` implicitely adds copy rules for all source files
present in the source directory that correspond to the build directory
`dir`. For instance, if `dir` is `_build/default/src`, Dune will
implicit add rules to copy files in `src` to `_build/default/src`.
Except for rules that have the special `promote` or `fallback` modes.

This is in fact how Dune evaluates globs during the build. Indeed,
when writing `dune` files we work in an imaginary world where both the
source files and the generated files are present. So when we write
`(deps (glob_files *.txt))`, this `*.txt` denotes both `.txt` files
that are present on disk in the source tree but also as the ones that
can be generated by the build.

In practice, to evaluate `(glob_files *.txt)` in directory `d`, Dune
calls `Load_dir.load_dir ~dir:d` and filter the list of files that can
be built. Similarly, when Dune needs to build a file
`_build/default/src/x`, it first calls `Load_dir.load_dir` with
`_build/default/src` and then looks up a rule that has `x` has
target in the returned `Loaded.t`. The `Load_dir.load_dir` is
memoised, so it can be called multiple times during the build without
guilt.

While `Load_dir` is responsible for driving the production of rules,
it is part of `dune_engine` which doesn't know about `dune` files and
doesn't know about OCaml libraries or OCaml compilation in general. So
it is not responsible for actually producing the build rules that
allow to build Dune projects. Instead, `Load_dir` defers the actual
production of rules to a callback that it obtains via
`Build_config`. Inside Dune, this callback is implemented by the
`Gen_rules` module inside the `dune_rules` library. `dune_rules` is
the library that is responsible for parsing, interpreting and
compiling `dune` files down to low-level build rules.

### `Dune_rules.Gen_rules`

The entry of `Dune_rules.Gen_rules` is the `gen_rules` function. It's
API looks like:

```ocaml
val gen_rules : dir:Path.Build.t -> Dune_engine.Rules.t Memo.Build.t
```

Where `Dune_rules.Rules.t` is a "raw" set of rules. Raw in the sense
that there is no overlap checks or any other checks. During the rule
production phase, we merely accumulate a set of rules that is later
processed. The API of `gen_rules` is in fact a bit more complex, but
the above definition is enough for the purpose of this document.

The first thing `gen_rules` does is analyse the directory it is
given. If the directory corresponds to a source directory with a `dune`
file, `gen_rules` will dispatch the call to the part of `dune_rules`
that parses and interprets the `dune` file. This is the simplest case,
but even in this case there are some things worth mentioning.

For instance, when compiling an OCaml library dune stores the
artifacts for the library in generated dot-directories. For instance,
the cmi files for library `foo` living in source directory `src` will
end up in `_build/default/src/.foo.objs/byte`. We could produce these
rules when `gen_rules` is called with direcotry
`_build/default/src/.foo.objs/byte`, however that would spread out the
logic for interpreting `library` stanzas. It is much simpler to
produce all the build rules corresponding to a `library` stanza in one
go. This is what is hapenning at the moment: when called with
directory `_build/default/src`, `gen_rules` will not only produce
rules for this directory but will also produce rules for
`_build/default/src/.foo.objs/byte` and various other directories.

`Load_rules` doesn't know anything about this. And in particular, it
doesn't know that it is the `gen_rules` call for directory
`_build/default/src/` that will produce the rules for the dot
subdirectories. When `Load_dir` loads the rules for the
`.../.foo.objs/byte` sub-directory, it simply calls `gen_rules` with
this directory. It is `gen_rules` that "redirects" the call to the
`_build/default/src` directory by calling
`Load_rules.load_dir_and_produce_its_rules`. This function simply
calls `Load_rules.load_dir` and re-emits all the raw rules that were
returned by the corresponding `gen_rules` call.

This works because `Load_rules.load_dir` accepts the facts that
`gen_rules` produces rules for many directory at once. It simply
filters out the result. But for things to behave well, the unwritten
following invariant must hold: `gen_rules ~dir:d` is allowed to
generate rules for directory `d'` iff `gen_rules ~dir:d'` emits a call
to `Load_rules.load_dir ~dir:d`.

This schenario happens in a number of cases. All these cases share a
common pattern: the redirections are always to an ancestor
directory. At the moment, there is one exception to this pattern in
the odoc rules, however it is easy to remove.

Finally, the `copy_files` stanza creates another form of dependency
between directory. In order to calculate the targets produced by
`copy_files`, which needs to be known at rule production time, we need
to evaluate the glob given to `copy_files`. Which requires doing a
call to `Load_rules.load_dir` as previously described. Contrary to the
other form of dependency we just describe, this ones can go from any
directory to any other directory. For instance, the following stanza
in `src/dune`:

```
(copy_files foo/*.txt)
```

would create a dependency from `_build/default/src` to
`_build_default/src/foo`.

So in the end, if we were looking at the internal computation graph of
Dune and narrowing it to just the calls to `Load_rules.load_dir`, we
would see a graph with many edges going from a directory to one of its
ancestor. These would mostly be between generated dot-subdirectories
and their first ancestor that has a corresponding directory in the
source tree. Plus a few other arbitrary ones for each `copy_files`
stanza.

## Directory targets

Before directory targets, answering the question "what rules produces
file X?" was easy. Dune would just call `Load_rules.load_dir` and
lookup `X` in the result. With directory targets, things are a bit
more complicated. Indeed, `X` might also be produced by a directory
target in an ancestor directory. This means that `Load_rules.load_dir`
now need to look in parent directories as well, which introduce more
dependencies from directories to their parents and can create cycles
because of `copy_files` stanza that create dependencies in the other
direction.

At a result, some combinations of `copy_files` and directory targets
don't produce the expected result. This is documented in the test
suite.

## Problem

The above description exposes a concrete problem with directory
targets, but there is also a general sense of messiness in the way
things work. Generating rules for multiple directories at once is
natural, but the current encoding is odd.

## Proposal

The proposal is to add the following rule: `gen_rules ~dir` is allowed
to produced rules in `dir` or any of its descendant only. It is not
allowed to produce rules anywhere else.

`Load_rules.load_dir ~dir` will then always call itself recursively on
the parent of `dir` and take the union of the rules produced by
`gen_rules` for `dir` and the ones produced by the recursive
call. `gen_rules` will no longer have to redirect a call via
`Load_rules.load_dir_and_produce_its_rules`, which we would simply
remove.

This introduces a cycle with all `copy_files` stanza that copy files
from a sub-directory. We propose the break this cycle by introducing
laziness in the rule production code.

### Generating rules with a mask

The idea is that when we produce rules, we will produce rules under
a current active "mask" that tells us where we are allowed to generate
files or directories. Trying to produce a rule with targets not
matched by this mask will be a runtime error.

When entering `gen_rules ~dir`, the initial mask will be: "any files
and directories that is a descendant of directory `dir`".

We can then narrow the mask to a sub-mask:

```ocaml
val narrow : Target_mask.t -> unit Memo.Build.t -> unit Memo.Build.t
```

With `narrow mask m`, `m` would only be allowed to produce rules whose
target are matched by the intersection of `mask` and the current
mask. `m` wouldn't be evaluated eagerly. Instead, `gen_rules` would
now return a set of direct rules as well as a list of
`(Target_mask.t * unit Memo.Build.t)`. Let's call such a pair a
suspension. A suspension can be forced by evaluation its second
component. Doing so will yield a list of rules matched by the mask and
a new list of suspension.

### Staged rules loading

The next step is to stage `Load_rules.load_dir`. In addition to taking
a directory, `load_dir` will now also take a mask and will return the
set of rules for this mask. To do that, it might need to force a bunch of
suspensions recursively.


### How does that help?

We will put `copy_rules` under a `narrow <only file targets in current
dir>`. In order to determine if a directory is part of a directory
target in an ancestor directory, we wouldn't need to force this
suspension.

### Difficulties

Interpreting a `library` stanza requires knowing the set of `.ml`
files in the current directrory. Knowing this requires interpreting
`copy_files` in the current directory. So the interpretation of
`library` stanzas will need to go under a `narrow` as well.