Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition when installing several dune packages at the same time #2131

Closed
kit-ty-kate opened this issue May 7, 2019 · 25 comments
Closed

Comments

@kit-ty-kate
Copy link
Member

opam-health-check (check.ocamllabs.io) detected a transient failure. While building one of the packages, the following error happened:

#=== ERROR while compiling cppo_ocamlbuild.1.6.0 ==============================#
# context              2.0.3 | linux/x86_64 | ocaml-base-compiler.4.05.0 | file:///home/opam/opam-repository
# path                 ~/.opam/4.05.0/.opam-switch/build/cppo_ocamlbuild.1.6.0
# command              ~/.opam/4.05.0/bin/jbuilder build -p cppo_ocamlbuild -j 71
# exit-code            1
# env-file             ~/.opam/log/cppo_ocamlbuild-1-52c930.env
# output-file          ~/.opam/log/cppo_ocamlbuild-1-52c930.out
### output ###
# [...]
# File "/home/opam/.opam/4.05.0/lib/result/dune-package", line 1, characters 0-0:
# Error: Invalid first line, expected: (lang <lang> <version>)

The result package is not a dependency of the package failing and so is built and installed while cppo_ocamlbuild is building. I'm guessing what is happening here is that, dune is scanning for all packages currently installed and the dune-package file gets installed by opam in the meantime (first created, then filed up I suppose. cc @AltGr @rjbou). With the right timing dune parses the file in the interval it is created but not filled up, and fails.

The behaviour was observed with dune.1.9.1

@rgrinberg
Copy link
Member

I'm wondering if this issue is caused by the overly eager package loading in dune. To check this hypothesis, could you try and reproduce this issue with dune 1.9.2 and above? If this is indeed the issue, you will not observe this failure with these versions.

Now if this is in fact the issue, then 1.9.2+ do not actually fix it. They only mitigate it whenever variants aren't used. To truly fix this issue, dune should only ever try loading packages which are dependencies of the package that it is currently building.

Side note: In fact, this seems to be desirable behavior anyway from a sandboxing perspective: dune should not observe packages that are not dependencies (including transitive dependencies here).

Now let's brainstorm on how to fix this issue properly. First, I think we will need to implement the variants index discussed in the meeting yesterday. Also, I'm wondering if we should introduce the following restriction: a variant implementation is not visible unless it exists in one of the package dependencies of the package being built. Hence when we specify variants when building an executable, we may also add package names so that dune knows where to look for variant implementations.

@ghost
Copy link

ghost commented May 8, 2019

Implementing the per-virtual-lib variant index we discussed yesterday should be enough. Should we just completely disable variants until this is implemented?

@rgrinberg
Copy link
Member

Is it enough? Opam's installation isn't atomic so I'm thinking we could still have issues where the index is updated by the package still isn't fully installed (or vice versa) for example.

@ghost
Copy link

ghost commented May 8, 2019

BTW, @kit-ty-kate we were thinking of marking 1.9.0 and 1.9.1 as unavailable in opam once 1.9.2 is released

@rgrinberg
Copy link
Member

Should we just completely disable variants until this is implemented?

I'm not sure this is necessary as we've shipped virtual libraries in an incomplete state before (behind a feature flag as well). So I think it's ok to leave them as 0.1.

@ghost
Copy link

ghost commented May 8, 2019

@rgrinberg the index is never updated. It is constructed once and for all when building the virtual library and will live in the (library ...) stanza inside the dune-package of the virtual lib

@ghost
Copy link

ghost commented May 8, 2019

I'll write down the concrete proposal we discussed to make sure we are all on the same page about this

@ghost
Copy link

ghost commented May 8, 2019

I wrote down the new proposal: #2134

@rgrinberg
Copy link
Member

I believe this can be now closed as we've fixed this in 1.9.3 and made variants experimental.

@tomjack
Copy link

tomjack commented Nov 4, 2019

I have observed this error twice, using dune 1.11.4.

To truly fix this issue, dune should only ever try loading packages which are dependencies of the package that it is currently building.

we've fixed this in 1.9.3

Was it truly fixed? :)

@ghost
Copy link

ghost commented Nov 5, 2019

Do you have a reproduction case we could look at?

@tomjack
Copy link

tomjack commented Nov 6, 2019

I don't know how useful these logs will be to you, but you can see three examples here:
https://gitlab.com/ligolang/ligo/-/jobs/341369966
https://gitlab.com/ligolang/ligo/-/jobs/333174151
https://gitlab.com/ligolang/ligo/-/jobs/343731757

To find the error, you can search for "ERROR while compiling" on the page.

In the second case, the job was retried, and succeeded: https://gitlab.com/ligolang/ligo/-/jobs/341466583

The third case just happened (thankfully, outside of the special "debian 9" job, I guess that was a red herring.)

The thing our build script was doing when the failure happened was:

opam install -y --deps-only --with-test $(find src vendors -name \*.opam)

You might notice in the logs a problem with our custom opam repository:

[WARNING] The repository ... doesn't have a 'repo' file, and might not be compatible with this version of opam.
[NOTE] Repository at ... doesn't define its version, assuming it's 1.2.

I wonder if that could somehow cause this? I have just added the missing repo file, will wait and see if it happens again...


In all cases, the failure:

  • happened while building ocplip-resto (opam, dune)
  • was caused by a blank dune-package file for tezos-protocol-environment-sigs (opam, dune)

I haven't been able to reproduce it locally yet, and I'm not sure what other kinds of information would be relevant. I am happy to try to get more details for you, if you have a clue where to look...

I might try some more to reproduce locally, but, my current plan is to simply reduce the frequency with which our CI does this step. That would be helpful for other reasons too.

@ghost
Copy link

ghost commented Nov 7, 2019

Just a thought: do the package for which you are observing issues have dependencies? i.e. do they have (optional) or (select ...) in their jbuild/dune files? and if yes, are these properly reported in the opam files?

If there are optional dependencies that are not present in the opam file, then that would explain the race condition.

@tomjack
Copy link

tomjack commented Nov 21, 2019

I couldn't find any optional or select anywhere. I linked the dune and opam files for the two obvious suspect libraries above.

I haven't seen this happen again yet, since I fixed this problem with our custom opam repository:

[NOTE] Repository at ... doesn't define its version, assuming it's 1.2.

Still waiting to see if the issue happens again...

@kit-ty-kate
Copy link
Member Author

@tomjack we've tracked down the issue and I've opened a new ticket in #2913

@tomjack
Copy link

tomjack commented Nov 25, 2019

Astounding! Thanks all!

@AltGr
Copy link
Member

AltGr commented Nov 25, 2019

@diml: how about making dune aware of opam locks, i.e. taking a read-lock on <pfx>/.opam-switch/lock if it exists before loading the libs, and a write lock before processing dune install ?
It could make dune and opam play more nicely together in this kind of scenarios.

@rgrinberg
Copy link
Member

rgrinberg commented Nov 27, 2019 via email

@ghost
Copy link

ghost commented Nov 27, 2019

I don't like the idea of a read lock either. Regarding a write lock, my first thought is about people who do sudo make install and so that seems a bit dodgy.

@rgrinberg
Copy link
Member

Why does sudo make a difference for the write lock? Are you worried that the lock wouldn't be removable by a normal user if something bad happens?

@ghost
Copy link

ghost commented Nov 28, 2019

Yep. I also feel like it's a workaround for a real issue that should be fixed. Dune should absolutely not be looking at library files from things that are not dependency of what is being built.

@rgrinberg
Copy link
Member

Yeah, but I'm not sure if this possible with our current design. The library database is available for querying when setting up the builds rules. So this means that if we have a directory with rules look like this:

(library
 (name foo)
 (modules ())
 (libraries bla))

(library
 (public_name bar)
 (libraries blarg))

And we're installing bar, the look up to bla is going to happen regardless.

The way to avoid the look up to bla would be to attach package to library.

@ghost
Copy link

ghost commented Nov 28, 2019

Or do the lookup lazily? i.e. instead of using the Or_exn.t monad for dealing with library resolution, we use a lazy or_exn monad. In fact, maybe we should introduce a dedicated abstract monad to deal with library resolution. That would give us more flexibility, in particular if we want to make it use Fiber at some point

@rgrinberg
Copy link
Member

I'm not really sure how the lookup can be done lazily.

As I understand the situation, we setup rules for a directory all at once. So anyway to make things lazier here would require us to skip setting up the rules for foo. I don't how we can determine if foo is going to be required for other @install targets in the currently installed package.

@ghost
Copy link

ghost commented Dec 16, 2019

This is what I had in mind: right now, we do the library resolution inside the Or_exn monad. So a resolved library has type Lib.t Or_exn.t, a closure has type Lib.t list Or_exn.t, etc... While setting up the rules, we only manipulate _ Or_exn.t values when it comes to library management. Then we connect this to the Build.t selective so that the Or_exn.t value is opened at rule execution time and any error is raised at this point, only breaking the build hard if the rule is indeed executed.

So in fact, even dune-package file reading errors shouldn't break the build hard if the corresponding rules are not needed. This might need a bit more plumbing though.

Regarding dune not setting up rules it doesn't execute. I'm not entirely sure what it entails code-wise. It might be a serious refactoring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants