Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build rust code only once per platform in a CI run #7047

Merged
merged 5 commits into from
Jan 10, 2019

Conversation

benjyw
Copy link
Contributor

@benjyw benjyw commented Jan 9, 2019

  • Creates greater symmetry between Linux and OSX.
  • Each platform has a bootstrap stage job that builds the native engine, a single-platform pants.pex and fs_util.
  • The jobs in the test stage use the prebuilt pants.pex for the relevant platform (and/or the native_engine.so embedded in it).
  • The Travis cache config has been refactored so that the bootstrap jobs only cache dirs related to building rust code, and test jobs only cache dirs related to running pants (e.g., the ivy2 cache).
  • Removes the -k option from ci.sh. It was never used in practice (e.g., ccdf523 inverted the sense of the flag, but didn't change what it actually does, which seems like ample evidence that it isn't needed).
  • Adds a -f option to release.sh, to build fs_util. Now a dry-run release no longer automatically builds it. This allows us to build it only in the bootstrap stage.
  • Ensures that each shard has a separate Travis cache.
  • The .travis.yml changes utilize a little-known but very useful feature of mustache, where if a partial is invoked on a line by itself then on substitution, every line in it receives the indentation of that line.

@benjyw benjyw force-pushed the ci_consolidation2 branch from 5fcd0f1 to 01cd87d Compare January 9, 2019 04:35
@benjyw
Copy link
Contributor Author

benjyw commented Jan 9, 2019

Reviewers: please ensure that the logic is sound, particularly around the rust builds. It probably makes sense to examine the CI run for this PR, to double-check that it does what we think it does.

@benjyw
Copy link
Contributor Author

benjyw commented Jan 9, 2019

Note that there is probably some beneficial fine-tuning of the Travis caches that we can do in a followup change. Right now I erred on the side of caching less than before, because our caches were huge, and the benefit of those is questionable.

Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks! Particularly the cache splitting and shrinking. As you note, I suspect we'll want to add some more things back to the caches, but we should consider that on a case by case basis :)

(In particular, I've been experimenting with avoiding the bootstrap shard re-compiling rust when rust code hasn't changed; I've got it mostly working in #7044 - just one more kink to work out :))

.travis.yml Outdated
-v "${HOME}:/travis/home"
-v "${TRAVIS_BUILD_DIR}:/travis/workdir"
travis_ci:latest
sh -c "git clean -xfd && ./build-support/bin/ci.sh && ./build-support/bin/release.sh -f"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we git clean -xfd here and on the next shard?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. It was there from before. Probably unnecessary, since we work on a clean clone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's drop it unless we can justify it; it can only confuse things...

env:
- CACHE_NAME=macos10.12sanity
script:
- MODE=debug ./build-support/bin/travis-ci.sh -bm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up TODO: swap -b from "don't bootstrap" to "do bootstrap" so it's only present for the bootstrap shards (I'll happily do it, but just noting)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If and when you do, note that the bootstrapping, and ONLY the bootstrapping, should run before we set RUN_PANTS_FROM_PEX=1. Any other logic must follow this line, or else that logic will likely end up bootstrapping again.

@@ -221,325 +259,359 @@ matrix:
repo: pantsbuild/pants

# Deploy Pex Unstable to s3
- name: "Deploy Pants PEX Unstable"
- name: "Deploy unstable multiplatform pants.pex"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idle question for the future (definitely not for now): Would it make sense to merge the stable and unstable pex publish jobs? The difference between the two pexes is basically a handful of strings and hashes, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually know. I was out of the loop when this was originally implemented.

name: "Rust Clippy on Linux"
name: "Linux Rust Clippy"
env:
- CACHE_NAME=linuxclippy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job should use the native_engine_cache_config (and maybe share it with the linux bootstrap shard)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the travis documentation, sharing caches between jobs is a no-no. Will add the cache config though.

Is this job recompiling a lot of rust code that the bootstrap job already compiled? Would it make sense to move the clippy run to there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's re-compiling all of it, but in a way which is different (both cheaper and more expensive) than the bootstrap job. It's doing compiler-integrated linting, which adds extra instrumentation (more expensive), but also doesn't actually do full codegen or linking (cheaper). It probably makes sense to merge it with the rust test running shard, actually...

name: "Rust Tests Linux"
name: "Linux Rust tests"
env:
- CACHE_NAME=linuxrusttests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job should use the native_engine_cache_config (and maybe share it with the linux bootstrap shard)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. Again, is this recompiling unnecessarily? Is it better to move these tests into the bootstrap shard. Or - is it possible to propagate the bootstrapped rust build state via S3 (like we do for the pex) so that this job isn't starting from a clean rust workspace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests take a non-trivial amount of time, so we probably don't want to merge them into the bootstrap shard...

The build state we'd want to share is basically the entire cache (~/.cache/pants/rust + src/rust/engine/target), but it's probably noticeably simpler for us to just re-do the work in this shard than to do custom cross-shard caching...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much time might it save to run these on that build state, rather than on clean state? I mean, the whole thing is ~12 minutes from clean right now... And clippy is about the same, AFAICT. And that's without the native_engine_cache_config. So you're probably right that it's not worth worrying about...

- name: "Rust + Platform-specific Tests OSX"
os: osx
- <<: *default_osx_test_config
name: "OSX Rust + platform-specific tests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should split this shard back into two; one for rust tests (ci.sh -e) which uses the rust cache config, and possibly shares its cache with the bootstrap shard, and one which runs platform-specific tests (ci.sh -z) and uses the run cache config.

The shards were merged because the cost of bootstrapping rust was high, but now that that's ameliorated, we should split them so that they can benefit from more appropriate caching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this note, should we split Self checks, lint, and JVM tests into two shards? Self checks and lint seem like one logical shard, and JVM tests another logical shard. Will make it easier for us to fail more quickly if there's a linting error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but I'd rather do that in a separate change, since this one is already hefty.

Also, as above, we can't share Travis caches, but we can maybe share the built rust state via S3? I don't know enough about what that state is in order to do so correctly, but either way, let's address in a followup?

# TODO: Figure out why we have such large caches (2-7GB) and try to trim them.
timeout: 500
directories:
- ${HOME}/.cache/pants/rust/cargo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could shave 100M off this dir by replacing this with:

- ${HOME}/.cache/pants/rust/cargo/.crates.toml
- ${HOME}/.cache/pants/rust/cargo/git
- ${HOME}/.cache/pants/rust/cargo/registry
- ${HOME}/.cache/pants/rust/cargo/bin/cargo-ensure-installed
- ${HOME}/.cache/pants/rust/cargo/bin/cargo-ensure-prefix

at the expense of:

  1. Hard coding some internals of cargo that may change
  2. Hard-coding a list of things we cargo install (so that if we add more in the future, we'd need to update this list, or they'd be re-compiled)

I'm not sure what the cost of 100M here is, so can't really comment on the trade-off

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is safe to cache ${HOME}/.cache/pants/python_cache, which has subfolders for interpreters and for requirements. This will allow us to avoid redownloading Python requirements every time, as we cache the wheels. Same with interpreters, as it will cache wheel, setuptools, and the python executable.

The two situtations where the cache would be invalid are:

  1. modified requirements.txt. Our cache code will know that the cache is invalidated, though, so this is fine.
  2. changing the name of the venv folder, as done in Allow Pants to run with Python 3 via ./pants3 script #6959. Here, we can manually delete the cache, get the PR merged, and then the new venv folder names will be cached correctly.

Edit: after inspecting Travis, I think it's okay and possibly preferable to keep these out of the cache, as the only two shards that will benefit are the native engine builders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but again - let's experiment with this in a followup change whose only purpose is further cache tweaking? This was a tough change, and I'm leery of perturbing it any further when it's already a win as-is.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crazy to see how much faster CI is now. Thanks Benjy! Cool how PEX makes this all possible.

# TODO: Figure out why we have such large caches (2-7GB) and try to trim them.
timeout: 500
directories:
- ${HOME}/.cache/pants/rust/cargo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is safe to cache ${HOME}/.cache/pants/python_cache, which has subfolders for interpreters and for requirements. This will allow us to avoid redownloading Python requirements every time, as we cache the wheels. Same with interpreters, as it will cache wheel, setuptools, and the python executable.

The two situtations where the cache would be invalid are:

  1. modified requirements.txt. Our cache code will know that the cache is invalidated, though, so this is fine.
  2. changing the name of the venv folder, as done in Allow Pants to run with Python 3 via ./pants3 script #6959. Here, we can manually delete the cache, get the PR merged, and then the new venv folder names will be cached correctly.

Edit: after inspecting Travis, I think it's okay and possibly preferable to keep these out of the cache, as the only two shards that will benefit are the native engine builders.

services:
- docker
env:
- CACHE_NAME=linuxpexbuild
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: linux_pex_build is easier on the eye

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python interpreter was recently unsafe to cache because the call site makes a symlink. Practically fine if you use the system /usr/bin/python or something, but breaks for us since we bootstrap our interpreter from a virtualenv.

I haven't looked to see if that behavior has changed in the last couple months, but I ended up moving the interpreter cache under .pants.d. That allows me to continue caching all of XDG_CACHE_HOME/pants.

Since this is for Pants CI, caching is likely fine, but FYI

# We request the oldest image we can (corresponding to OSX 10.11) for maximum compatibility.
# We use 10.11 as a minimum to avoid https://github.com/rust-lang/regex/issues/489.
# See: https://docs.travis-ci.com/user/reference/osx/#OS-X-Version
osx_image: xcode8
env:
- CACHE_NAME=macospexbuild
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: macos_pex_build

- name: "Rust + Platform-specific Tests OSX"
os: osx
- <<: *default_osx_test_config
name: "OSX Rust + platform-specific tests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this note, should we split Self checks, lint, and JVM tests into two shards? Self checks and lint seem like one logical shard, and JVM tests another logical shard. Will make it easier for us to fail more quickly if there's a linting error.

@benjyw
Copy link
Contributor Author

benjyw commented Jan 9, 2019

I've handled the comments that are easily addressable here, but I'd prefer to punt on some of the more intricate ones (splitting jobs, tweaking the cache dirs, possibly propagating rust build state from job to job) to a followup change, as this one was already several days in the making, and I'd like to checkpoint the progress so far.

@illicitonion
Copy link
Contributor

Awesome, let's merge! :D Thanks!

@Eric-Arellano
Copy link
Contributor

Agreed on breaking out the proposed changes in separate PRs. This is a good scope and looks good to merge!

I would still prefer renaming all the cache names to use snake_case as it's more readable than nospacing and it mirrors Python conventions.

@illicitonion
Copy link
Contributor

Shard 34 seems to have appeared, called just "Ruby", and I can't work out what it is from the config... Any ideas?

@benjyw benjyw merged commit 5acaf84 into pantsbuild:master Jan 10, 2019
@benjyw benjyw deleted the ci_consolidation2 branch January 10, 2019 17:20
Eric-Arellano added a commit that referenced this pull request Jan 17, 2019
Prior to #7047 and #7012, a significant portion of CI was spent on repeatedly bootstrapping Pants. So, we combined some jobs that did not have much logical connection in order to save CI time.

In general, it is acceptable to combine unrelated jobs into the same shard when there are benefits, as it does still avoid some shard startup cost. 

Here, however, we get some benefits from splitting these two shards into four.

* "Lint, self-checks, and JVM tests" -> "Lint and self checks" + "JVM tests"
    - Lint fails more quickly. This is one of the shards most likely to fail, so we want to surface the failure quickly. Meanwhile, we do not often modify JVM code, so JVM tests are far less likely to fail from the average PR.
   - Allows us to move JVM tests later in the CI run, as again they are less likely to fail from the average PR.
* "OSX Rust + platform-specific tests" -> "OSX Rust tests" + "OSX platform specific tests"
   - Allows us to mark "OSX Rust tests" as no pex, which saves some CI time by not installing `aws-cli` and simplifies `.travis.yml` to not need `py2` and `py3` variants of the shard.
   - Allows us to extract a common `base_rust_tests` config between `linux_rust_tests` and `osx_rust_tests`.
   - Thanks to the extracted `base_rust_tests`, we now setup `native_engine_cache` for the `osx_rust_tests` shard, which we did not have before. This caches the rust bootstrap.
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Jan 17, 2019
Thanks to the changes made in pantsbuild#7047 and pantsbuild#7012, we went from bootstrapping in nearly every shard to only bootstrapping in the first 4 shards. So, we flip the meaning to be a positive rather than negative flag.
Eric-Arellano added a commit that referenced this pull request Jan 18, 2019
Thanks to the changes made in #7047 and #7012, we went from bootstrapping in nearly every shard to only bootstrapping in the first 4 shards. So, we flip the meaning to be a positive rather than negative flag.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants