Build failure from missing files when using --experimental_action_cache_store_output_metadata #13882

Closed
brentleyjones opened this issue Aug 20, 2021 · 14 comments
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged

Comments

@brentleyjones
Contributor

brentleyjones commented Aug 20, 2021

Description of the problem / feature request:

When using --experimental_action_cache_store_output_metadata, which was introduced in 4e29042, you can get build failures when going from using a remote cache to not using one.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

bazel build --remote_cache=grpc://... --remote_download_toplevel --experimental_action_cache_store_output_metadata -- //...
bazel build --remote_download_toplevel --experimental_action_cache_store_output_metadata -- //...

The second command will fail with errors like this:

(06:29:47) INFO: Writing explanation of rebuilds to 'tmp/logs/explanation.log'
(06:29:49) ERROR: /private/var/tmp/_bazel_iosci/4e089aeca56f6c6b5c9850816d5a6522/external/build_bazel_rules_swift/tools/worker/BUILD:85:10: Linking external/build_bazel_rules_swift/tools/worker/worker [for host] failed: (Aborted): cc_wrapper.sh failed: error executing command
  (cd /private/var/tmp/_bazel_iosci/4e089aeca56f6c6b5c9850816d5a6522/sandbox/darwin-sandbox/9/execroot/repo && \
  exec env - \
    APPLE_SDK_PLATFORM=MacOSX \
    APPLE_SDK_VERSION_OVERRIDE=11.3 \
    PATH=/usr/bin:/bin \
    RELATIVE_AST_PATH=true \
    XCODE_VERSION_OVERRIDE=12.5.1.12E507 \
    ZERO_AR_DATE=1 \
  external/local_config_cc/cc_wrapper.sh @bazel-out/host/bin/external/build_bazel_rules_swift/tools/worker/worker-2.params)
Execution platform: @local_config_platform//:host
 
Use --sandbox_debug to see verbose messages from the sandbox
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/worker/_objs/worker/worker_main.o'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/worker/libcompile_without_worker.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/worker/libcompile_with_worker.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/worker/libswift_runner.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/common/libbazel_substitutions.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/common/libprocess.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/third_party/bazel_protos/libworker_protocol_proto.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/common/libfile_system.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/build_bazel_rules_swift/tools/common/libpath_utils.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/com_github_protocolbuffers_protobuf/libprotobuf.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/com_github_protocolbuffers_protobuf/libprotobuf_lite.a'
clang: error: no such file or directory: 'bazel-out/host/bin/external/zlib/libzlib.a'

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

release 5.0.0-pre.20210810.4

@brentleyjones
Contributor Author

cc: @coeuvre

@oquenchil oquenchil added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged labels Aug 31, 2021
@brentleyjones
Contributor Author

Hi @coeuvre, just a ping on this, as it makes the flag unusable for us.

@coeuvre
Member

coeuvre commented Nov 8, 2021

What's the use case here for changing --remote_cache between builds?

The root cause is that those files weren't downloaded in the first build because of --remote_download_toplevel. In the next build, the actions could have downloaded these files from the remote server, but the remote cache is disabled.

I have a fix in hand but it requires invalidating all actions if --remote_cache is changed (it will still hit action cache though).

@coeuvre
Member

coeuvre commented Nov 8, 2021

And this issue can be reproduced without using --experimental_action_cache_store_output_metadata.

@brentleyjones
Contributor Author

We can't reproduce without using that flag though. We only encountered it with the flag, and after removing the flag we never encountered it again.

Our use case is that we disable the remote cache if we detect (with curl) that we can't connect to it, because Bazel takes way too long on each build if it can't connect to the cache (offline, VPN, etc.).
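A minimal sketch of such a wrapper, under stated assumptions: the cache address, the probe URL, and both function names are hypothetical, and only the bazel flags come from the commands at the top of this issue. The probe is passed in as a command name so the flag selection can be exercised without a network.

```shell
#!/usr/bin/env bash
# Hypothetical endpoints; replace with your own cache address and health URL.
REMOTE_CACHE_URL="${REMOTE_CACHE_URL:-grpc://cache.example.com:9092}"
REMOTE_CACHE_PROBE_URL="${REMOTE_CACHE_PROBE_URL:-https://cache.example.com/status}"

# Succeeds if the probe URL answers successfully within 2 seconds.
probe_remote_cache() {
  curl --silent --fail --max-time 2 --output /dev/null "$REMOTE_CACHE_PROBE_URL"
}

# Prints the remote-cache flags, one per line, when the probe succeeds,
# and prints nothing when it fails. $1 optionally names the probe command.
remote_cache_flags() {
  local probe="${1:-probe_remote_cache}"
  if "$probe"; then
    printf '%s\n' "--remote_cache=${REMOTE_CACHE_URL}" "--remote_download_toplevel"
  fi
}

# Usage (word splitting of the flags is intentional):
#   bazel build $(remote_cache_flags) -- //...
```

Switching flags between builds like this is exactly the enabled-to-disabled transition that triggers the failure reported here.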

@brentleyjones
Contributor Author

brentleyjones commented Nov 8, 2021

And invalidating the actions wouldn't be good. That would be a huge regression for us. There are cases when we still use the disk cache when disabling the remote cache.

@brentleyjones
Contributor Author

Also, I feel it's very valid to flip off the remote cache if your internet becomes too slow (on a hotspot for example). Since Dynamic Scheduling doesn't support remote cache, this can drastically speed up your build.

@coeuvre
Member

coeuvre commented Nov 9, 2021

This issue is similar to #8250, in that Bazel knows the file is on the remote server but can't get it (evicted or no remote endpoint). The only way to generate those missing files is to rerun the generating actions. So the real fix is to implement action rewinding.

A temporary fix, as I mentioned above, is to invalidate all actions if --remote_cache is changed from enabled to disabled. Note that "invalidate" doesn't mean Bazel will rerun all the actions, because we still have the action cache. Think of it as shutting Bazel down and rerunning the build command (but without the cost of re-analyzing the build configuration). In this case, Bazel will only rerun the generating actions if their outputs are not already available locally, which we have to do anyway.
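The invalidation condition can be sketched as a check on the --remote_cache value of the previous and current builds. This is an illustrative model only, not Bazel's implementation; the function name and its arguments are hypothetical.

```shell
# $1: --remote_cache value of the previous build ("" if it was unset)
# $2: --remote_cache value of the current build  ("" if it is unset)
# Prints "invalidate" only for the enabled -> disabled transition.
invalidation_on_cache_change() {
  local previous="$1" current="$2"
  if [ -n "$previous" ] && [ -z "$current" ]; then
    echo "invalidate"
  else
    echo "keep"
  fi
}
```

Under this model, invalidated actions are still checked against the action cache, so only actions whose outputs are missing locally would rerun.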

That said, I am curious why this didn't happen to you without --experimental_action_cache_store_output_metadata. Maybe I missed something? If you can share a simple repro, I could probably fix the specific case.

@brentleyjones
Contributor Author

I'll work on a repro.

@brentleyjones
Contributor Author

Here are some reproductions: https://github.com/brentleyjones/action_cache_store_output_metadata_bug

It seems that if you don't invalidate the analysis cache, you get a build failure with or without the flag (repo1.sh and repo3.sh). If you do invalidate the analysis cache, though, you only get a build failure with the --experimental_action_cache_store_output_metadata flag (repo2.sh and repo4.sh).

@coeuvre
Member

coeuvre commented Nov 10, 2021

Thanks for the repros!

For repo4.sh, actions are invalidated as well (because they depend on the analysis result). So in the final build, all action nodes are re-evaluated and checked against the action cache. Actions whose outputs were not downloaded during previous builds can't hit the action cache because their output metadata is missing, so they are re-executed. The outputs of those actions are regenerated locally, so we don't get the "file not found" error.

repo2.sh is the same as repo4.sh except that --experimental_action_cache_store_output_metadata is set, which means remote output metadata was saved in the action cache during previous builds. So in the final build, all action nodes are again re-evaluated and checked against the action cache. But this time, actions whose inputs haven't changed can still HIT the action cache even if their outputs were not downloaded.

repo1.sh and repo3.sh have the same root cause: the action nodes are not invalidated, so Skyframe reuses the in-memory values from the previous build; they are not even checked against the action cache. These values are remote values, so inputs should be downloaded before a local run. But the remote cache is not enabled, resulting in the "file not found" error.

The fix for repo2.sh is simple: just don't load remote output metadata from action cache if remote cache is not enabled.

The fix for repo1.sh and repo3.sh is, as I described above, invalidating action nodes if remote cache is changed from enabled to disabled.
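The action cache check for the repo2.sh and repo4.sh cases, with the fix applied, can be summarized in a small decision model. This is an illustrative sketch of the behavior described in this comment, not Bazel's actual code; the function name and its yes/no arguments are hypothetical.

```shell
# Models the action cache check for an action whose inputs are unchanged.
#   $1 outputs_local:          "yes" if the outputs exist on local disk
#   $2 remote_metadata_stored: "yes" if remote output metadata was saved in
#                              the action cache (the experimental flag was set)
#   $3 remote_cache_enabled:   "yes" if --remote_cache is set for this build
# Prints "hit" or "execute".
action_cache_decision() {
  local outputs_local="$1" remote_metadata_stored="$2" remote_cache_enabled="$3"
  if [ "$outputs_local" = "yes" ]; then
    echo "hit"
  elif [ "$remote_metadata_stored" = "yes" ] && [ "$remote_cache_enabled" = "yes" ]; then
    # Outputs exist only remotely, but they can still be fetched on demand.
    echo "hit"
  else
    # With the fix: stored remote metadata is ignored when no remote cache is
    # configured, so the action reruns and regenerates its outputs locally.
    echo "execute"
  fi
}
```

In this model, the final build of repo2.sh corresponds to `action_cache_decision no yes no`, which now prints `execute` instead of hitting the cache, and repo4.sh corresponds to `action_cache_decision no no no`.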

@brentleyjones
Contributor Author

brentleyjones commented Nov 18, 2021

@coeuvre did ed68933 and c9b7e22 fix this? If so, could those be cherry-picked into 5.0?

@coeuvre
Member

coeuvre commented Nov 24, 2021

Sorry for the delay. I just transferred to Munich and had to deal with travelling, jet lag etc.

Yes, I will cherry-pick these and other fixes into 5.0.

@brentleyjones
Contributor Author

Fixed in the 5.0 RC with #14321
