
[7.4.1] Files created on disk from remote cache or build are marked dirty on first rebuild #24763

Closed
Bradshawz opened this issue Dec 19, 2024 · 14 comments
Labels: P1 (I'll work on this now. Assignee required), team-Performance, team-Remote-Exec, type: bug

Comments

@Bradshawz

Description of the bug:

We currently use BwoB (toplevel) with remote build/remote cache, but do our linking locally using build --strategy=CppLink=local. Because of this, we get a large number of files in our output directory. On the first rebuild after any build, Bazel marks all the downloaded files as changed outputs; this does not happen on the second rebuild. This can cause the first rebuild to take a very long time.

I tried to track down where the issue is coming from, and it seems to be this check:

if (!trustRemoteValue && fileMetadata.couldBeModifiedSince(lastKnownData)) {

On the first rebuild, fileMetadata.couldBeModifiedSince(lastKnownData) does not seem to successfully compare anything. It appears to try to compare the FileArtifactValue digests, and if that fails it falls back to comparing the FileContentsProxy values. On the first rebuild, fileMetadata has a proxy but no digest, while lastKnownData has a digest but no proxy, so neither comparison is possible and the file is marked as modified. On the second rebuild, fileMetadata still does not have a digest, but lastKnownData now has a proxy, so the proxy comparison succeeds and the file isn't marked as modified.
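
For illustration, here is a minimal sketch of the comparison logic as described above; the record, field, and method names are made up for this sketch and are not the actual Bazel classes.

```java
import java.util.Arrays;
import java.util.Objects;

// Simplified stand-in for the metadata comparison described above;
// illustrative only, not the actual Bazel implementation.
record FileMetadataSketch(byte[] digest, Long contentsProxy) {

  boolean couldBeModifiedSince(FileMetadataSketch lastKnown) {
    // Compare digests when both sides have one.
    if (digest() != null && lastKnown.digest() != null) {
      return !Arrays.equals(digest(), lastKnown.digest());
    }
    // Otherwise fall back to comparing the FileContentsProxy-style values.
    if (contentsProxy() != null && lastKnown.contentsProxy() != null) {
      return !Objects.equals(contentsProxy(), lastKnown.contentsProxy());
    }
    // First rebuild: the on-disk metadata has only a proxy while the last
    // known (remote) metadata has only a digest, so neither comparison is
    // possible and the file is conservatively reported as modified.
    return true;
  }
}
```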

Which category does this issue belong to?

Performance, Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Using a build that either gets its files built remotely or fetched from a remote cache and then downloaded locally:
bazel clean
bazel build //src:someTarget
bazel build //src:someTarget
In the java log for this first rebuild, there should be something like:
241219 17:47:12.558:I 42 [com.google.devtools.build.lib.skyframe.SequencedSkyframeExecutor.detectModifiedOutputFiles] Found 1 modified files from last build
bazel build //src:someTarget
The second rebuild instead logs:
241219 18:08:46.021:I 42 [com.google.devtools.build.lib.skyframe.SequencedSkyframeExecutor.detectModifiedOutputFiles] Found 0 modified files from last build

Which operating system are you running Bazel on?

Ubuntu 22.04

What is the output of bazel info release?

release 7.4.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

I found #22367, but it seems to be the opposite issue, where files that weren't downloaded were still being marked as dirty.

Any other information, logs, or outputs that you want to share?

No response

@github-actions bot added the team-Performance and team-Remote-Exec labels Dec 19, 2024
@fmeum (Collaborator) commented Dec 19, 2024

@coeuvre

@oquenchil added the P1 label and removed the untriaged label Jan 7, 2025
@coeuvre (Member) commented Jan 7, 2025

Did you enable the local action cache (--use_action_cache)? In that case, even if we invalidate the Skyframe nodes, the actions should still hit the action cache.

Can you share a JSON trace profile for the slow first incremental build?

@coeuvre (Member) commented Jan 7, 2025

I created https://bazel-review.googlesource.com/c/bazel/+/268970 to not mark Skyframe nodes as dirty. However, a preliminary benchmark shows the performance is somewhat worse because we need to digest the outputs during invalidation. Can you check whether that patch helps?

@Bradshawz (Author) commented Jan 7, 2025

--use_action_cache is on by default, right? I explicitly turned it on just in case. I've attached two rebuild profile examples.

This is the slowest example where we don't use fusion and don't skip creating intermediate archives (.a files), causing the final output binary to be ~8GB. It seems your patch didn't help very much in this case.
SlowestRebuild.json.gz
SlowestRebuildPatch.json.gz

This is a more typical developer example where we are using fusion and don't combine the debug symbols, so the final binary is ~500MB. Your patch helped a lot in this case.
NormalSlowRebuild.json.gz
NormalSlowRebuildPatch.json.gz

@coeuvre (Member) commented Jan 8, 2025

--use_action_cache is on by default right?

It is enabled by default.

Thanks for the profiles! They really helped me understand the problem.

Because in this case, even if we invalidate the skyframe node, they should hit action cache.

In this case, the size and number of output files are large, so checking the action cache is a lot slower than comparing Skyframe nodes.

This is the slowest example where we don't use fusion and don't skip creating intermediate archives (.a files), causing the final output binary to be ~8GB. It seems your patch didn't help very much in this case.

Right, because my patch needs to digest the outputs, which costs less than checking the action cache but is still a lot.

Let me try another idea to further optimize this.

@coeuvre (Member) commented Jan 8, 2025

@Bradshawz: I have updated the patch to not compute digests but instead use the contents proxy to compare local and remote metadata. Please try it again: https://bazel-review.googlesource.com/c/bazel/+/268970.

@Bradshawz (Author)

I don't know if it matters that I applied this on 7.4.1, but I'm getting similar results and similar-looking profiles to a completely unpatched build.

@coeuvre (Member) commented Jan 8, 2025

Maybe most outputs in the build graph are tree outputs? I think the patch doesn't handle tree outputs efficiently.

@Bradshawz (Author) commented Jan 8, 2025

Bleh, that looks like it was on me; I didn't end up removing:

if (!(o instanceof RegularFileArtifactValue lastKnown)) { return true; }

So it was skipping the new code.

I'll edit this comment with the new numbers when I get them.

Edit:
Both the slowest and the normal examples took less than a second; it looks like your fix resolved it.

@coeuvre (Member) commented Jan 9, 2025

Cool! It's good to hear the patch fixed it.

I think the patch doesn't handle tree outputs efficiently.

I will continue polishing the patch to handle tree outputs before submitting it. I will also backport it to the 7.4 branch afterwards.

@coeuvre (Member) commented Jan 14, 2025

@bazel-io fork 7.5.0

copybara-service bot pushed a commit that referenced this issue Jan 15, 2025
... if the corresponding output is materialized afterwards and hasn't been changed.

Failing to do so causes an incremental build to re-check the action cache for actions whose outputs were materialized during the last build. This can be very slow if the outputs are large.

We achieve this by storing `FileContentsProxy` in the remote metadata after materialization. During the dirtiness check, we compare the remote metadata and the corresponding local metadata by their `digest`, or by their `proxy` if a digest is not available for the local metadata.

A user reported that the wall time of their very first incremental build improved with this change from `14s` to less than `1s` in the common case, and from `115s` to less than `1s` in the worst case. #24763

PiperOrigin-RevId: 715734718
Change-Id: Id1a1a59d8b5f3e91a7ae05a73663ff37eee6d163
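
For illustration, a minimal, self-contained sketch of the approach this commit message describes; the class names, fields, and the ContentsProxy stand-in are assumptions for this sketch, not the actual Bazel implementation.

```java
import java.util.Arrays;
import java.util.Objects;

// Illustrative sketch of the fix described above; all names are assumptions,
// not the actual Bazel types.
final class RemoteOutputDirtinessSketch {
  // Stand-in for FileContentsProxy (cheap ctime/node-id style metadata).
  record ContentsProxy(long ctimeMillis, long nodeId) {}

  // Remote metadata kept in Skyframe: the digest is always known, and after
  // the output is materialized on disk we additionally record its proxy.
  record RemoteMetadata(byte[] digest, ContentsProxy proxy) {}

  // Local metadata observed at the start of the next build: stat-ing the file
  // yields a proxy but no digest, since digesting large outputs is costly.
  record LocalMetadata(byte[] digest, ContentsProxy proxy) {}

  static boolean couldBeModifiedSince(LocalMetadata local, RemoteMetadata lastKnown) {
    if (local.digest() != null) {
      // Prefer digests when the local side has one.
      return !Arrays.equals(local.digest(), lastKnown.digest());
    }
    // Otherwise compare proxies; storing the proxy in the remote metadata at
    // materialization time is what lets this branch succeed on the first
    // incremental build, so the output is no longer marked dirty.
    return !Objects.equals(local.proxy(), lastKnown.proxy());
  }
}
```

The key point is that the proxy recorded at materialization time gives the first incremental build a cheap basis for comparison, so large outputs do not need to be re-digested or re-checked against the action cache.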
@coeuvre (Member) commented Jan 16, 2025

@bazel-io fork 8.1.0

@coeuvre closed this as completed Jan 16, 2025
coeuvre added four commits to coeuvre/bazel that referenced this issue Jan 16, 2025, each carrying the same commit message as above.
github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2025
…ck (#24941): same commit message as above.

Fixes #24920.
@iancha1992 (Member)
A fix for this issue has been included in Bazel 7.5.0 RC2. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.5.0rc2. Thanks!

@Bradshawz (Author)
Tried it out, works great. Thanks!

fmeum pushed a commit to fmeum/bazel that referenced this issue Jan 29, 2025, with the same commit message as above.
coeuvre added a commit to coeuvre/bazel that referenced this issue Jan 29, 2025, with the same commit message as above.
github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2025
…ck (#25126): same commit message as above.

Fixes #24940.