Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote execution build time regression with Bazel 5.1+ when action has many inputs #15872

Closed
clintharrison opened this issue Jul 13, 2022 · 6 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@clintharrison
Copy link
Contributor

clintharrison commented Jul 13, 2022

Description of the bug:

As of Bazel 5.1 (specifically, the change in #15091 -- cc @coeuvre and @Wyverald), we have seen an action take considerably more time to execute remotely. We previously saw <5 seconds spent in "upload missing inputs", and this PR has increased that to 90+ seconds in CI and several minutes on developer machines.

This action has around 600k inputs, though we see very long execution time even when only ~5 of them change. It appears that this is allocating much much more than previously, though I am not very familiar with rxjava, let alone its internals 😄

With these changes, it's also impossible to interrupt Bazel; this helpfully prints a stack trace of the thread though:

220713 16:07:12.594:W 330 [com.google.devtools.build.lib.util.ThreadUtils.warnAboutSlowInterrupt] Interrupt took too long. Dumping thread state.
220713 16:07:12.599:W 330 [com.google.devtools.build.lib.util.ThreadUtils.lambda$warnAboutSlowInterrupt$1] RUNNABLE <skyframe-evaluator 5 290>
	app//io.reactivex.rxjava3.core.Observable.subscribe(Observable.java:13176)
	app//io.reactivex.rxjava3.internal.operators.observable.ObservableRefCount.subscribeActual(ObservableRefCount.java:84)
	app//io.reactivex.rxjava3.core.Observable.subscribe(Observable.java:13176)
	app//io.reactivex.rxjava3.internal.operators.observable.ObservableSingleSingle.subscribeActual(ObservableSingleSingle.java:36)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//io.reactivex.rxjava3.internal.operators.single.SingleDoOnSubscribe.subscribeActual(SingleDoOnSubscribe.java:41)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//io.reactivex.rxjava3.internal.operators.single.SingleFlatMapCompletable.subscribeActual(SingleFlatMapCompletable.java:44)
	app//io.reactivex.rxjava3.core.Completable.subscribe(Completable.java:2859)
	app//io.reactivex.rxjava3.internal.operators.completable.CompletableDefer.subscribeActual(CompletableDefer.java:43)
	app//io.reactivex.rxjava3.core.Completable.subscribe(Completable.java:2859)
	app//io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle.subscribeActual(CompletableToSingle.java:37)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//com.google.devtools.build.lib.remote.util.AsyncTaskCache$Execution.subscribeActual(AsyncTaskCache.java:157)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//com.google.devtools.build.lib.remote.util.AsyncTaskCache.lambda$execute$1(AsyncTaskCache.java:309)
	app//com.google.devtools.build.lib.remote.util.AsyncTaskCache$$Lambda$925/0x00000008006a0040.subscribe(Unknown Source)
	app//io.reactivex.rxjava3.internal.operators.single.SingleCreate.subscribeActual(SingleCreate.java:40)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//io.reactivex.rxjava3.internal.operators.completable.CompletableFromSingle.subscribeActual(CompletableFromSingle.java:29)
	app//io.reactivex.rxjava3.core.Completable.subscribe(Completable.java:2859)
	app//io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle.subscribeActual(CompletableToSingle.java:37)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//io.reactivex.rxjava3.internal.operators.single.SingleResumeNext.subscribeActual(SingleResumeNext.java:39)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFlatMapSingle$FlatMapSingleSubscriber.onNext(FlowableFlatMapSingle.java:131)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable$IteratorSubscription.fastPath(FlowableFromIterable.java:185)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable$BaseRangeSubscription.request(FlowableFromIterable.java:129)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFlatMapSingle$FlatMapSingleSubscriber.onSubscribe(FlowableFlatMapSingle.java:106)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable.subscribe(FlowableFromIterable.java:69)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFromIterable.subscribeActual(FlowableFromIterable.java:47)
	app//io.reactivex.rxjava3.core.Flowable.subscribe(Flowable.java:15917)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableFlatMapSingle.subscribeActual(FlowableFlatMapSingle.java:53)
	app//io.reactivex.rxjava3.core.Flowable.subscribe(Flowable.java:15917)
	app//io.reactivex.rxjava3.internal.operators.flowable.FlowableCollectSingle.subscribeActual(FlowableCollectSingle.java:53)
	app//io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
	app//io.reactivex.rxjava3.internal.operators.single.SingleFlatMapCompletable.subscribeActual(SingleFlatMapCompletable.java:44)
	app//io.reactivex.rxjava3.core.Completable.subscribe(Completable.java:2859)
	app//io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1467)
	app//com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:101)
	app//com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1279)

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don't have a minimal repro readily available: if necessary I suspect we could reproduce this with rules_nodejs pretty easily given that node_modules tend to have a lot of input files. Let me know if this would be helpful!

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 5.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@sgowroji sgowroji added type: support / not a bug (process) untriaged team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jul 14, 2022
@meisterT meisterT added type: bug P1 I'll work on this now. (Assignee required) and removed type: support / not a bug (process) untriaged labels Jul 14, 2022
@coeuvre
Copy link
Member

coeuvre commented Jul 14, 2022

I will take a look at this and providing a repro with rules_nodejs is super helpful!

@clint-stripe
Copy link

@coeuvre it's not as obvious, but using https://github.com/bazelbuild/rules_nodejs/tree/stable/examples/create-react-app does appear to reproduce this pretty reliably. (Run bazel build //:build --profile=bazel.profile with remote execution to trigger it.)

I added a tiny bit of logging, and it looks like this action has ~30k inputs. This is much smaller than the action that caused me to report this, but it seems pretty clearly affected too.

With Bazel 5.2, the "upload missing inputs" step in the profile consistently takes > 1.5 sec on my 16 core/64gb dev machine. With the commit reverted, I haven't seen it take longer than 800ms. (Rather un-scientifically, I was just modifying the text in src/App.tsx to trigger a rebuild)

@coeuvre
Copy link
Member

coeuvre commented Jul 15, 2022

I can't reproduce this using https://github.com/bazelbuild/rules_nodejs/tree/stable/examples/create-react-app.

There are ~188K inputs for the final NpmPackageBin build action. I manually force this action to fail so we can repeated build this target without hitting the cache (and also trigger the upload inputs method).

With Bazel 5.2.0, the "upload missing inputs" step on my machine takes ~2.3 sec. But it's the same as Bazel 4.1.0 and Bazel 5.0.0.

@clint-stripe
Copy link

Hmm, that's definitely very strange. I added a simple log line to print allDigests.size(), and saw this for create-react-app:

[com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent] Ensuring 31974 action inputs are present for //:build

I made a more specific repro: see https://github.com/clint-stripe/action_many_inputs.
You can adjust how many files are created by the repo rule with the count param in WORKSPACE.

~/.bazel_binaries/5.2.0/bin/bazel test ... --config=remote --profile=/tmp/bazel-5.2.profile was run several times, also with bazel-bin/src/bazel-dev when the build w/ the revert was used, right on top of the 5.2.0 tag.

Using 100k, it's very obvious running with 5.2, or 5.2 with the PR reverted:

Stock 5.2:
image

image

With the revert:

image

image

In all cases, these are after Bazel has started up and the action has been run once, so we are comparing between runs where the inputs have already been uploaded and nothing about Bazel startup is included in the profile.

There's not a lot of detail in the profiles, but I've attached them in case it's helpful.

profiles.zip

@coeuvre
Copy link
Member

coeuvre commented Jul 15, 2022

Thanks for your repro! I think now I know where is the regression.

To "upload missing inputs", we need 3 steps:
a) collect digests that are new to Bazel.
b) call findMissingDigests for these collected digests.
c) upload missing digests.

We already uploaded missing inputs during previous builds, so c) should be identical for both cases. The mentioned change didn't touch findMissingDigests, so b) should also be identical. Then a) is the culprit.

I added more traces in the profile and tested with your provided repro. For the HEAD:

a) took ~3.4s and b) took ~3.1s.

Screenshot 2022-07-15 at 16 27 35

with the revert:

a) took ~83ms and b) took ~3s.

Screenshot 2022-07-15 at 16 22 40

The reason why I can't reproduce this before is probably b) took more time than a) (I am at home network) so I couldn't notice. BTW, which remote server did you test with?

I will work on the fix.

@clint-stripe
Copy link

Ah, awesome! Thank you so much for the quick reply -- really appreciate you taking a look here.

We have an internal remote execution cluster, described a bit in a recent blog post. My tests were on a cloud VM, hitting the cluster in the same region, so it makes sense we have different network situations here too :)

coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 15, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 16, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 16, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 16, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 22, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 22, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 27, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 27, 2022
coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 28, 2022
The regression was introduced in 702df84 where we essentially create a subscriber for each digest to subscribe the result of `findMissingBlobs`.

This change update the code to not create so many subscribers but maintain the same functionalities.

Fixes bazelbuild#15872.

Closes bazelbuild#15890.

PiperOrigin-RevId: 463826260
Change-Id: Id0b1c7c309fc9653a47c5df95c609b34e6510cde
ShreeM01 pushed a commit that referenced this issue Jul 28, 2022
The regression was introduced in 702df84 where we essentially create a subscriber for each digest to subscribe the result of `findMissingBlobs`.

This change update the code to not create so many subscribers but maintain the same functionalities.

Fixes #15872.

Closes #15890.

PiperOrigin-RevId: 463826260
Change-Id: Id0b1c7c309fc9653a47c5df95c609b34e6510cde
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants