Running a big build on Remote execution causes Bazel to OOM #16913

Closed
BalestraPatrick opened this issue Dec 3, 2022 · 18 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@BalestraPatrick
Member

BalestraPatrick commented Dec 3, 2022

Description of the bug:

Bazel throws "FATAL: bazel ran out of memory and crashed." when running our build (Swift, Objective-C, C++) on remote execution. I can see the Bazel process using 65GB+ of RAM. I ran the build with --heap_dump_on_oom and can reproduce the crash consistently when running a full clean build on remote execution. If I retry the build, I can get it to complete after 2-3 OOM exceptions.

The full stacktrace is the following:

FATAL: bazel ran out of memory and crashed. An attempt will be made to write a heap dump to /private/var/tmp/me/9c146a5e97098b318f66f519df2b642d/a886b233-ed1b-4a65-b703-1a9ac9a64c3d.heapdump.hprof. Printing stack trace:
java.lang.OutOfMemoryError: Java heap space
    at java.base/jdk.internal.misc.Unsafe.allocateUninitializedArray(Unknown Source)
    at java.base/java.lang.invoke.StringConcatFactory$MethodHandleInlineCopyStrategy.newArray(Unknown Source)
    at java.base/java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(Unknown Source)
    at java.base/java.lang.invoke.LambdaForm$MH/0x00000008000c2c40.invoke(LambdaForm$MH)
    at java.base/java.lang.invoke.DelegatingMethodHandle$Holder.reinvoke_L(Unknown Source)
    at java.base/java.lang.invoke.LambdaForm$MH/0x00000008000c2040.linkToTargetMethod(LambdaForm$MH)
    at com.google.devtools.build.lib.vfs.PathFragment.getRelative(PathFragment.java:229)
    at com.google.devtools.build.lib.vfs.PathFragment.getRelative(PathFragment.java:196)
    at com.google.devtools.build.lib.vfs.Path.getRelative(Path.java:114)
    at com.google.devtools.build.lib.vfs.Root$PathRoot.getRelative(Root.java:92)
    at com.google.devtools.build.lib.actions.Artifact.getPath(Artifact.java:545)
    at com.google.devtools.build.lib.actions.ActionInputHelper.toInputPath(ActionInputHelper.java:174)
    at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder.lambda$buildFromActionInputs$1(DirectoryTreeBuilder.java:167)
    at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder$$Lambda$1106/0x000000080098d440.visit(Unknown Source)
    at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder.build(DirectoryTreeBuilder.java:252)
    at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder.buildFromActionInputs(DirectoryTreeBuilder.java:141)
    at com.google.devtools.build.lib.remote.merkletree.DirectoryTreeBuilder.fromActionInputs(DirectoryTreeBuilder.java:78)
    at com.google.devtools.build.lib.remote.merkletree.MerkleTree.build(MerkleTree.java:254)
    at com.google.devtools.build.lib.remote.RemoteExecutionService.buildInputMerkleTree(RemoteExecutionService.java:389)
    at com.google.devtools.build.lib.remote.RemoteExecutionService.buildRemoteAction(RemoteExecutionService.java:448)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:189)
    at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:299)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:152)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:112)
    at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:64)
    at com.google.devtools.build.lib.rules.cpp.CppCompileAction.beginExecution(CppCompileAction.java:1509)
    at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:936)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1103)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1061)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:160)

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build a big codebase on remote execution. Unfortunately I can't provide an example at the moment, but a full clean build is about 15K actions executed remotely (plus some locally as well of course).

Which operating system are you running Bazel on?

macOS 13.0

What is the output of bazel info release?

6.0.0rc2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

Something interesting is that passing --experimental_remote_merkle_tree_cache seems to work around the issue.

@sgowroji sgowroji added type: bug team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Dec 4, 2022
@vladmos vladmos added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Dec 6, 2022
@coeuvre
Member

coeuvre commented Jan 16, 2023

Does --experimental_remote_discard_merkle_trees make it better? #17120

@joeljeske
Contributor

I have --experimental_remote_discard_merkle_trees and --experimental_remote_merkle_tree_cache on 6.1.0 and still occasionally get Bazel OOM on large RBE builds.

@coeuvre
Member

coeuvre commented Mar 30, 2023

Can you try without --experimental_remote_merkle_tree_cache? --experimental_remote_discard_merkle_trees has no effect if --experimental_remote_merkle_tree_cache is set.

@joeljeske
Contributor

OK, yes, I will try that. That is unexpected to me; I do not think I understand how those flags function, then. I thought experimental_remote_merkle_tree_cache enabled caching the merkle trees in the remote cache, and that experimental_remote_discard_merkle_trees decided how long to keep them in memory during an action.

Is there anything I can read to understand this better?

@coeuvre
Member

coeuvre commented Mar 31, 2023

--experimental_remote_merkle_tree_cache allows Bazel to cache sub-Merkle trees in memory, and --experimental_remote_discard_merkle_trees tries to free Merkle trees as soon as they are no longer used. So they cannot be used together.

AFAICT, the best thing to read is the source code.
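
To make the trade-off between caching and discarding concrete, here is a minimal sketch of memoizing subtree digests in a Merkle tree. It is not Bazel's implementation: the `Dir` class, the `digest` function, and the hashing layout are all hypothetical and only illustrate the general technique. With a cache, a subtree shared by several actions is hashed once, but its entry stays in memory; without one, every action rehashes the subtree and nothing is retained afterwards.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Dir:
    """Hypothetical immutable directory node (not a Bazel type)."""
    files: tuple = ()  # tuple of (name, content bytes)
    dirs: tuple = ()   # tuple of (name, Dir)


def digest(node, cache=None, counter=None):
    """Return the hex digest of a directory, hashing children first.

    If `cache` is a dict, subtree digests are memoized in it.
    If `counter` is a one-element list, it counts directories actually hashed.
    """
    if cache is not None and node in cache:
        return cache[node]
    h = hashlib.sha256()
    for name, content in node.files:
        h.update(b"F" + name.encode() + b"\0" + hashlib.sha256(content).digest())
    for name, child in node.dirs:
        child_digest = digest(child, cache, counter)
        h.update(b"D" + name.encode() + b"\0" + bytes.fromhex(child_digest))
    if counter is not None:
        counter[0] += 1
    result = h.hexdigest()
    if cache is not None:
        cache[node] = result  # the entry is retained for later actions
    return result


# A shared subtree standing in for a large directory used by two actions.
shared = Dir(files=(("index.js", b"module.exports = 1"),))
action_a = Dir(files=(("a.cc", b"int a;"),), dirs=(("node_modules", shared),))
action_b = Dir(files=(("b.cc", b"int b;"),), dirs=(("node_modules", shared),))

uncached = [0]
digest(action_a, counter=uncached)
digest(action_b, counter=uncached)

cached = [0]
cache = {}
digest(action_a, cache, cached)
digest(action_b, cache, cached)

print("directories hashed without cache:", uncached[0])
print("directories hashed with cache:   ", cached[0])
```

In this sketch the cache saves work in proportion to how much of the input tree is shared between actions, at the cost of keeping cached entries alive for as long as the cache exists.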

@joeljeske
Contributor

joeljeske commented Apr 8, 2023

Thanks @coeuvre, I have flipped the flags accordingly and have given the Bazel server 96G of RAM. I still occasionally (~5%) see Bazel OOM on incremental remote builds. Oddly, I have not yet seen it on a cleaned-client build. I am looking through the server log for clues; I see very frequent full GCs, but I suppose that's to be expected before an OOM. I'm not yet able to retrieve the heap profile; the job times out before it's available.

Even more suspiciously, I often see these OOM events clustered together, occurring on different machines at similar times while building near-sequential commits in parallel on our trunk. I cannot tell whether something on the remote cluster (Buildbarn) is misbehaving and causing the clients to OOM, or whether the incremental builds on each of these Bazel servers get hung up on a change that each is attempting to process.

Would you have any suggestions?

Stack Trace

230407 00:21:10.454:WT 76288 [com.google.devtools.build.lib.concurrent.AbstractQueueVisitor.maybeSaveUnhandledThrowable] Found critical error in queue visitor
java.lang.OutOfMemoryError: Java heap space
	at com.google.common.hash.MessageDigestHashFunction.newHasher(MessageDigestHashFunction.java:93)
	at com.google.common.hash.AbstractHashFunction.newHasher(AbstractHashFunction.java:80)
	at com.google.common.hash.AbstractHashFunction.hashBytes(AbstractHashFunction.java:68)
	at com.google.common.hash.AbstractHashFunction.hashBytes(AbstractHashFunction.java:62)
	at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:57)
	at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:81)
	at com.google.devtools.build.lib.remote.merkletree.MerkleTree.buildMerkleTree(MerkleTree.java:369)
	at com.google.devtools.build.lib.remote.merkletree.MerkleTree.lambda$build$2(MerkleTree.java:303)
	at com.google.devtools.build.lib.remote.merkletree.MerkleTree$$Lambda$1279/0x00007f58a68cacb0.visitDirectory(Unknown Source)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:284)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:278)
	at com.google.devtools.build.lib.remote.merkletree.DirectoryTree.visit(DirectoryTree.java:259)
	at com.google.devtools.build.lib.remote.merkletree.MerkleTree.build(MerkleTree.java:292)
	at com.google.devtools.build.lib.remote.merkletree.MerkleTree.build(MerkleTree.java:260)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.buildInputMerkleTree(RemoteExecutionService.java:397)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.buildRemoteAction(RemoteExecutionService.java:492)

@coeuvre
Member

coeuvre commented Apr 11, 2023

Do you have large tree artifacts in your build? Without the cache, the tree is computed once for each action. If actions that consume the tree run concurrently, memory usage might increase.
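
A back-of-the-envelope sketch of why concurrency matters here, assuming each in-flight action holds its own copy of the input tree. Every number below is hypothetical and chosen only to show the scaling; the function `peak_tree_bytes` is illustrative and does not correspond to anything in Bazel.

```python
def peak_tree_bytes(concurrent_actions, nodes_per_tree, bytes_per_node, shared=False):
    """Rough upper bound on heap held by in-flight input trees.

    shared=False models every action building its own copy of the tree;
    shared=True models a single copy reused by all actions.
    """
    copies = 1 if shared else concurrent_actions
    return copies * nodes_per_tree * bytes_per_node


# Hypothetical inputs: 200 concurrent actions, 100,000 nodes, 200 bytes per node.
per_action = peak_tree_bytes(200, 100_000, 200)
one_copy = peak_tree_bytes(200, 100_000, 200, shared=True)

print(f"one copy per action: {per_action / 2**30:.2f} GiB")
print(f"single shared copy:  {one_copy / 2**30:.2f} GiB")
```

Under these assumptions the per-action cost grows linearly with the number of concurrently running actions, which is why a large tree artifact combined with a high job count can dominate the heap.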

cc @tjgq

@joeljeske
Contributor

Yes, I do have large tree artifacts, primarily due to node_modules. What might you recommend?

@coeuvre
Member

coeuvre commented Apr 11, 2023

Tiago has made huge improvements to --experimental_remote_merkle_tree_cache for large tree artifacts. I would suggest waiting for 6.2 and trying it again.

@tjgq: is it possible to include all the changes you made into 6.2?

@tjgq
Contributor

tjgq commented Apr 11, 2023

Yes, I think they are being (or have already been) cherry-picked into 6.2.

@joeljeske
Contributor

Wonderful, glad to hear these improvements have landed! I'll make sure to evaluate them as the RCs become available. 👍

@joeljeske
Contributor

@tjgq @coeuvre I'm anxiously awaiting 6.2.0 in order to see if this solves my OOM problems 😄

In the meantime, I've noticed that I occasionally also see OOMs on a bazel run using RBE on a targeted, small, incrementally built binary. This is concerning to me, as this binary was already built and should be cached from previous CI iterations. It occurs on the same machines that perform large remote bazel test invocations with large TreeArtifacts, but when it's OOMing while trying to build and run a small binary, I am concerned that something else may be wrong, as I would expect a GC during a small bazel invocation to free plenty of space. What do you think? Do you expect the improvements made in 6.2 to fix this type of OOM?

I do have the following relevant flags set:

startup --host_jvm_args="-Xmx96G"
common --experimental_oom_more_eagerly_threshold=99
build --jobs=200
build --experimental_remote_discard_merkle_trees
build --noexperimental_remote_merkle_tree_cache

(20:11:42) FATAL: bazel ran out of memory and crashed. Printing stack trace:
--
  | java.lang.OutOfMemoryError: RetainedHeapLimiter forcing exit due to GC thrashing: After back-to-back full GCs, the tenured space is more than 99% occupied (102144927016 out of a tenured space size of 103079215104).
  | at com.google.devtools.build.lib.runtime.RetainedHeapLimiter.handle(RetainedHeapLimiter.java:106)
  | at com.google.devtools.build.lib.runtime.MemoryPressureListener.handleNotification(MemoryPressureListener.java:138)
  | at java.management/sun.management.NotificationEmitterSupport.sendNotification(Unknown Source)
  | at jdk.management/com.sun.management.internal.GarbageCollectorExtImpl.createGCNotification(Unknown Source)
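
As a quick sanity check on the figures in the RetainedHeapLimiter message above, the following sketch recomputes the tenured-space occupancy. The two byte counts and the threshold of 99 are copied from the log and the bazelrc; the variable names are made up for illustration.

```python
# Figures copied from the RetainedHeapLimiter message above.
occupied = 102_144_927_016
tenured = 103_079_215_104
threshold_percent = 99  # value of --experimental_oom_more_eagerly_threshold

occupancy_percent = 100 * occupied / tenured
free_mib = (tenured - occupied) / 2**20

print(f"tenured size: {tenured / 2**30:.0f} GiB")
print(f"occupancy:    {occupancy_percent:.2f}%")
print(f"free:         {free_mib:.0f} MiB")
print("over threshold:", occupancy_percent > threshold_percent)
```

The tenured space works out to exactly 96 GiB, matching -Xmx96G, with under 1 GiB still free after back-to-back full GCs when the limiter forced the exit.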

@joeljeske
Contributor

I'm still seeing OOMs occasionally with 6.2.0 and this bazelrc:

startup --host_jvm_args="-Xmx96G"
common --experimental_oom_more_eagerly_threshold=99
build --jobs=200
build --experimental_remote_merkle_tree_cache
build --remote_download_minimal

I think it may be related to #18145 as there are many similarities.

@alexofortune

alexofortune commented May 24, 2023

@joeljeske Dump the heap after a successful BwoB build (or after an OOM, with the heap dump on OOM enabled) and check whether you have a leaked cli-update-thread that is making your heap big (you can do this using a memory analyzer). If so, it should be the same as #18145.

@nickbreen

nickbreen commented May 30, 2023

In case another anecdote helps:

We get this reliably when fetching very large tree artifacts from remote cache.

In our case the problem is node_modules directories, i.e. a very large number of relatively small files.

@joeljeske
Contributor

I continue to see many OOMs with RBE & BwoB on 6.3.0rc1. I was really hoping #18145 would fix the issue, but it has not. @alexofortune have you verified whether the fix works for you in 6.3.0rc1?

@alexofortune

@joeljeske Hey there Joel, sorry to hear the fix didn't pan out for you.

I didn't - we still are on 6.1.0, with the patch that removes the event handler - and that one fixed the issue for us.

@tjgq
Contributor

tjgq commented Aug 29, 2024

I think the way forward here is to implement the conclusions of #21378.

@tjgq closed this as not planned on Aug 29, 2024