Running a big build on Remote execution causes Bazel to OOM #16913
Comments
Does |
I have |
Can you try without |
OK, yes, I will try that. That is unexpected to me; I do not think I understand how those flags work. I thought experimental_remote_merkle_tree_cache enabled caching the Merkle trees in the remote cache, and that experimental_remote_discard_merkle_trees decided how long to keep them in memory during an action. Is there anything I can read to understand this better? |
AFAICT, the best thing to read is the source code. |
Thanks @coeuvre, I have flipped the flags accordingly and have given the Bazel server 96G of RAM. I still occasionally (~5% of the time) see Bazel OOM on incremental remote builds. Oddly, I have not yet seen it on a cleaned-client build. I am looking through the server log for clues. I see very frequent full GCs, but I suppose that's to be expected before an OOM. I'm not yet able to retrieve the heap profile; the job times out before it's available. Even more suspiciously, I often see these OOM events clustered together, occurring on different machines at similar times while building near-sequential commits in parallel on our trunk. I cannot tell whether something on the remote cluster (Buildbarn) is misbehaving and causing the clients to OOM, or whether the incremental builds on each of these Bazel servers get hung up on a change that each is attempting to process. Would you have any suggestions? Stack Trace
|
Do you have large tree artifacts in your build? Without the cache, the tree is computed once for each action. If actions that consume the tree run concurrently, that might increase memory usage. cc @tjgq |
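To illustrate the point above about the tree being computed once per action, here is a minimal toy sketch in Python. It is not Bazel's implementation: the function merkle_digest, the helper make_tree, and the id-keyed cache are all hypothetical names chosen for illustration. It counts how many nodes get hashed when three "actions" consume the same large directory tree, with and without a shared cache. Hashing work is used here only as a rough proxy; the memory concern in the thread is that concurrent recomputations each hold their own transient copy of the tree.

```python
import hashlib


def make_tree(num_packages):
    """Build a toy node_modules-like tree: many small files in many directories."""
    return {
        f"pkg{i}": {"index.js": f"module {i}".encode(), "package.json": b"{}"}
        for i in range(num_packages)
    }


def merkle_digest(node, cache=None, stats=None):
    """Return a hex digest for a file (bytes) or a directory (dict of children).

    Directories are memoized in `cache` when one is supplied. The cache is keyed
    by object identity, which is an assumption that only suits this toy example.
    """
    is_dir = isinstance(node, dict)
    if is_dir and cache is not None and id(node) in cache:
        return cache[id(node)]
    if stats is not None:
        stats["hashed"] += 1  # count every node that is actually hashed
    if is_dir:
        h = hashlib.sha256()
        for name in sorted(node):
            h.update(name.encode())
            h.update(merkle_digest(node[name], cache, stats).encode())
        digest = h.hexdigest()
        if cache is not None:
            cache[id(node)] = digest
    else:
        digest = hashlib.sha256(node).hexdigest()
    return digest


tree = make_tree(100)  # 1 root + 100 directories + 200 files = 301 nodes

# Three actions consume the same tree without a shared cache.
uncached_stats = {"hashed": 0}
uncached = [merkle_digest(tree, None, uncached_stats) for _ in range(3)]

# The same three actions share one cache.
cached_stats = {"hashed": 0}
shared_cache = {}
cached = [merkle_digest(tree, shared_cache, cached_stats) for _ in range(3)]

print(uncached_stats["hashed"], cached_stats["hashed"])
```

In this toy model the uncached run hashes every node once per action, while the cached run hashes each node once in total. The trade-off is that the cache itself retains entries, so whether caching helps or hurts memory in a real build depends on details this sketch does not model.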
Yes, I do have large tree artifacts, primarily due to node_modules. What might you recommend? |
Tiago has made huge improvement to @tjgq: is it possible to include all the changes you made into 6.2? |
Yes, I think they are being (or have already been) cherry-picked into 6.2. |
Wonderful, glad to hear these improvements have landed! I'll make sure to evaluate them as the RCs become available. 👍 |
@tjgq @coeuvre I'm anxiously awaiting 6.2.0 in order to see if this solves my OOM problems 😄 In the meantime, I've noticed that occasionally I also see OOMs on an I do have the following relevant flags set:
|
I'm still seeing OOMs occasionally with 6.2.0 and bazelrc
I think it may be related to #18145 as there are many similarities. |
@joeljeske Dump the heap after a successful BwoB build (or after an OOM, with heap dump on OOM enabled) and check if you have a leaked |
In case another anecdote helps: we get this reliably when fetching very large tree artifacts from the remote cache. In our case the problem is node_modules directories, i.e. a very large number of relatively small files. |
I continue to see many OOMs with RBE & BwoB on 6.3.0rc1. I was really hoping #18145 would fix the issue, but it has not. @alexofortune, have you verified whether the fix works for you in 6.3.0rc1? |
@joeljeske Hey there Joel, sorry to hear the fix didn't pan out for you. I didn't - we still are on 6.1.0, with the patch that removes the event handler - and that one fixed the issue for us. |
I think the way forward here is to implement the conclusions of #21378. |
Description of the bug:
Bazel throws "FATAL: bazel ran out of memory and crashed." when running our build (Swift, Objective-C, C++) on remote execution. I can see the Bazel process using up to 65GB+ of RAM. I ran the build with --heap_dump_on_oom and I can reproduce it consistently when running a full clean build on remote execution. If I retry the build, I can get it to complete after 2-3 OOM exceptions.
The full stacktrace is the following:
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Build a big codebase on remote execution. Unfortunately I can't provide an example at the moment, but a full clean build is about 15K actions executed remotely (plus some locally as well of course).
Which operating system are you running Bazel on?
macOS 13.0
What is the output of bazel info release?
6.0.0rc2
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
Something interesting is that passing --experimental_remote_merkle_tree_cache seems to work around the issue.