occasionally, bazel hangs, pretending to compile file forever #4216
Comments
CC: @philwo could you take a look? Is it a sandbox bug? |
Hi @mafanasyev-tri, could you please try running with Please report back if that a) works at all (or immediately errors with a message that Bazel cannot find the mentioned Spawn strategy) and b) if it still causes you the problems you've seen. If yes, we can debug this further. :) Also: What is the name of the defunct child process you saw? Is it linux-sandbox, or some binary from the C++ compiler? Cheers, |
The Unfortunately, the bug still happens -- I run 10 build jobs, and one of them seems to be stuck already. |
That's really weird. Would you mind posting a process tree (something like "ps axuf" prints) with your Bazel server as the root? Feel free to remove any personal information. I'm just trying to figure out what that job could be... It sure sounds like Bazel is waiting for this defunct process to exit and this is causing your hanging build, but when using the linux-sandbox, that "java" process would run in a separate PID namespace that is cleaned up by the Linux kernel when the sandbox exits (even when it gets SIGKILL'd or something). No process running in the sandbox should be able to leak out in such a way. Could you try disabling the Javac persistent worker, too? That should be the only "java" process that Bazel starts outside the sandbox in an otherwise default configuration. The flag for that would be: |
here is relevant
also jstack on the bazel process does not work, I don't know enough about java to debug:
I will let you know about java sandbox disabling soon |
Regarding previous failure: It is interesting that gdb backtrace shows all threads sleeping (the first # is number of threads):
while the java thread dump shows one thread runnable in
|
Hm, I think I may have an idea. There was a kernel oops around the time the problematic file appeared, full text at https://gist.github.com/mafanasyev-tri/facfc80d9a36fccd1dc4121b13941fa6 . Summary:
Removing a file from tmpfs sounds like something a worker would do, so I think it is likely the oops happened as part of a Bazel process. I am not sure how non-fatal kernel oopses are handled, but is it possible they terminate the thread with extreme prejudice? Then the manager would not know the worker is gone. |
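For anyone who wants to check whether a hang lines up with a kernel oops on their own machine, here is a minimal sketch that filters kernel log text for oops-like lines. The patterns are assumptions based on messages commonly printed by x86 kernels of this era; they are not an exhaustive list, and the helper name `find_oops_lines` is made up for illustration.

```python
import re
import sys

# Assumed patterns: typical x86 kernel oops markers, not a complete list.
OOPS_PATTERN = re.compile(
    r"\bOops\b|BUG: unable to handle|general protection fault"
)


def find_oops_lines(kernel_log_text):
    """Return the lines of a kernel log that look like part of an oops."""
    return [
        line
        for line in kernel_log_text.splitlines()
        if OOPS_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Reads kernel log text from standard input and prints matching lines.
    for line in find_oops_lines(sys.stdin.read()):
        print(line)
```

It can be fed from the kernel ring buffer, for example `dmesg | python3 find_oops.py` (the script name is hypothetical); comparing the timestamps of any matches against the time the build stalled is then a manual step.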
Kernel upgrade fixed it. Closing. |
Description of the problem / feature request / question:
We run Bazel on AWS as part of our CI builds. We tried to enable the sandbox with:
--spawn_strategy=sandboxed --experimental_sandbox_base=/dev/shm
and found that the build process times out occasionally (maybe 5-10% of the builds). The logs always show the same behavior: a single file (a different one every time) takes forever to compile. See the log below for an example; the times on the left are wall-clock times.
We have manually examined the server while this fault is going on. We found that:
We also tried running it without a timeout; the build kept going for more than 8 hours.
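Since the discussion above centers on a defunct child process, here is a minimal sketch of how one could list defunct (zombie) processes on a Linux host while a build appears stuck. It reads `/proc/<pid>/stat` directly; the function names `parse_stat` and `find_zombies` are hypothetical helpers, not part of Bazel.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    state, ppid = fields[0], int(fields[1])
    return pid, comm, state, ppid


def find_zombies(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process in state Z (defunct)."""
    zombies = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable mid-scan
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"defunct pid={pid} comm={comm} parent={ppid}")
```

The parent PID in the output shows which process has not yet reaped the zombie, which helps tell whether it belongs to the Bazel server or to something else on the machine.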
If possible, provide a minimal example to reproduce the problem:
Unfortunately, we have no reliable reproduction recipe -- we cannot even reliably reproduce it on a local machine.
I will keep working on finding a reproduction recipe. If you have any advice on logs to enable or other debugging hints, I would happily try them.
Environment info
Operating System:
Ubuntu 16.04.3 LTS
Linux 4.10.0-37-generic #41~16.04.1-Ubuntu x86_64
Bazel version (output of bazel info release):
release 0.8.0- (@Non-Git)
This was also present on 0.7.0.
If bazel info release returns "development version" or "(@Non-Git)", please tell us what source tree you compiled Bazel from; git commit hash is appreciated (git rev-parse HEAD):
release 0.8.0 source .zip from the GitHub releases page
Have you found anything relevant by searching the web?
Searched the Bazel bug tracker for open bugs matching "hangs", "sandbox", "scheduler", "timeout"; found nothing that seemed related. 2985 is the closest thing, but it is a different bug.
Anything else, information or logs or outputs that would be helpful?
(If they are large, please upload as attachment or provide link).