
occasionally, bazel hangs, pretending to compile file forever #4216

Closed
mafanasyev-tri opened this issue Dec 4, 2017 · 8 comments

@mafanasyev-tri

Description of the problem / feature request / question:

We run bazel on AWS as part of our CI builds. We tried to enable the sandbox with:
--spawn_strategy=sandboxed --experimental_sandbox_base=/dev/shm

and found that the build process occasionally times out (in maybe 5-10% of builds). The logs always show the same behavior: a single file (a different one every time) takes forever to compile. See the log below for an example; the times on the left are wall-clock time.

12:40:39 [5,909 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 300s linux-sandbox ... (200 actions running)
12:42:01 [6,025 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 379s linux-sandbox ... (199 actions running)
12:43:37 [6,153 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 469s linux-sandbox ... (200 actions running)
12:45:14 [6,344 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 573s linux-sandbox ... (200 actions running)
12:47:21 [6,485 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 693s linux-sandbox ... (200 actions running)
12:49:43 [6,684 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 832s linux-sandbox ... (200 actions running)
12:52:20 [6,866 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 991s linux-sandbox ... (200 actions running)
12:55:12 [8,267 / 8,639] 135 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1174s linux-sandbox ... (200 actions running)
12:58:55 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1386s linux-sandbox
13:02:46 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1627s linux-sandbox
13:08:23 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1954s linux-sandbox
13:14:15 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 2314s linux-sandbox
13:21:21 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 2734s linux-sandbox
13:29:13 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 3214s linux-sandbox

We have manually examined the server while this fault is going on. We found that:

  • There are no active compiler processes at all -- so this is not a gcc bug, it is a bazel scheduler (or sandbox?) bug
  • Top level bazel process has two threads:
    • One is in a 1 Hz polling loop checking for readability on 2 eventfds and a socket, none of which are ever ready.
    • The other is reading from a pipe shared only with the polling thread above.
  • There is a defunct child process, probably the completed operation. It is not being reaped by the bazel parent process.

We also tried running it without a timeout; the build kept going for more than 8 hours.
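As a side note, the unreaped-child symptom described above can be demonstrated in isolation. The following is a minimal illustration (not Bazel code): a child that has exited but was never waited on by its parent stays in the process table as a zombie, which is exactly what `<defunct>` in ps means.

```python
import os
import time

# Minimal illustration (not Bazel code): a child that has exited but was
# never wait()ed on by its parent shows up as state "Z" ("<defunct>" in ps).
pid = os.fork()
if pid == 0:
    os._exit(0)              # child exits immediately
else:
    time.sleep(0.2)          # give the child time to exit
    # The field after the "(comm)" part of /proc/<pid>/stat is the process
    # state; 'Z' means zombie (exited, not yet reaped).
    with open(f"/proc/{pid}/stat") as f:
        state = f.read().rsplit(")", 1)[1].split()[0]
    print(state)             # 'Z' until the parent reaps it
    os.waitpid(pid, 0)       # reaping removes the <defunct> entry
```

A healthy parent (like Bazel's scheduler is supposed to be) reaps children promptly, so a long-lived zombie suggests the parent never got the exit notification or never acted on it.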

If possible, provide a minimal example to reproduce the problem:

Unfortunately, we have no reliable reproduction recipe -- we cannot even reliably reproduce it on a local machine.

I will keep trying to find a reproduction recipe. If you have any advice on logs to enable or debugging hints, I would happily try them.

Environment info

  • Operating System:
    Ubuntu 16.04.3 LTS
    Linux 4.10.0-37-generic #41~16.04.1-Ubuntu x86_64

  • Bazel version (output of bazel info release):
    release 0.8.0- (@Non-Git)
    the bug was also present on 0.7.0

  • If bazel info release returns "development version" or "(@Non-Git)", please tell us what source tree you compiled Bazel from; git commit hash is appreciated (git rev-parse HEAD):
    release 0.8.0 source .zip from github releases page

Have you found anything relevant by searching the web?

Searched the bazel bug tracker for open bugs matching "hangs", "sandbox", "scheduler", "timeout"; found nothing that seemed related. #2985 is the closest, but it is a different bug.

Anything else, information or logs or outputs that would be helpful?

(If they are large, please upload as attachment or provide link).

iirina added the labels category: sandboxing, P2, and type: bug on Dec 4, 2017
iirina commented Dec 4, 2017

CC: @philwo could you take a look? Is it a sandbox bug?


philwo commented Dec 4, 2017

Hi @mafanasyev-tri,

could you please try running with --spawn_strategy=linux-sandbox instead of just =sandboxed? This will force Bazel to use the more advanced Linux sandbox, which uses PID namespaces etc. for added reliability.

Please report back a) whether that works at all (or immediately errors with a message that Bazel cannot find the mentioned spawn strategy), and b) whether it still causes the problems you've seen.

If yes, we can debug this further. :)

Also: What is the name of the defunct child process you saw? Is it linux-sandbox, or some binary from the C++ compiler?

Cheers,
Philipp

philwo self-assigned this on Dec 4, 2017
mafanasyev-tri (Author) commented:

The linux-sandbox strategy works -- there are no immediate errors.

Unfortunately, the bug still happens -- I ran 10 build jobs, and one of them already seems to be stuck.
The zombie process appears as [java] <defunct>.


philwo commented Dec 4, 2017

That's really weird. Would you mind posting a process tree (something like what "ps axuf" prints) with your Bazel server as the root? Feel free to remove any personal information. I'm just trying to figure out what that job could be...

It sure sounds like Bazel is waiting for this defunct process to exit and this is causing your hanging build, but when using the linux-sandbox, that "java" process would run in a separate PID namespace that is cleaned up by the Linux kernel when the sandbox exits (even when it gets SIGKILL'd or something). No process running in the sandbox should be able to leak out in such a way.

Could you try disabling the Javac persistent worker, too? That should be the only "java" process that Bazel starts outside the sandbox in an otherwise default configuration. The flag for that would be: --strategy=Javac=linux-sandbox.
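For reference, the flags discussed in this thread could be collected in a .bazelrc along these lines (a hypothetical sketch; flag availability and spelling depend on the Bazel version):

```
# Hypothetical .bazelrc combining the flags mentioned in this thread.
build --spawn_strategy=linux-sandbox
build --experimental_sandbox_base=/dev/shm
build --strategy=Javac=linux-sandbox
```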

mafanasyev-tri (Author) commented:

Here is the relevant "ps axuf" output:

ubuntu    11004  0.0  0.0   4508   712 ?        S    16:14   0:00  |           \_ sh -c echo $$ > '/opt/jenkins-slave/workspace/HIDDEN
ubuntu    11006  0.0  0.0   4508   860 ?        S    16:14   0:00  |               \_ /bin/sh -xe /opt/jenkins-slave/workspace/HIDDEN
ubuntu    11007  104 11.1 34563104 6892552 ?    tl   16:14  54:59  |                   \_ bazel(HIDDEN)  -XX:+HeapDumpOnOutOfMemoryError ...
ubuntu    11008  0.0  0.0      0     0 ?        Z    16:14   0:00  |                       \_ [java] <defunct>

Also, jstack on the bazel process does not work; I don't know enough about Java to debug this:

$ sudo jstack -F -l 11007 > js.out1
java.lang.RuntimeException: Unable to deduce type of thread from address 0x00007f09182ea800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
...

I will let you know soon about disabling the Javac persistent worker.

mafanasyev-tri (Author) commented:

Regarding the previous failure:
(1) Apparently the "jstack" command did work at some point; its output appeared in the Jenkins log once the build timed out:
https://gist.github.com/mafanasyev-tri/b068bb27a684263bd5b66d6750054aeb
(2) I also used "gcore" to get a coredump of the process. Here is the backtrace of each thread: https://gist.github.com/mafanasyev-tri/0a51760dd71e28a89126983410bc34ee

It is interesting that the gdb backtrace shows all threads sleeping (the first column is the number of threads):

      1 #0  0x00007f09f8494827 in do_futex_wait (private=0, abstime=0x0, expected=0, futex_word=0x7f09f8465b80 <sig_sem>)
      2 #0  0x00007f09f848d98d in pthread_join (threadid=139680811554560, thread_return=thread_return@entry=0x7ffed6f537f8) at pthread_join.c:90
     19 #0  0x00007f09f8492709 in pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
    226 #0  0x00007f09f8492360 in pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185

while the java thread dump shows one thread runnable in com.google.devtools.build.lib.unix.NativePosixFiles.remove:

      1    java.lang.Thread.State: RUNNABLE             at com.google.devtools.build.lib.unix.NativePosixFiles.remove(Native Method)    --
      1    java.lang.Thread.State: TIMED_WAITING (on object monitor)            at java.lang.Object.wait(Native Method) --
      1    java.lang.Thread.State: TIMED_WAITING (sleeping)             at java.lang.Thread.sleep(Native Method)        --
      1    java.lang.Thread.State: WAITING (on object monitor)          at java.lang.Object.wait(Native Method) 
      3    java.lang.Thread.State: WAITING (on object monitor)          at java.lang.Object.wait(Native Method) --
     17    java.lang.Thread.State: RUNNABLE             --
    199    java.lang.Thread.State: WAITING (parking)            at sun.misc.Unsafe.park(Native Method)  --
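For what it's worth, per-frame counts like the ones above can be produced by collapsing duplicate innermost frames from the saved gdb output. A sketch (the sample data below is made up; substitute the real "thread apply all bt" output, and note that frames are only merged when their #0 lines match exactly):

```shell
# Fake sample of gdb "thread apply all bt" output, one "#0" line per thread.
cat > backtraces.txt <<'EOF'
#0  0x00007f09f8492360 in pthread_cond_wait@@GLIBC_2.3.2 ()
#0  0x00007f09f8492360 in pthread_cond_wait@@GLIBC_2.3.2 ()
#0  0x00007f09f848d98d in pthread_join ()
EOF

# Histogram of innermost frames: how many threads are parked in each one.
grep '^#0' backtraces.txt | sort | uniq -c | sort -rn
```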

mafanasyev-tri (Author) commented:

Hm, I think I have an idea. There was a kernel oops around the time the problematic file appeared; full text at https://gist.github.com/mafanasyev-tri/facfc80d9a36fccd1dc4121b13941fa6 . Summary:

[ 1165.309791] BUG: unable to handle kernel NULL pointer dereference at 00000000000001f8
[ 1165.312086] IP: __radix_tree_replace+0xaa/0x100
...
[ 1165.568151] Call Trace:
[ 1165.568151]  __delete_from_page_cache+0x153/0x3e0
[ 1165.568151]  delete_from_page_cache+0x54/0xe0
[ 1165.568151]  truncate_inode_page+0x8c/0xc0
[ 1165.568151]  shmem_undo_range+0x4b5/0xa70
...
[ 1165.568151]  SyS_unlink+0x16/0x20

Removing a file from tmpfs sounds like something a worker would do, so I think it is likely the oops happened as part of a bazel process. I am not sure how non-fatal kernel oopses are handled, but is it possible that they terminate the thread with extreme prejudice? Then the manager would not know the worker is gone.

mafanasyev-tri (Author) commented:

Kernel upgrade fixed it. Closing.
