
occasionally, bazel hangs, pretending to compile file forever #4216

Closed
mafanasyev-tri opened this issue Dec 4, 2017 · 8 comments

@mafanasyev-tri

Description of the problem / feature request / question:

We run bazel on AWS as part of our CI builds. We tried to enable the sandbox with:
--spawn_strategy=sandboxed --experimental_sandbox_base=/dev/shm

and found that the build process occasionally times out (in maybe 5-10% of builds). The logs always show the same behavior: a single file (a different one every time) takes forever to compile. See the log below for an example; the times on the left are wall-clock time.

12:40:39 [5,909 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 300s linux-sandbox ... (200 actions running)
12:42:01 [6,025 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 379s linux-sandbox ... (199 actions running)
12:43:37 [6,153 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 469s linux-sandbox ... (200 actions running)
12:45:14 [6,344 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 573s linux-sandbox ... (200 actions running)
12:47:21 [6,485 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 693s linux-sandbox ... (200 actions running)
12:49:43 [6,684 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 832s linux-sandbox ... (200 actions running)
12:52:20 [6,866 / 8,454] 12 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 991s linux-sandbox ... (200 actions running)
12:55:12 [8,267 / 8,639] 135 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1174s linux-sandbox ... (200 actions running)
12:58:55 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1386s linux-sandbox
13:02:46 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1627s linux-sandbox
13:08:23 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 1954s linux-sandbox
13:14:15 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 2314s linux-sandbox
13:21:21 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 2734s linux-sandbox
13:29:13 [8,650 / 8,656] 179 / 179 tests; Compiling external/boost/boost_1_65_1/libs/python/src/numpy/dtype.cpp; 3214s linux-sandbox

We have manually examined the server while this fault is going on. We found that:

  • There are no active compiler processes at all -- so this is not a gcc bug, it is a bazel scheduler (or sandbox?) bug
  • Top level bazel process has two threads:
    • One is in a 1 Hz polling loop checking for readability on 2 eventfds and a socket, none of which are ever ready.
    • The other is reading from a pipe shared only with the polling thread above.
  • There is a defunct child process, probably the completed operation. It is not being reaped by the bazel parent process.

We also tried running it without a timeout; the build kept going for more than 8 hours.
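As a side note, the unreaped-child symptom described above can be demonstrated in isolation. The following is a minimal illustration (not Bazel code): a child that has exited but was never waited on by its parent stays in the process table as a zombie, which is exactly what `<defunct>` in ps means.

```python
import os
import time

# Minimal illustration (not Bazel code): a child that has exited but was
# never wait()ed on by its parent shows up as state "Z" ("<defunct>" in ps).
pid = os.fork()
if pid == 0:
    os._exit(0)              # child exits immediately
else:
    time.sleep(0.2)          # give the child time to exit
    # The field after the "(comm)" part of /proc/<pid>/stat is the process
    # state; 'Z' means zombie (exited, not yet reaped).
    with open(f"/proc/{pid}/stat") as f:
        state = f.read().rsplit(")", 1)[1].split()[0]
    print(state)             # 'Z' until the parent reaps it
    os.waitpid(pid, 0)       # reaping removes the <defunct> entry
```

A healthy parent (like Bazel's scheduler is supposed to be) reaps children promptly, so a long-lived zombie suggests the parent never got the exit notification or never acted on it.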

If possible, provide a minimal example to reproduce the problem:

Unfortunately, we have no reliable reproduction recipe -- we cannot even reliably reproduce it on a local machine.

I will keep trying to find a reproduction recipe. If you have any advice on logs to enable or debugging hints, I would happily try them.

Environment info

  • Operating System:
    Ubuntu 16.04.3 LTS
    Linux 4.10.0-37-generic #41~16.04.1-Ubuntu x86_64

  • Bazel version (output of bazel info release):
    release 0.8.0- (@Non-Git)
    the bug was also present on 0.7.0

  • If bazel info release returns "development version" or "(@Non-Git)", please tell us what source tree you compiled Bazel from; git commit hash is appreciated (git rev-parse HEAD):
    release 0.8.0 source .zip from github releases page

Have you found anything relevant by searching the web?

Searched the bazel bug tracker for open bugs matching "hangs", "sandbox", "scheduler", "timeout"; found nothing that seemed related. #2985 is the closest, but it is a different bug.

Anything else, information or logs or outputs that would be helpful?

(If they are large, please upload as attachment or provide link).

iirina added the labels category: sandboxing, P2, and type: bug on Dec 4, 2017
iirina commented Dec 4, 2017

CC: @philwo could you take a look? Is it a sandbox bug?


philwo commented Dec 4, 2017

Hi @mafanasyev-tri,

could you please try running with --spawn_strategy=linux-sandbox instead of just =sandboxed? This will force Bazel to use the more advanced Linux sandbox, which uses PID namespaces etc. for added reliability.

Please report back a) whether that works at all (or immediately errors with a message that Bazel cannot find the mentioned spawn strategy), and b) whether it still causes the problems you've seen.

If yes, we can debug this further. :)

Also: What is the name of the defunct child process you saw? Is it linux-sandbox, or some binary from the C++ compiler?

Cheers,
Philipp

philwo self-assigned this on Dec 4, 2017
mafanasyev-tri (Author) commented:

The linux-sandbox strategy works -- there are no immediate errors.

Unfortunately, the bug still happens -- I ran 10 build jobs, and one of them already seems to be stuck.
The zombie process appears as [java] <defunct>.


philwo commented Dec 4, 2017

That's really weird. Would you mind posting a process tree (something like what "ps axuf" prints) with your Bazel server as the root? Feel free to remove any personal information. I'm just trying to figure out what that job could be...

It sure sounds like Bazel is waiting for this defunct process to exit and this is causing your hanging build, but when using the linux-sandbox, that "java" process would run in a separate PID namespace that is cleaned up by the Linux kernel when the sandbox exits (even when it gets SIGKILL'd or something). No process running in the sandbox should be able to leak out in such a way.

Could you try disabling the Javac persistent worker, too? That should be the only "java" process that Bazel starts outside the sandbox in an otherwise default configuration. The flag for that would be: --strategy=Javac=linux-sandbox.
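For reference, the flags discussed in this thread could be collected in a .bazelrc along these lines (a hypothetical sketch; flag availability and spelling depend on the Bazel version):

```
# Hypothetical .bazelrc combining the flags mentioned in this thread.
build --spawn_strategy=linux-sandbox
build --experimental_sandbox_base=/dev/shm
build --strategy=Javac=linux-sandbox
```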

mafanasyev-tri (Author) commented:

Here is the relevant "ps axuf" output:

ubuntu    11004  0.0  0.0   4508   712 ?        S    16:14   0:00  |           \_ sh -c echo $$ > '/opt/jenkins-slave/workspace/HIDDEN
ubuntu    11006  0.0  0.0   4508   860 ?        S    16:14   0:00  |               \_ /bin/sh -xe /opt/jenkins-slave/workspace/HIDDEN
ubuntu    11007  104 11.1 34563104 6892552 ?    tl   16:14  54:59  |                   \_ bazel(HIDDEN)  -XX:+HeapDumpOnOutOfMemoryError ...
ubuntu    11008  0.0  0.0      0     0 ?        Z    16:14   0:00  |                       \_ [java] <defunct>

Also, jstack on the bazel process does not work; I don't know enough about Java to debug this:

$ sudo jstack -F -l 11007 > js.out1
java.lang.RuntimeException: Unable to deduce type of thread from address 0x00007f09182ea800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
...

I will let you know soon about disabling the Javac persistent worker.

mafanasyev-tri (Author) commented:

Regarding the previous failure:
(1) Apparently the "jstack" command did work at some point; its output appeared in the Jenkins log once the build timed out:
https://gist.github.com/mafanasyev-tri/b068bb27a684263bd5b66d6750054aeb
(2) I also used "gcore" to get a coredump of the process. Here is the backtrace of each thread: https://gist.github.com/mafanasyev-tri/0a51760dd71e28a89126983410bc34ee

It is interesting that the gdb backtrace shows all threads sleeping (the first column is the number of threads):

      1 #0  0x00007f09f8494827 in do_futex_wait (private=0, abstime=0x0, expected=0, futex_word=0x7f09f8465b80 <sig_sem>)
      2 #0  0x00007f09f848d98d in pthread_join (threadid=139680811554560, thread_return=thread_return@entry=0x7ffed6f537f8) at pthread_join.c:90
     19 #0  0x00007f09f8492709 in pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
    226 #0  0x00007f09f8492360 in pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185

while the java thread dump shows one thread runnable in com.google.devtools.build.lib.unix.NativePosixFiles.remove:

      1    java.lang.Thread.State: RUNNABLE             at com.google.devtools.build.lib.unix.NativePosixFiles.remove(Native Method)    --
      1    java.lang.Thread.State: TIMED_WAITING (on object monitor)            at java.lang.Object.wait(Native Method) --
      1    java.lang.Thread.State: TIMED_WAITING (sleeping)             at java.lang.Thread.sleep(Native Method)        --
      1    java.lang.Thread.State: WAITING (on object monitor)          at java.lang.Object.wait(Native Method) 
      3    java.lang.Thread.State: WAITING (on object monitor)          at java.lang.Object.wait(Native Method) --
     17    java.lang.Thread.State: RUNNABLE             --
    199    java.lang.Thread.State: WAITING (parking)            at sun.misc.Unsafe.park(Native Method)  --
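For what it's worth, per-frame counts like the ones above can be produced by collapsing duplicate innermost frames from the saved gdb output. A sketch (the sample data below is made up; substitute the real "thread apply all bt" output, and note that frames are only merged when their #0 lines match exactly):

```shell
# Fake sample of gdb "thread apply all bt" output, one "#0" line per thread.
cat > backtraces.txt <<'EOF'
#0  0x00007f09f8492360 in pthread_cond_wait@@GLIBC_2.3.2 ()
#0  0x00007f09f8492360 in pthread_cond_wait@@GLIBC_2.3.2 ()
#0  0x00007f09f848d98d in pthread_join ()
EOF

# Histogram of innermost frames: how many threads are parked in each one.
grep '^#0' backtraces.txt | sort | uniq -c | sort -rn
```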

mafanasyev-tri (Author) commented:

Hm, I think I have an idea. There was a kernel oops around the time the problematic file appeared; full text at https://gist.github.com/mafanasyev-tri/facfc80d9a36fccd1dc4121b13941fa6 . Summary:

[ 1165.309791] BUG: unable to handle kernel NULL pointer dereference at 00000000000001f8
[ 1165.312086] IP: __radix_tree_replace+0xaa/0x100
...
[ 1165.568151] Call Trace:
[ 1165.568151]  __delete_from_page_cache+0x153/0x3e0
[ 1165.568151]  delete_from_page_cache+0x54/0xe0
[ 1165.568151]  truncate_inode_page+0x8c/0xc0
[ 1165.568151]  shmem_undo_range+0x4b5/0xa70
...
[ 1165.568151]  SyS_unlink+0x16/0x20

Removing a file from tmpfs sounds like something a worker would do, so I think it is likely the oops happened as part of a bazel process. I am not sure how non-fatal kernel oopses are handled, but is it possible that they terminate the thread with extreme prejudice? Then the manager would not know the worker is gone.

mafanasyev-tri (Author) commented:

Kernel upgrade fixed it. Closing.
