Remote Worker failures on bazel build #3251
Another manifestation of the same error (I think): ERROR: /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/external/com_google_protobuf_java/BUILD.bazel:1:1: Extracting interface @com_google_protobuf_java//:protobuf failed: ijar failed: error executing command |
No idea. There are some cases where we still swallow exceptions from the underlying libraries, and I suspect it's one of those. I'll try to see if I can reproduce this with a patched binary that produces more debugging output. |
I got a stack trace.
|
Well, it's pretty obvious what's going wrong now. Every time we're running an action we download all the relevant inputs and put them all into the same directory, silently overwriting any existing files. If we run two actions in parallel, we can be writing to a file just as we want to execute it, and Linux complains about that. |
ehehehe... nice find, @ulfjack. |
Err, no, that's not right. It is putting them in different directories. |
The RemoteWorker is running the action for every watch request it gets. My new hypothesis is that we get the same watch request twice.... |
WatcherServer has a check-then-act race, in which case it can throw an NPE. If we can get multiple watch requests for the same action, then it can also return Code.NOT_FOUND. |
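To make the kind of race being described concrete, here is a minimal, hypothetical sketch (not the actual WatcherServer code; class, method, and field names are made up) of a check-then-act lookup on an operations map, and one way to make the lookup atomic:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a check-then-act race on an operations map.
class OperationsTracker {
  static class Operation {}

  private final Map<String, Operation> operations = new HashMap<>();

  // Racy: another thread may remove the operation between containsKey() and get(),
  // so get() can return null and a later dereference throws an NPE. If a second
  // watch request arrives after the entry was already removed, the check fails
  // and the server would answer NOT_FOUND instead.
  Operation watchRacy(String name) {
    if (!operations.containsKey(name)) {
      throw new IllegalArgumentException("NOT_FOUND: " + name);
    }
    return operations.get(name); // may already be null by now
  }

  private final ConcurrentHashMap<String, Operation> safeOperations = new ConcurrentHashMap<>();

  // One way to avoid the race: a single atomic lookup, with absence handled once.
  Operation watchSafe(String name) {
    Operation op = safeOperations.get(name);
    if (op == null) {
      throw new IllegalArgumentException("NOT_FOUND: " + name);
    }
    return op;
  }
}
```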
Yeah, my WatcherServer is very simplistic; it strongly counts on the client making only a single watch request. But how is this assumption violated? I indeed only call watch once per action (GrpcRemoteExecutor.java, line 96).
|
I don't know yet. I added some more debug output, and am running builds trying to reproduce, but no success so far. |
In the meantime, I will improve the Watch server to properly synchronize the operations map and see if that fixes this. Of course, understanding what actually caused the error would be much better, but I'll settle for second best :-)
|
I haven't been able to reproduce the error in the last two days. :-( Compared to the code that produced the stack trace I posted, I only have logging changes - how can that be? |
I can reproduce reliably by increasing the number of threads. I think it might be this bug: |
As a workaround, I can make it so the remote worker retries an action if I get the "error=26, Text file busy" error. That might be good enough for now. Unfortunately, if upstream doesn't want to fix this, we'll have to write our own. |
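A rough sketch of what such a workaround could look like, assuming the "Text file busy" condition surfaces as an IOException whose message contains that string; the class name, matching logic, retry count, and back-off are illustrative assumptions, not the actual change:

```java
import java.io.IOException;

// Hypothetical retry wrapper for the ETXTBSY ("error=26, Text file busy") workaround.
final class TextFileBusyRetrier {
  private static final int MAX_ATTEMPTS = 3;

  interface Action<T> {
    T run() throws IOException, InterruptedException;
  }

  static <T> T runWithRetry(Action<T> action) throws IOException, InterruptedException {
    IOException last = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        return action.run();
      } catch (IOException e) {
        // "Text file busy" means the executable was still open for writing somewhere.
        if (e.getMessage() == null || !e.getMessage().contains("Text file busy")) {
          throw e; // not the transient condition we are working around
        }
        last = e;
        Thread.sleep(100L * attempt); // small back-off before re-running the action
      }
    }
    throw last;
  }
}
```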
I get a new error now:
This is without the retry logic. |
More fun exceptions:
|
Ignore that last one, that's a bug in my change. |
Well, testing is difficult, since it's using up all the memory on my machine, making it impossible to use the machine until I physically switch it off and on again. |
That's a great find! Doesn't look like they are fixing this.
If retrying this error works, great. Now, if we only had a generic
retrier... :-)
|
This is happening with as few as a dozen simultaneous requests, with a single client on a single machine. I basically haven't been able to build Bazel on my machine with the remote worker if I set jobs to a bit more than the number of cores on my machine. That's a really bad experience, because I want to use a higher number of jobs when many requests are cache hits, as that significantly improves performance. But I can't know ahead of time whether the actions will be cache hits or not, and manually tweaking jobs on the client isn't a good UI either. |
I wonder if this behavior is also related to the change back to a single gRPC channel. We have experienced a much higher volume of 502s due to this on our end, too (I think you're cc-ed on the internal bug). I'm considering implementing a gRPC channel pool (I want to get some advice from the gRPC folks before doing that). Actually, let me check if that makes a difference.
|
Also, re: queueing -- if we had explicit queueing in the RemoteWorker, we'd still have to increase the remote_timeout in Bazel to account for the queueing delays. I am wondering, however, whether a thread actually gets starved somewhere, because 6 attempts with a 60-second timeout should add up to a lot of time. Does it really take 6 minutes before failing? Can you please paste the --verbose_failures output?
|
It definitely does not take 6 minutes to hit the DEADLINE_EXCEEDED. It takes a bit more than 60 seconds, so maybe retries don't work properly? I don't think we'd need to increase the timeout - we'd queue the action for execution, but report back regularly via the Watcher API. I thought that's why we're using a long-running operation? |
Yeah, there's definitely something wrong there. I am looking into it as
well.
Sorry, you're right, that's why I didn't set a timeout on the Watcher API
at all. That's why I wanted to see the stack trace -- I suspect it's not
the Watcher call that hits this, or that I have a bug in dealing with
command execution timeout (which throws DEADLINE_EXCEEDED, but then is
supposed to put it onto the Operation error field instead of propagating
via onError).
|
Btw, another small issue I need to fix -- when the server fails on an operation, e.g. with an IOException, it gets raised as INTERNAL_ERROR, but then a simple retry obviously hits a NOT_FOUND -- which hides the problem. The bug is that I need to retry both execute and watch on a NOT_FOUND.
|
I get this fun error in Bazel: Server terminated abruptly (error code: 14, error message: '', log file: '/usr/local/google/home/ulfjack/.cache/bazel/_bazel_ulfjack/5c706281d73fa9baac5a257912930c73/server/jvm.out') |
Here's the --verbose_failures output from a DEADLINE_EXCEEDED, which I just managed to reproduce again...
|
|
Ok, I have a set of changes that make the remote worker not completely lock up my machine. But I still get DEADLINE_EXCEEDED when building from scratch, and also the odd Bazel server crash with no error messages and no output (except as noted above). |
And apparently, netty is allocating huge amounts of direct memory. Maybe we're holding it wrong? io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 1677721 byte(s) of direct memory (used: 8120172847, max: 8135442432) |
Ok, I can at least confirm that the "Server terminated abruptly" and the "OutOfDirectMemoryError" are connected. We're actually redirecting stdout / stderr within Java to go over gRPC, so we can lose error messages if gRPC crashes. I've removed the redirection (instead, it goes to an output file), and now I can see the exception stack traces. |
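For illustration, a minimal sketch of redirecting the JVM's stdout/stderr to a log file so stack traces survive even if the transport crashes; the class name and file path are assumptions, not the actual RemoteWorker change:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

// Minimal sketch: point System.out / System.err at a log file instead of the RPC
// stream, so crashes in the transport can't swallow the error output.
final class LogRedirect {
  static void redirectToFile(String path) throws IOException {
    PrintStream log =
        new PrintStream(new FileOutputStream(path, /*append=*/ true), /*autoFlush=*/ true);
    System.setOut(log);
    System.setErr(log);
  }

  public static void main(String[] args) throws IOException {
    redirectToFile("/tmp/remote_worker.log"); // example path only
    System.err.println("errors now end up in the log file");
  }
}
```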
Actually, there are more cases where we get "Server terminated abruptly". Apparently the remote execution use of gRPC is starving out the client/server connection and the client is losing the connection, even though the server is still running. |
Maybe it's getting 'deadline exceeded'? |
Status 14 is UNAVAILABLE according to StatusCode. |
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 1677721 byte(s) of direct memory (used: 8120172847, max: 8135442432) |
We think the main problem is that there's no flow control at the application level - we're queueing all the chunks at once, which means reading all the files for upload into memory at the same time. |
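One common way to bound that memory use is to cap the number of in-flight chunks before handing them to the transport. A sketch under assumptions (chunk size, concurrency limit, and the upload() placeholder are made up; this is not the actual Bazel uploader):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Sketch of application-level flow control: instead of queueing every chunk up
// front (and thus buffering all files in memory), cap the number of buffered,
// in-flight chunks with a semaphore.
final class BoundedUploader {
  private static final int CHUNK_SIZE = 64 * 1024;
  private final Semaphore inFlight = new Semaphore(8); // at most 8 chunks buffered at once
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  void uploadFiles(List<Path> files) throws IOException, InterruptedException {
    for (Path file : files) {
      try (InputStream in = Files.newInputStream(file)) {
        byte[] buf = new byte[CHUNK_SIZE];
        int n;
        while ((n = in.read(buf)) > 0) {
          inFlight.acquire(); // block before buffering another chunk
          byte[] chunk = Arrays.copyOf(buf, n);
          pool.execute(() -> {
            try {
              upload(chunk); // placeholder for handing the chunk to the transport
            } finally {
              inFlight.release(); // free the slot once the chunk is sent
            }
          });
        }
      }
    }
  }

  private void upload(byte[] chunk) {
    // placeholder: the real code would write the chunk to the gRPC stream here
  }
}
```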
The DEADLINE_EXCEEDED problem is because the stubs are memoized. gRPC computes the deadline when the stubs are created, not when they're used. |
So the deadline isn't applied per call, but per build, which isn't right. |
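This pitfall can be shown with io.grpc.Deadline directly (a toy demo, not the actual client code): the deadline becomes an absolute point in time when it is created, so a stub built once with withDeadlineAfter(60, SECONDS) and then memoized shares a single 60-second window across the whole build instead of giving each call its own 60 seconds.

```java
import io.grpc.Deadline;
import java.util.concurrent.TimeUnit;

// Toy demo of the deadline pitfall: Deadline.after() fixes an absolute point in
// time at creation, not at the moment the RPC actually runs.
final class DeadlineDemo {
  public static void main(String[] args) throws InterruptedException {
    Deadline perBuild = Deadline.after(60, TimeUnit.SECONDS); // created once, e.g. with the stub

    Thread.sleep(2_000); // time passes while other actions run

    // The remaining time has already shrunk, even though no call was made yet.
    System.out.println("remaining ms: " + perBuild.timeRemaining(TimeUnit.MILLISECONDS));

    // Per-call fix: create the deadline (or call stub.withDeadlineAfter(...))
    // immediately before each RPC so every call gets the full 60 seconds.
    Deadline perCall = Deadline.after(60, TimeUnit.SECONDS);
    System.out.println("fresh deadline ms: " + perCall.timeRemaining(TimeUnit.MILLISECONDS));
  }
}
```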
New error: |
Ouch, sorry about that! Let me send you a CL right away.
|
re: errors: I don't see yet how that could happen, but I do know how to fix it (I think). Will add it to the CL in review. |
Wait, I just realized I have always assumed that the uploaded digests are distinct. That is incorrect, and in fact I should filter out the duplicates before they get into the Chunker. I'm still not sure how that could cause errors to have fewer values than failedDigests (we modify them at the same time; one is a list, the other is a set), but I am now seeing more evidence that the observer threads do not terminate when I thought they would. |
It might be a bug in my own code. I only rewrote the chunker, the uploader, the downloader, the byte stream server, the watch server, and the execution server. I might have a few CLs to send out this week... |
Might be, but now I think it is a bug in my code. Here is what I think happens: the Chunker chunks 5 inputs, of which only 4 have distinct digests (2 inputs happen to have the same content). Suppose the duplicated file actually fails on upload. Then the number of failedDigests is 1, but the number of uploads I will need to execute again (in the current code) is 2! That screws up my whole multithreading assumption, because I instantiate the finishLatch to await on one item, when in reality I spawn two threads. |
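A toy illustration of that latch mismatch (digest strings and structure are made up, not the real uploader): if the latch is sized by unique digests but work is spawned per input, a duplicate input leaves a thread unaccounted for; deduplicating before spawning keeps the two counts equal, in line with filtering duplicates before they reach the Chunker.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

// Toy illustration of the latch/upload count mismatch described above.
final class LatchMismatchDemo {
  public static void main(String[] args) throws InterruptedException {
    List<String> inputDigests = Arrays.asList("d1", "d2", "d2", "d3", "d4"); // "d2" appears twice
    Set<String> unique = new LinkedHashSet<>(inputDigests);

    // Buggy sizing: 4 unique digests -> latch of 4, but 5 uploads would be spawned.
    System.out.println("latch count " + unique.size() + " vs uploads " + inputDigests.size());

    // Fix: deduplicate *before* spawning work, so every upload has a latch slot.
    CountDownLatch latch = new CountDownLatch(unique.size());
    for (String digest : unique) {
      new Thread(() -> {
        // upload(digest) would go here
        latch.countDown();
      }).start();
    }
    latch.await(); // the await now matches the number of spawned threads exactly
  }
}
```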
I'm afraid we need to rewrite this code entirely to add flow control, for which I have a partial CL, so I wouldn't spend much time on this right now if I was you. |
When trying to build Bazel using the RemoteWorker, I get intermittent errors such as:
ERROR: /usr/local/google/home/olaola/full-bazel/third_party/BUILD:75:1: Extracting interface //third_party:android_common_25_0_0 failed: ijar failed: error executing command
(cd /usr/local/google/home/olaola/.cache/bazel/_bazel_olaola/098d19699a9b31c0ee0da4e8c41e90b9/execroot/io_bazel &&
exec env -
PATH=/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/local/google/home/olaola/work/bin:/usr/local/google/home/olaola/depot_tools:/usr/local/google/home/olaola/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/google/home/olaola/google-cloud-sdk/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin
external/bazel_tools/tools/jdk/ijar/ijar third_party/android_common/com.android_annotations_25.0.0.jar bazel-out/local-fastbuild/genfiles/third_party/_ijar/android_common_25_0_0/third_party/android_common/com.android_annotations_25.0.0-ijar.jar): Exit -1.
ERROR: /usr/local/google/home/olaola/full-bazel/third_party/BUILD:75:1: output 'third_party/_ijar/android_common_25_0_0/third_party/android_common/com.android.tools.lint_lint-api_25.0.0-ijar.jar' was not created.
The command actually failed on the RemoteWorker, this is not a gRPC issue.