gRPC resource exhaustion errors during BEP upload #12050

Closed
SrodriguezO opened this issue Sep 4, 2020 · 9 comments
Labels
team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc), untriaged

Comments

@SrodriguezO
Contributor

Description of the problem:

gRPC resource exhaustion errors during BEP upload when a very large build completes very quickly due to cache hits.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We often run into it when a very large build completes very quickly due to cache hits (and --bes_backend is specified).

What operating system are you running Bazel on?

Ubuntu 18.04

What's the output of bazel info release?

release 3.3.0

Have you found anything relevant by searching the web?

Similar issue encountered:

Any other information, logs, or outputs that you want to share?

Error logs:

ERROR: The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=RESOURCE_EXHAUSTED, description=grpc: received message larger than max (10913322 vs. 4194304), cause=null}' RESOURCE_EXHAUSTED: RESOURCE_EXHAUSTED: grpc: received message larger than max (10913322 vs. 4194304) RESOURCE_EXHAUSTED: RESOURCE_EXHAUSTED: grpc: received message larger than max (10913322 vs. 4194304)

Bazel Exit Code:

38

Other Info:

  • We are using BuildBuddy as our BES backend.
  • Our current workaround is to disable BEP for the build in question.
@zachgrayio

zachgrayio commented Sep 4, 2020

This actually looks like a buildbuddy bug a few folks have asked me about recently, not a Bazel issue.

Your BES upload is larger than the gRPC default receive limit of 4 MiB (4194304 bytes), hence the message you see here (you're sending 10913322 bytes, about 10.4 MiB).
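
As a rough aid for reading this kind of error, the two numbers in parentheses are byte counts: the size of the rejected message and the receiver's limit. The sketch below pulls them out of the error text and converts them to MiB. The helper names `parseSizes` and `toMiB` are made up for illustration and are not part of Bazel, gRPC, or BuildBuddy.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizePattern matches the "(<received> vs. <limit>)" part of the error text.
var sizePattern = regexp.MustCompile(`larger than max \((\d+) vs\. (\d+)\)`)

// parseSizes extracts the received message size and the receiver's limit,
// both in bytes, from a gRPC "received message larger than max" error.
func parseSizes(msg string) (received, limit int64, err error) {
	m := sizePattern.FindStringSubmatch(msg)
	if m == nil {
		return 0, 0, fmt.Errorf("no size information found in %q", msg)
	}
	if received, err = strconv.ParseInt(m[1], 10, 64); err != nil {
		return 0, 0, err
	}
	if limit, err = strconv.ParseInt(m[2], 10, 64); err != nil {
		return 0, 0, err
	}
	return received, limit, nil
}

// toMiB converts a byte count to mebibytes.
func toMiB(n int64) float64 {
	return float64(n) / (1 << 20)
}

func main() {
	msg := "grpc: received message larger than max (10913322 vs. 4194304)"
	received, limit, err := parseSizes(msg)
	if err != nil {
		panic(err)
	}
	// Prints: received 10.41 MiB, limit 4.00 MiB
	fmt.Printf("received %.2f MiB, limit %.2f MiB\n", toMiB(received), toMiB(limit))
}
```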

If you're using a fork of buildbuddy, which I assume you are, then you can go and fix this yourself like so:

diff --git a/server/libmain/libmain.go b/server/libmain/libmain.go
--- a/server/libmain/libmain.go
+++ b/server/libmain/libmain.go
@@ -211,6 +211,7 @@ func StartGRPCServiceOrDie(env environment.Env, buildBuddyServer *buildbuddy_ser
        grpcOptions := []grpc.ServerOption{
                rpcfilters.GetUnaryInterceptor(env),
                rpcfilters.GetStreamInterceptor(env),
+               grpc.MaxRecvMsgSize(1024*1024*20),
        }

If you'd like to take a look at alternative BES implementations, feel free to give us a shout :)

@SrodriguezO
Contributor Author

Hey @zachgrayio, thanks for your quick response. Is this something that gets configured on the server or on the client? I found this grpc-gateway comment and this Stack Overflow response suggesting it was a client-side config.

Maybe it needs to be configured in both places? I can't find any references to the gRPC max message size options (MaxRecvMsgSize and the like) in the Bazel project, though.

Also, someone ran into this issue with Buildbarn as well (granted, they could also have the bug if it's a server-side thing).

@zachgrayio

In this case it's a missing server option I think (grpc.MaxRecvMsgSize()).

aiuto added the z-team-Apple (Deprecated. Send to rules_apple, or label team-Rules-CPP + platform:apple) and untriaged labels on Sep 6, 2020
@ulfjack
Contributor

ulfjack commented Sep 7, 2020

Here's some background on this issue: gRPC has a built-in maximum message size controlled by the receiver (in this case the buildbuddy service). The default value in Java is 4 MiB.
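
To illustrate why the limit belongs to the receiver, here is a minimal sketch that uses only the Go standard library, not the gRPC library itself. It mimics gRPC's length-prefixed framing (a 1-byte compression flag followed by a 4-byte big-endian length) and checks the declared length against the receiver's own limit before reading the payload. The function names and the in-memory buffer standing in for the network are assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeFrame writes one length-prefixed message: a 1-byte compression flag
// (0 = uncompressed) followed by a 4-byte big-endian payload length.
// The sender does not consult any size limit.
func writeFrame(w io.Writer, payload []byte) error {
	header := make([]byte, 5)
	binary.BigEndian.PutUint32(header[1:], uint32(len(payload)))
	if _, err := w.Write(header); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one frame and rejects it if the declared length exceeds
// maxRecv. The check runs on the receiver, before the payload is read.
func readFrame(r io.Reader, maxRecv uint32) ([]byte, error) {
	header := make([]byte, 5)
	if _, err := io.ReadFull(r, header); err != nil {
		return nil, err
	}
	length := binary.BigEndian.Uint32(header[1:])
	if length > maxRecv {
		return nil, fmt.Errorf("received message larger than max (%d vs. %d)", length, maxRecv)
	}
	payload := make([]byte, length)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// roundTrip sends a payload of the given size through an in-memory buffer
// and returns how many bytes the receiver accepted.
func roundTrip(size int, maxRecv uint32) (int, error) {
	var wire bytes.Buffer
	if err := writeFrame(&wire, make([]byte, size)); err != nil {
		return 0, err
	}
	payload, err := readFrame(&wire, maxRecv)
	return len(payload), err
}

func main() {
	// The same 5 MiB message is rejected under a 4 MiB limit
	// and accepted under a 20 MiB limit.
	_, err := roundTrip(5<<20, 4<<20)
	fmt.Println("4 MiB limit:", err)
	n, err := roundTrip(5<<20, 20<<20)
	fmt.Println("20 MiB limit:", n, "bytes accepted, err =", err)
}
```

In this sketch the sender behaves identically in both cases; only the receiver's setting decides the outcome.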

Bazel does not automatically limit itself to the server-defined maximum message size. Doing so is difficult, as some of the proto messages in the Build Event Protocol / Service are inherently monolithic, and cannot be automatically broken into separate messages. As such, we were targeting a maximum size of about 50 MiB.

Depending on which event is too large, you may be able to reduce the event size by setting --bes_outerr_chunk_size or --build_event_max_named_set_of_file_entries on the client, i.e., Bazel. The default outerr chunk size is 1 MiB.
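
The idea behind a chunk size limit can be sketched as follows: output is split into consecutive pieces, each small enough to fit in its own message. This illustrates the concept only and is not Bazel's implementation; the `chunk` helper and the 2.5 MB output size are hypothetical.

```go
package main

import "fmt"

// chunk splits data into consecutive slices of at most maxSize bytes each.
// It returns nil when there is nothing to split or maxSize is not positive.
func chunk(data []byte, maxSize int) [][]byte {
	if maxSize <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(data) > 0 {
		n := maxSize
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	output := make([]byte, 2500000) // hypothetical 2.5 MB of stdout/stderr
	// With a 1 MiB chunk size this yields two full chunks and a remainder.
	for i, c := range chunk(output, 1<<20) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```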

Unfortunately, the error message above contains the error code, but not which message caused it.

@ulfjack
Contributor

ulfjack commented Sep 7, 2020

Or set --legacy_important_outputs=false.

@SrodriguezO
Contributor Author

Hey @ulfjack thanks for the tips.

> we were targeting a maximum size of about 50 MiB.

Do you know where this is specified? Mostly just curious.

> Depending on which event is too large, you may be able to reduce the event size by setting --bes_outerr_chunk_size or --build_event_max_named_set_of_file_entries on the client

Interesting. It might make more sense to increase it on the receiver if the client's already allowing 50 MiB. We might look into that approach.

> Or set --legacy_important_outputs=false

I'm curious, how come this might also help? The docs for legacy_important_outputs simply say "Use this to suppress generation of the legacy important_outputs field in the TargetComplete event"

@ulfjack
Contributor

ulfjack commented Sep 8, 2020

I am not aware of any place where that is publicly documented. This is my personal recollection from working on the BEP.

I think it makes sense for BES implementations to provide a knob to allow larger than default packets. However, there are also reasons for preferring smaller packets (e.g., preventing service outages due to memory exhaustion), and there are knobs in Bazel to adjust that as well.

The original BEP design had a repeated field representing a flat list of all 'important' outputs of a configured target. However, this turned out to be problematic because some configured targets have a huge list of such outputs. We then migrated to a nested-set style listing of important outputs. However, this is technically an incompatible change, and so we added the --legacy_important_outputs flag. Maybe we should have called it incompatible_legacy_outputs or something.
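
The size difference between the two representations can be illustrated with a small sketch. The `FileSet` type below is a hypothetical stand-in for a nested set, not the actual BEP proto definition, and the chain of 100 sets with 1000 files each is made-up data. A flat list has to carry every reachable file in one message, while in the nested form each set carries only its own files plus references to other sets.

```go
package main

import "fmt"

// FileSet is a hypothetical stand-in for a nested set of output files:
// it lists some files directly and refers to other sets for the rest.
type FileSet struct {
	ID       int
	Files    []string
	Children []*FileSet
}

// visit calls fn once for every set reachable from root.
func visit(root *FileSet, fn func(*FileSet)) {
	seen := make(map[int]bool)
	var walk func(s *FileSet)
	walk = func(s *FileSet) {
		if s == nil || seen[s.ID] {
			return
		}
		seen[s.ID] = true
		fn(s)
		for _, c := range s.Children {
			walk(c)
		}
	}
	walk(root)
}

// flatten returns every file reachable from root, which is what a single
// flat "all important outputs" list would have to contain.
func flatten(root *FileSet) []string {
	var files []string
	visit(root, func(s *FileSet) { files = append(files, s.Files...) })
	return files
}

// largestSet returns the largest number of entries (direct files plus child
// references) carried by any one set reachable from root.
func largestSet(root *FileSet) int {
	largest := 0
	visit(root, func(s *FileSet) {
		if n := len(s.Files) + len(s.Children); n > largest {
			largest = n
		}
	})
	return largest
}

// buildChain creates the given number of sets, each with perSet files,
// linked so that every set refers to the next one.
func buildChain(sets, perSet int) *FileSet {
	var next *FileSet
	for id := sets - 1; id >= 0; id-- {
		s := &FileSet{ID: id}
		for f := 0; f < perSet; f++ {
			s.Files = append(s.Files, fmt.Sprintf("out/%d/%d.o", id, f))
		}
		if next != nil {
			s.Children = []*FileSet{next}
		}
		next = s
	}
	return next
}

func main() {
	root := buildChain(100, 1000)
	fmt.Println("entries in one flat list:", len(flatten(root)))       // 100000
	fmt.Println("entries in the largest nested set:", largestSet(root)) // 1001
}
```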

jmmv added the team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc) label and removed the z-team-Apple (Deprecated. Send to rules_apple, or label team-Rules-CPP + platform:apple) label on Sep 9, 2020
@siggisim
Contributor

Thanks for reporting @SrodriguezO and for the background @ulfjack.

We've bumped the default max grpc limit in buildbuddy-io/buildbuddy@7cd6929 which should go live in the next release (targeting this afternoon). Configurability incoming as well.

Feel free to upstream your changes in the future @zachgrayio!

@SrodriguezO
Contributor Author

Closing this as it doesn't seem to be a Bazel bug after all. Thank you @siggisim for the quick turnaround!

benjaminp added a commit to benjaminp/bazel that referenced this issue Nov 30, 2021
This flag reduces the largest proto size, which helps avoid sharp edges with remote execution systems (e.g., bazelbuild#12050).

RELNOTES[INC]: --legacy_important_outputs now has a default of false.
benjaminp added a commit to benjaminp/bazel that referenced this issue Nov 30, 2021
This flag reduces the largest BES proto size, which helps avoid sharp edges with remote execution systems (e.g., bazelbuild#12050).

RELNOTES[INC]: --legacy_important_outputs now has a default of false.
bazel-io pushed a commit that referenced this issue May 20, 2022
This flag reduces the largest proto size, which helps avoid sharp edges with remote execution systems (e.g., #12050).

RELNOTES[INC]: --legacy_important_outputs now has a default of false.

Closes #14353.

PiperOrigin-RevId: 449979885
bazel-io pushed a commit that referenced this issue May 20, 2022
*** Reason for rollback ***

Breaking ResultStore customers.

RELNOTES[INC]: --legacy_important_outputs default reverted to true.

*** Original change description ***

Set --legacy_important_outputs to false by default.

This flag reduces the largest proto size, which helps avoid sharp edges with remote execution systems (e.g., #12050).

Closes #14353.

PiperOrigin-RevId: 450067034