We should have full control over the reconnect exponential backoff behavior #25540

Closed
mbyio opened this issue Feb 24, 2021 · 2 comments

Comments

@mbyio

mbyio commented Feb 24, 2021

Is your feature request related to a problem? Please describe.

We use grpcio 1.35.0 with Python 3.7.7 on macOS and Linux.

At my company, we have a longstanding bug where a server will randomly lose its gRPC connection to an upstream server and then take a long time to reconnect. I had some time to look into it today. I experimented with the keepalive settings, and then I found the reconnect backoff settings. I discovered that the default reconnect timeout is 20 seconds, which is a very long time for our use case!

I found https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md, which describes the algorithm you use for reconnect backoff. However, I can't figure out how to configure all of the variables in the algorithm.
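To make the variables concrete, here is a minimal sketch of how the backoff grows between attempts, with jitter left out. Only the 20-second MIN_CONNECT_TIMEOUT default is stated in this thread; the other defaults (1 s initial backoff, multiplier 1.6, 120 s maximum) are assumptions based on my reading of the linked document and should be checked against it. The function names are made up for illustration and are not part of gRPC.

```python
def backoff_schedule(attempts, initial_backoff=1.0, multiplier=1.6,
                     max_backoff=120.0):
    """Return the nominal wait in seconds before each retry (jitter omitted).

    Default values are assumptions; verify them against connection-backoff.md.
    """
    schedule = []
    current = initial_backoff
    for _ in range(attempts):
        schedule.append(current)
        # Grow the backoff geometrically, but never beyond MAX_BACKOFF.
        current = min(current * multiplier, max_backoff)
    return schedule


def connect_timeout(current_backoff, min_connect_timeout=20.0):
    """Per-attempt connect deadline: never shorter than MIN_CONNECT_TIMEOUT."""
    return max(current_backoff, min_connect_timeout)


for wait in backoff_schedule(6):
    print(f"backoff={wait:.2f}s connect_timeout={connect_timeout(wait):.2f}s")
```

Under these assumed defaults, every early attempt is allowed the full 20 seconds to connect, because the backoff itself stays below MIN_CONNECT_TIMEOUT for the first several retries.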

According to the docs, I can set the following channel options:

  • grpc.initial_reconnect_backoff_ms
  • grpc.min_reconnect_backoff_ms
  • grpc.max_reconnect_backoff_ms

But that doesn't let me set MIN_CONNECT_TIMEOUT, MULTIPLIER, or JITTER. All of these matter, but MIN_CONNECT_TIMEOUT matters most: with the default of 20 seconds, the server is unavailable for 20 seconds, even though a retry would most likely succeed right away.
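For reference, a sketch of passing the three documented options to a Python channel. The millisecond values and the target address in the usage note are placeholders chosen for illustration, not recommendations, and the helper names are hypothetical.

```python
def reconnect_options(initial_ms, min_ms, max_ms):
    """Build the channel options list for the three documented backoff args."""
    return [
        ("grpc.initial_reconnect_backoff_ms", initial_ms),
        ("grpc.min_reconnect_backoff_ms", min_ms),
        ("grpc.max_reconnect_backoff_ms", max_ms),
    ]


def make_channel(target, options):
    """Create an insecure channel with the given options (requires grpcio)."""
    import grpc  # imported here so reconnect_options works without grpcio

    return grpc.insecure_channel(target, options=options)
```

Usage would look like `make_channel("localhost:50051", reconnect_options(1000, 1000, 5000))`, where both the address and the numbers are assumptions.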

Describe the solution you'd like

It would be easier to fix both this bug and similar bugs in the future if you let us control all the variables used in the algorithm directly. For now, I think I found a solution (see below), but it isn't ideal.

Describe alternatives you've considered

After doing some testing (by connecting to a non-existent server in a loop and seeing how long requests blocked before failing), I found that if I set grpc.initial_reconnect_backoff_ms, that also sets the initial backoff timeout. This is unexpected, but I have a way forward for fixing the bug for now.
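A rough sketch of that kind of test: time how long each failing call blocks. The timing helper is plain Python; `make_grpc_attempt` is a hypothetical helper that issues a raw unary call to a made-up method path (`/example.Service/Ping`) on a target where nothing is listening, so both the path and the target are assumptions.

```python
import time


def time_attempts(attempt, count):
    """Call `attempt` `count` times and return how long each call blocked."""
    durations = []
    for _ in range(count):
        start = time.monotonic()
        try:
            attempt()
        except Exception:
            # Failure is expected here; only the blocking time matters.
            pass
        durations.append(time.monotonic() - start)
    return durations


def make_grpc_attempt(target, options, method="/example.Service/Ping"):
    """Return a callable that sends one raw unary request (requires grpcio)."""
    import grpc  # imported lazily so time_attempts works without grpcio

    channel = grpc.insecure_channel(target, options=options)
    # With no serializers given, the request and response are raw bytes.
    call = channel.unary_unary(method)

    def attempt():
        call(b"", timeout=60)

    return attempt
```

Running `time_attempts(make_grpc_attempt("localhost:1", options), 10)` with different option lists should show how the settings change the blocking time.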

@gnossen
Contributor

gnossen commented Mar 1, 2021

Reassigning to Yash since this question is really about Core.

gnossen assigned yashykt and unassigned gnossen on Mar 1, 2021
@yashykt
Member

yashykt commented Mar 9, 2021

I'm not sure what the issue is here.
GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, i.e. "grpc.initial_reconnect_backoff_ms", corresponds to INITIAL_BACKOFF from the algorithm.
The other configurable args are:

  • GRPC_ARG_MIN_RECONNECT_BACKOFF_MS ("grpc.min_reconnect_backoff_ms"), which corresponds to MIN_CONNECT_TIMEOUT
  • GRPC_ARG_MAX_RECONNECT_BACKOFF_MS ("grpc.max_reconnect_backoff_ms"), which corresponds to MAX_BACKOFF
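
The mapping described in this comment can be written down as a small lookup. This is a sketch that takes the correspondence above at face value; the helper name is hypothetical, and MULTIPLIER and JITTER are rejected only because no channel arg for them is named in this thread.

```python
# Algorithm variable -> channel arg, as stated in the comment above.
ALGORITHM_TO_CHANNEL_ARG = {
    "INITIAL_BACKOFF": "grpc.initial_reconnect_backoff_ms",
    "MIN_CONNECT_TIMEOUT": "grpc.min_reconnect_backoff_ms",
    "MAX_BACKOFF": "grpc.max_reconnect_backoff_ms",
}


def options_from_algorithm(**seconds):
    """Translate algorithm variables (in seconds) into channel options (ms)."""
    options = []
    for name, value in seconds.items():
        if name not in ALGORITHM_TO_CHANNEL_ARG:
            raise ValueError(f"no channel arg listed for {name}")
        options.append((ALGORITHM_TO_CHANNEL_ARG[name], int(round(value * 1000))))
    return options
```

The resulting list can be passed as the `options` argument of `grpc.insecure_channel`.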
