We should have full control over the reconnect exponential backoff behavior #25540

mbyio · 2021-02-24T01:57:34Z

Is your feature request related to a problem? Please describe.

We use grpcio 1.35.0 in Python 3.7.7 on Macs and Linux.

At my company, we have a longstanding bug where a server will randomly lose its gRPC connection to an upstream server and then take a long time to reconnect. I had some time to look into it today. I experimented with the keepalive settings, and then I found the reconnect backoff settings. I discovered that the default reconnect timeout is 20 seconds, which is a very long time for our use case!

I found https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md which describes the algorithm you use for reconnect backoff. However, I can't figure out how to configure all of the variables in the algorithm.

According to the docs, I can set the following channel options:

grpc.initial_reconnect_backoff_ms
grpc.min_reconnect_backoff_ms
grpc.max_reconnect_backoff_ms

But that doesn't let me set MIN_CONNECT_TIMEOUT, MULTIPLIER, or JITTER. These are all important, but MIN_CONNECT_TIMEOUT is most important since if I use the default of 20 seconds, the server will be unavailable for 20 seconds, even though if it were to retry, it would most likely recover right away.
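To make the role of each variable concrete, here is a minimal pure-Python sketch of the schedule described in the linked connection-backoff.md. It is a simplified paraphrase of that pseudocode, not gRPC Core's actual implementation. The function name `backoff_schedule` is hypothetical, and the default values are the ones I recall from that document (INITIAL_BACKOFF 1 s, MULTIPLIER 1.6, JITTER 0.2, MAX_BACKOFF 120 s, MIN_CONNECT_TIMEOUT 20 s), so check them against the doc.

```python
import random


def backoff_schedule(attempts, initial_backoff=1.0, multiplier=1.6,
                     jitter=0.2, max_backoff=120.0,
                     min_connect_timeout=20.0, rng=random.uniform):
    """Return a list of (connect_timeout, backoff) pairs in seconds.

    Simplified model of the documented algorithm (an assumption, not
    gRPC Core code): each attempt may run for at least
    MIN_CONNECT_TIMEOUT, and the backoff grows by MULTIPLIER up to
    MAX_BACKOFF, with +/- JITTER applied from the second attempt on.
    """
    schedule = []
    current = initial_backoff
    for attempt in range(attempts):
        if attempt == 0:
            # The first deadline is INITIAL_BACKOFF with no jitter.
            backoff = current
        else:
            current = min(current * multiplier, max_backoff)
            backoff = current + rng(-jitter * current, jitter * current)
        # A connection attempt is given the larger of the backoff
        # window and MIN_CONNECT_TIMEOUT before it is abandoned.
        connect_timeout = max(backoff, min_connect_timeout)
        schedule.append((connect_timeout, backoff))
    return schedule


if __name__ == "__main__":
    for timeout, backoff in backoff_schedule(5, jitter=0.0):
        print(f"attempt timeout {timeout:.2f}s, backoff {backoff:.2f}s")
```

In this model, a single attempt that hangs can occupy the full MIN_CONNECT_TIMEOUT even while the backoff itself is still only a second or two, which is the behavior described above.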

Describe the solution you'd like

It would be easier to fix both this bug and similar bugs in the future if you let us control all the variables used in the algorithm directly. For now, I think I found a solution (see below), but it isn't ideal.

Describe alternatives you've considered

After doing some testing (by connecting to a non-existent server in a loop and seeing how long requests blocked before failing), I found that if I set grpc.initial_reconnect_backoff_ms, that also sets the initial backoff timeout. This is unexpected, but I have a way forward for fixing the bug for now.
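The timing experiment can be sketched roughly as follows. This is an illustrative helper rather than the exact script I used: `time_failed_attempts` is a hypothetical name, and `attempt` stands in for whatever call is expected to fail, for example an RPC on a channel pointed at a port nothing is listening on.

```python
import time


def time_failed_attempts(attempt, count):
    """Call `attempt` `count` times and return how long each call blocked.

    `attempt` is any zero-argument callable that is expected to fail,
    e.g. a wrapper around an RPC to a non-existent server (assumed,
    not shown here).
    """
    durations = []
    for _ in range(count):
        start = time.monotonic()
        try:
            attempt()
        except Exception:
            # Failure is the expected outcome; only the elapsed time matters.
            pass
        durations.append(time.monotonic() - start)
    return durations
```

Running it once per combination of channel options and comparing the durations shows which option actually changes how long a request blocks before failing.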


gnossen · 2021-03-01T18:40:19Z

Reassigning to Yash since this question is really about Core.

yashykt · 2021-03-09T06:04:41Z

I'm not sure what the issue here is.
GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, i.e. "grpc.initial_reconnect_backoff_ms", corresponds to INITIAL_BACKOFF from the algorithm.
The other configurable args are:
GRPC_ARG_MIN_RECONNECT_BACKOFF_MS, i.e. "grpc.min_reconnect_backoff_ms", which corresponds to MIN_CONNECT_TIMEOUT
GRPC_ARG_MAX_RECONNECT_BACKOFF_MS, i.e. "grpc.max_reconnect_backoff_ms", which corresponds to MAX_BACKOFF
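For Python users, the mapping above could be applied along these lines. This is a sketch only: the helper names `reconnect_backoff_options` and `make_channel` and the example values are made up for illustration, while the option keys are the three named in this thread and `grpc.insecure_channel(target, options=...)` is the regular grpcio entry point.

```python
def reconnect_backoff_options(initial_backoff_ms, min_connect_timeout_ms,
                              max_backoff_ms):
    """Build channel options using the mapping described in this thread.

    initial_backoff_ms     -> INITIAL_BACKOFF
    min_connect_timeout_ms -> MIN_CONNECT_TIMEOUT
    max_backoff_ms         -> MAX_BACKOFF
    """
    for name, value in (("initial_backoff_ms", initial_backoff_ms),
                        ("min_connect_timeout_ms", min_connect_timeout_ms),
                        ("max_backoff_ms", max_backoff_ms)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return [
        ("grpc.initial_reconnect_backoff_ms", initial_backoff_ms),
        ("grpc.min_reconnect_backoff_ms", min_connect_timeout_ms),
        ("grpc.max_reconnect_backoff_ms", max_backoff_ms),
    ]


def make_channel(target, options):
    """Open an insecure channel with the given options (requires grpcio)."""
    # Imported here so the helper above can be used without grpcio installed.
    import grpc
    return grpc.insecure_channel(target, options=options)
```

For example, `make_channel("localhost:50051", reconnect_backoff_options(500, 2000, 10000))` would ask for a 0.5 s initial backoff, a 2 s minimum connect timeout, and a 10 s backoff cap; the target and numbers are placeholders, not recommendations.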

mbyio added kind/enhancement priority/P2 labels Feb 24, 2021

mbyio assigned donnadionne Feb 24, 2021

donnadionne assigned gnossen and unassigned donnadionne Mar 1, 2021

donnadionne added the lang/Python label Mar 1, 2021

gnossen assigned yashykt and unassigned gnossen Mar 1, 2021

gnossen added the lang/core label Mar 1, 2021

yashykt closed this as completed Mar 9, 2021

aepfli mentioned this issue Dec 15, 2024

Provider reconnection topic open-feature/flagd#1472

Open
