Client reconnects way too frequently #2739

ejona86 · 2019-04-03T22:40:06Z

On connection failure, I'm seeing the Go client reconnect .5-1s after the initial failure, with no backoff. This may only impact DNS+pick first.

I used a modified server that had no ALPN. I wanted to test Go's TLS behavior, and it should fail connecting. However, I noticed that the reconnection attempts were too frequent.

It looks like the connection failure is triggering a DNS refresh. Since there is no DNS rate limiting (hi #2402 !), it resolves quickly. It seems DNS notifies the "new" addresses, even though they are the same. tryUpdateAddrs fails because the connection is TRANSIENT_FAILURE and so acBalancerWrapper.UpdateAddresses tears down the ac and re-creates it, destroying the backoff.

GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info ./client --server_host_override=foo.test.google.fr --use_test_ca=true --server_port=8080 --use_tls
...
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, CONNECTING
WARNING: 2019/04/03 15:13:07 grpc: addrConn.createTransport failed to connect to {127.0.0.1:8080 0  <nil>}. Err :write tcp 127.0.0.1:41052->127.0.0.1:8080: write: broken pipe. Reconnecting...
WARNING: 2019/04/03 15:13:07 grpc: addrConn.createTransport failed to connect to {[::1]:8080 0  <nil>}. Err :tls: use of closed connection. Reconnecting...
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, TRANSIENT_FAILURE
INFO: 2019/04/03 15:13:07 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 127.0.0.1:53: no such host.
INFO: 2019/04/03 15:13:07 grpc: failed dns TXT record lookup due to lookup _grpc_config.localhost on 127.0.0.1:53: no such host.
INFO: 2019/04/03 15:13:07 ccResolverWrapper: got new service config: 
INFO: 2019/04/03 15:13:07 ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:8080 0  <nil>} {[::1]:8080 0  <nil>}]
INFO: 2019/04/03 15:13:07 addrConn: tryUpdateAddrs curAddr: { 0  <nil>}, addrs: [{127.0.0.1:8080 0  <nil>} {[::1]:8080 0  <nil>}]
INFO: 2019/04/03 15:13:07 addrConn: tryUpdateAddrs curAddr: { 0  <nil>}, addrs: [{127.0.0.1:8080 0  <nil>} {[::1]:8080 0  <nil>}]
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, CONNECTING
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, CONNECTING

The text was updated successfully, but these errors were encountered:

menghanl · 2019-04-03T23:01:48Z

With DNS and pickfirst:

connection fails, addrConn enters transient failure, and triggers re-resolve()
DNS re-resolve() doesn't see new addresses, but will send the same addresses to ClientConn
The address list is sent to pickfirst
pickfirst update SubConn's address with the new list (but content didn't change)
SubConn tryUpdateAddress() returns false (this is where things go wrong), and will tear down the old addrConn, create a new addrConn, thus reset backoff

ejona86 added the Type: Bug label Apr 3, 2019

menghanl self-assigned this Apr 3, 2019

menghanl added the P1 label Apr 3, 2019

menghanl mentioned this issue Apr 4, 2019

addrConn: not reset backoff if update address in TransientFailure #2740

Merged

menghanl closed this as completed in #2740 May 30, 2019

lock bot locked as resolved and limited conversation to collaborators Nov 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Client reconnects way too frequently #2739

Client reconnects way too frequently #2739

ejona86 commented Apr 3, 2019

menghanl commented Apr 3, 2019 •

edited

Loading

Client reconnects way too frequently #2739

Client reconnects way too frequently #2739

Comments

ejona86 commented Apr 3, 2019

menghanl commented Apr 3, 2019 • edited Loading

menghanl commented Apr 3, 2019 •

edited

Loading