Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client reconnects way too frequently #2739

Closed
ejona86 opened this issue Apr 3, 2019 · 1 comment · Fixed by #2740
Closed

Client reconnects way too frequently #2739

ejona86 opened this issue Apr 3, 2019 · 1 comment · Fixed by #2740
Assignees

Comments

@ejona86
Copy link
Member

ejona86 commented Apr 3, 2019

On connection failure, I'm seeing the Go client reconnect .5-1s after the initial failure, with no backoff. This may only impact DNS+pick first.

I used a modified server that had no ALPN. I wanted to test Go's TLS behavior, and it should fail connecting. However, I noticed that the reconnection attempts were too frequent.

It looks like the connection failure is triggering a DNS refresh. Since there is no DNS rate limiting (hi #2402 !), it resolves quickly. It seems DNS notifies the "new" addresses, even though they are the same. tryUpdateAddrs fails because the connection is TRANSIENT_FAILURE and so acBalancerWrapper.UpdateAddresses tears down the ac and re-creates it, destroying the backoff.

GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info ./client --server_host_override=foo.test.google.fr --use_test_ca=true --server_port=8080 --use_tls
...
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, CONNECTING
WARNING: 2019/04/03 15:13:07 grpc: addrConn.createTransport failed to connect to {127.0.0.1:8080 0  <nil>}. Err :write tcp 127.0.0.1:41052->127.0.0.1:8080: write: broken pipe. Reconnecting...
WARNING: 2019/04/03 15:13:07 grpc: addrConn.createTransport failed to connect to {[::1]:8080 0  <nil>}. Err :tls: use of closed connection. Reconnecting...
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, TRANSIENT_FAILURE
INFO: 2019/04/03 15:13:07 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.localhost on 127.0.0.1:53: no such host.
INFO: 2019/04/03 15:13:07 grpc: failed dns TXT record lookup due to lookup _grpc_config.localhost on 127.0.0.1:53: no such host.
INFO: 2019/04/03 15:13:07 ccResolverWrapper: got new service config: 
INFO: 2019/04/03 15:13:07 ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:8080 0  <nil>} {[::1]:8080 0  <nil>}]
INFO: 2019/04/03 15:13:07 addrConn: tryUpdateAddrs curAddr: { 0  <nil>}, addrs: [{127.0.0.1:8080 0  <nil>} {[::1]:8080 0  <nil>}]
INFO: 2019/04/03 15:13:07 addrConn: tryUpdateAddrs curAddr: { 0  <nil>}, addrs: [{127.0.0.1:8080 0  <nil>} {[::1]:8080 0  <nil>}]
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, CONNECTING
INFO: 2019/04/03 15:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc00015e400, CONNECTING
@menghanl menghanl self-assigned this Apr 3, 2019
@menghanl menghanl added the P1 label Apr 3, 2019
@menghanl
Copy link
Contributor

menghanl commented Apr 3, 2019

With DNS and pickfirst:

  • connection fails, addrConn enters transient failure, and triggers re-resolve()
  • DNS re-resolve() doesn't see new addresses, but will send the same addresses to ClientConn
  • The address list is sent to pickfirst
  • pickfirst update SubConn's address with the new list (but content didn't change)
  • SubConn tryUpdateAddress() returns false (this is where things go wrong), and will tear down the old addrConn, create a new addrConn, thus reset backoff

@lock lock bot locked as resolved and limited conversation to collaborators Nov 26, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants