Client connection phase should optionally wait for SETTINGS frame and set deadlines #1444

Closed
gyuho opened this issue Aug 16, 2017 · 12 comments · Fixed by #1648
Labels: P1 · Type: Feature (New features or improvements in behavior)

gyuho commented Aug 16, 2017

What version of gRPC are you using?

Master branch as of today (bfaf042).

What version of Go are you using (go version)?

go version go1.9rc1 darwin/amd64

What operating system (Linux, Windows, …) and version?

MacOS

What did you do?

cf. etcd-io/etcd#8258

What did you expect to see?

We want to use keepalive for HTTP/2 ping health checking. We expect the client to switch to another endpoint when one endpoint times out on keepalive.

What did you see instead?

The keepalive time-out triggers an address connection state update to TransientFailure, and resetTransport retries this same endpoint: the balancer keeps calling Up on the timed-out endpoint. If the endpoint never comes back, the balancer gets stuck retrying it.

Is there any other way to stop those retries on the timed-out endpoint and try other endpoints? We have our own balancer interface implementation, but the keepalive time-out error is not distinguishable on the client side, so there is not much we can do.

Here's the code path for reference:

  1. Configure grpc.Balancer(ep1,ep2) with a 1-second keepalive (see the configuration sketch after this list)
  2. Blackhole(ep1)
  3. keepalive(ep1) times out in 1-second, which is expected
  4. grpc-go/transport/http2_client.go/*http2Client calls (*http2Client).Close on ep1
    • ep1 has transportState reachable at the moment
    • close(t.errorChan)
  5. Signal <-t.Error() on grpc-go/clientconn.go/(*addrConn).transportMonitor()
    • ep1 *addrConn.(connectivity.State) is connectivity.Ready
    • ep1 *addrConn.(connectivity.State) is set to connectivity.TransientFailure
  6. resetTransport(drain=false) on ep1
    • Calls ep1's down with grpc: failed with network I/O error
  7. resetTransport(drain=false) retries on ep1 unless *addrConn.(connectivity.State) != connectivity.Shutdown
    • for retries := 0; ; retries++ {
  8. Still ep1's *addrConn.(connectivity.State) == connectivity.TransientFailure
  9. Thus, the retry loop keeps calling ac.cc.dopts.balancer.Up(ep1)
  10. Now the client is stuck on the blackholed ep1
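
For reference, step 1 looks roughly like the sketch below. The address is a placeholder and the custom balancer wiring is elided; only the 1-second keepalive mirrors our setup.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// ep1 is the endpoint that later gets blackholed in the repro above.
	conn, err := grpc.Dial(
		"ep1.example.com:2379", // placeholder address
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                1 * time.Second, // send an HTTP/2 ping after a second of inactivity
			Timeout:             1 * time.Second, // close the transport if the ping is not acked in time
			PermitWithoutStream: true,            // ping even when there are no active RPCs
		}),
		// The custom grpc.Balancer over (ep1, ep2) would be registered here
		// via grpc.WithBalancer(...); its implementation is elided.
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```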

Thanks.


gyuho commented Aug 17, 2017

We will see if we can temporarily mask the timed-out endpoint in the application layer.

gyuho closed this as completed Aug 17, 2017

dfawley commented Aug 21, 2017

It looks like clients are only waiting for a connection to be made and for the client preface and a settings frame to be sent to the server -- never waiting for the server to send a valid settings frame back -- before attempting to use the connection. It may make sense to wait for that settings frame before using the connection. We'd need to do this through a DialOption, because this will break cmux users that don't have the "workaround for java" in place.

Further, we noticed there are no deadlines on the reads/writes happening during connection initialization, which is problematic -- we should set these to the deadline of the context during this phase.
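
Roughly what I mean, as a standalone sketch rather than actual grpc-go code (the helper name and framer usage here are illustrative only):

```go
package sketch

import (
	"context"
	"fmt"
	"net"
	"time"

	"golang.org/x/net/http2"
)

// waitForServerSettings bounds the handshake I/O by the dial context's
// deadline and refuses to treat the connection as ready until the server's
// SETTINGS frame has been read.
func waitForServerSettings(ctx context.Context, conn net.Conn) error {
	// Apply the context deadline to all handshake reads and writes.
	if deadline, ok := ctx.Deadline(); ok {
		conn.SetDeadline(deadline)
		defer conn.SetDeadline(time.Time{}) // clear it once the handshake is done
	}

	// Send the client preface and our own (empty) SETTINGS frame.
	if _, err := conn.Write([]byte(http2.ClientPreface)); err != nil {
		return err
	}
	framer := http2.NewFramer(conn, conn)
	if err := framer.WriteSettings(); err != nil {
		return err
	}

	// A compliant server must answer with a SETTINGS frame; block here
	// until it arrives (or the deadline fires) before using the connection.
	frame, err := framer.ReadFrame()
	if err != nil {
		return err
	}
	if _, ok := frame.(*http2.SettingsFrame); !ok {
		return fmt.Errorf("expected SETTINGS from server, got %T", frame)
	}
	return nil
}
```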

dfawley reopened this Aug 21, 2017
dfawley changed the title from "keepalive time-out for address shut down" to "Client connection phase should optionally wait for SETTINGS frame and set deadlines" Aug 21, 2017
dfawley added the enhancement, P1, and Type: Feature labels and removed the Type: Enhancement label Aug 24, 2017

dfawley commented Sep 19, 2017

cc @vtubati

The changes needed to implement this are not significant, but we have higher-priority things in flight right now. I expect this to be done within a month.


tsuna commented Oct 18, 2017

Any update on this bug? We regularly run into crazy busy-loop situations, and I have to manually patch transportMonitor() in our vendored code to add a time.Sleep(1 * time.Second) at the end of its endless for loop as a poor man's solution to get gRPC to chill out instead of going crazy.
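
For reference, the shape of that patch (illustration only; monitorLoop here is just a stand-in for the internal transportMonitor, whose reconnect logic is elided):

```go
package sketch

import "time"

// monitorLoop stands in for grpc-go's internal (*addrConn).transportMonitor;
// the real reconnect/monitor logic is elided. The only change in our vendored
// copy is the sleep at the bottom of each iteration.
func monitorLoop(reconnect func() error) {
	for {
		_ = reconnect() // existing reconnect/monitor logic (elided)

		// Poor man's rate limit so a dead endpoint can't drive a hot loop.
		time.Sleep(1 * time.Second)
	}
}
```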


dfawley commented Oct 19, 2017

Thanks for the ping. We should hopefully be able to have this done by the end of next week.


dfawley commented Nov 2, 2017

We haven't made much progress on this, but it's at the top of our priority list. Also, we have a slightly different plan:

  1. As before, add a DialOption to block in the client until a settings frame is received from the server before sending RPCs on the connection.

  2. The new part: even if the option is not set, do not consider the connection "good" (and consequently reset the backoff timer) until a settings frame is received from the server. We would still proceed to send RPCs on this connection immediately, as we do today. But if a failure occurs before a settings frame is received, we will resume connecting to alternate backends and back off with the same deadline as if the initial server had never connected at all.

@anshupitlia

Any update on this?


tsuna commented Nov 30, 2017

Any update please? This is a problem with the client busy-looping when connecting to a TCP reverse proxy like haproxy that accepts the connection and then has no choice but to close it when no backend is healthy.

@MakMukhi

This should be in this week. Sorry for the delay; I got distracted by something else.


dfawley commented Nov 30, 2017

Note: this is PR #1648 if you are curious.


tsuna commented Dec 11, 2017

Just to be clear (and for the casual pedestrian stumbling on this issue and seeing it closed): the issue isn't actually fixed unless we use the new DialOption called WithWaitForHandshake(), right?


dfawley commented Dec 11, 2017

If I understand your concerns correctly, then I believe it should be fixed for everyone. We will not consider a connection "successful" (from a backoff perspective) if the server never sent the HTTP/2 preface to the client.

The option is there to prevent RPCs from being assigned to the channel until after the handshake has been received. This can be set if you want extra-stable behavior so RPCs don't fail due to a connection that fails in this way.
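
For example (a usage sketch with a placeholder target, not copied from the repo):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial(
		"backend.example.com:50051", // placeholder target
		grpc.WithInsecure(),
		// Opt in to the stricter behavior: do not assign RPCs to this
		// connection until the server's SETTINGS frame has been received.
		grpc.WithWaitForHandshake(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```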

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018