Client connection phase should optionally wait for SETTINGS frame and set deadlines #1444

Closed
gyuho opened this issue Aug 16, 2017 · 12 comments · Fixed by #1648
Labels: P1 · Type: Feature (New features or improvements in behavior)

gyuho commented Aug 16, 2017

What version of gRPC are you using?

Master branch as of today (bfaf042).

What version of Go are you using (go version)?

go version go1.9rc1 darwin/amd64

What operating system (Linux, Windows, …) and version?

MacOS

What did you do?

cf. etcd-io/etcd#8258

What did you expect to see?

We want to use keepalive for HTTP/2 ping health checking. We expect the client to switch to another endpoint when one endpoint times out on keepalive.

What did you see instead?

The keepalive time-out triggers an address connection state update to TransientFailure, and resetTransport retries this same endpoint: the balancer keeps calling Up on the timed-out endpoint. If the endpoint never comes back, the balancer gets stuck retrying it.

Is there any other way to stop those retries on the timed-out endpoint and try other endpoints? We have our own balancer interface implementation, but the keepalive time-out error is not distinguishable on the client side, so there is not much we can do.

Here's the code path for reference:

  1. Configure grpc.Balancer(ep1,ep2) with a 1-second keepalive (see the configuration sketch after this list)
  2. Blackhole(ep1)
  3. keepalive(ep1) times out in 1-second, which is expected
  4. grpc-go/transport/http2_client.go/*http2Client calls (*http2Client).Close on ep1
    • ep1 has transportState reachable at the moment
    • close(t.errorChan)
  5. Signal <-t.Error() on grpc-go/clientconn.go/(*addrConn).transportMonitor()
    • ep1 *addrConn.(connectivity.State) is connectivity.Ready
    • ep1 *addrConn.(connectivity.State) is set to connectivity.TransientFailure
  6. resetTransport(drain=false) on ep1
    • Calls ep1's down with grpc: failed with network I/O error
  7. resetTransport(drain=false) retries on ep1 unless *addrConn.(connectivity.State) != connectivity.Shutdown
    • for retries := 0; ; retries++ {
  8. Still ep1's *addrConn.(connectivity.State) == connectivity.TransientFailure
  9. Thus, the retry loop keeps calling ac.cc.dopts.balancer.Up(ep1)
  10. Now the client is stuck on the blackholed ep1
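
For reference, step 1 looks roughly like the sketch below. The address is a placeholder and the custom balancer wiring is elided; only the 1-second keepalive mirrors our setup.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// ep1 is the endpoint that later gets blackholed in the repro above.
	conn, err := grpc.Dial(
		"ep1.example.com:2379", // placeholder address
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                1 * time.Second, // send an HTTP/2 ping after a second of inactivity
			Timeout:             1 * time.Second, // close the transport if the ping is not acked in time
			PermitWithoutStream: true,            // ping even when there are no active RPCs
		}),
		// The custom grpc.Balancer over (ep1, ep2) would be registered here
		// via grpc.WithBalancer(...); its implementation is elided.
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```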

Thanks.


gyuho commented Aug 17, 2017

We will see if we can temporarily mask the timed-out endpoint in the application layer.

gyuho closed this as completed Aug 17, 2017

dfawley commented Aug 21, 2017

It looks like clients are only waiting for a connection to be made and for the client preface and a settings frame to be sent to the server -- never waiting for the server to send a valid settings frame back -- before attempting to use the connection. It may make sense to wait for that settings frame before using the connection. We'd need to do this through a DialOption, because this will break cmux users that don't have the "workaround for java" in place.

Further, we noticed there are no deadlines on the reads/writes happening during connection initialization, which is problematic -- we should set these to the deadline of the context during this phase.
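
Roughly what I mean, as a standalone sketch rather than actual grpc-go code (the helper name and framer usage here are illustrative only):

```go
package sketch

import (
	"context"
	"fmt"
	"net"
	"time"

	"golang.org/x/net/http2"
)

// waitForServerSettings bounds the handshake I/O by the dial context's
// deadline and refuses to treat the connection as ready until the server's
// SETTINGS frame has been read.
func waitForServerSettings(ctx context.Context, conn net.Conn) error {
	// Apply the context deadline to all handshake reads and writes.
	if deadline, ok := ctx.Deadline(); ok {
		conn.SetDeadline(deadline)
		defer conn.SetDeadline(time.Time{}) // clear it once the handshake is done
	}

	// Send the client preface and our own (empty) SETTINGS frame.
	if _, err := conn.Write([]byte(http2.ClientPreface)); err != nil {
		return err
	}
	framer := http2.NewFramer(conn, conn)
	if err := framer.WriteSettings(); err != nil {
		return err
	}

	// A compliant server must answer with a SETTINGS frame; block here
	// until it arrives (or the deadline fires) before using the connection.
	frame, err := framer.ReadFrame()
	if err != nil {
		return err
	}
	if _, ok := frame.(*http2.SettingsFrame); !ok {
		return fmt.Errorf("expected SETTINGS from server, got %T", frame)
	}
	return nil
}
```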

dfawley reopened this Aug 21, 2017
dfawley changed the title from "keepalive time-out for address shut down" to "Client connection phase should optionally wait for SETTINGS frame and set deadlines" Aug 21, 2017
dfawley added the enhancement, P1, and Type: Feature labels and removed the Type: Enhancement label Aug 24, 2017

dfawley commented Sep 19, 2017

cc @vtubati

The changes needed to implement this are not significant, but we have higher-priority things in flight right now. I expect this to be done within a month.


tsuna commented Oct 18, 2017

Any update on this bug? We regularly run into crazy busy-loop situations, and I have to manually patch transportMonitor() in our vendored code to add a time.Sleep(1 * time.Second) at the end of its endless for loop as a poor man's solution to get gRPC to chill out instead of going crazy.
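
For reference, the shape of that patch (illustration only; monitorLoop here is just a stand-in for the internal transportMonitor, whose reconnect logic is elided):

```go
package sketch

import "time"

// monitorLoop stands in for grpc-go's internal (*addrConn).transportMonitor;
// the real reconnect/monitor logic is elided. The only change in our vendored
// copy is the sleep at the bottom of each iteration.
func monitorLoop(reconnect func() error) {
	for {
		_ = reconnect() // existing reconnect/monitor logic (elided)

		// Poor man's rate limit so a dead endpoint can't drive a hot loop.
		time.Sleep(1 * time.Second)
	}
}
```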


dfawley commented Oct 19, 2017

Thanks for the ping. We should hopefully be able to have this done by the end of next week.


dfawley commented Nov 2, 2017

We haven't made much progress on this, but it's at the top of our priority list. Also, we have a slightly different plan:

  1. As before, add a DialOption to block in the client until a settings frame is received from the server before sending RPCs on the connection.

  2. The new part: even if the option is not set, do not consider the connection "good" (and consequently reset the backoff timer) until a settings frame is received from the server. We would still proceed to send RPCs on this connection immediately, as we do today. But if a failure occurs before a settings frame is received, we will resume connecting to alternate backends and back off with the same deadline as if the initial server had never connected at all.

@anshupitlia

Any update on this?


tsuna commented Nov 30, 2017

Any update please? This is a problem with the client busy-looping when connecting to a TCP reverse proxy like haproxy that accepts the connection and then has no choice but to close it when no backend is healthy.

@MakMukhi

This should be in this week. Sorry for the delay; I got distracted by something else.


dfawley commented Nov 30, 2017

Note: this is PR #1648 if you are curious.


tsuna commented Dec 11, 2017

Just to be clear (and for the casual pedestrian stumbling on this issue and seeing it closed): the issue isn't actually fixed unless we use the new DialOption called WithWaitForHandshake(), right?


dfawley commented Dec 11, 2017

If I understand your concerns correctly, then I believe it should be fixed for everyone. We will not consider a connection "successful" (from a backoff perspective) if the server never sent the HTTP/2 preface to the client.

The option is there to prevent RPCs from being assigned to the channel until after the handshake has been received. This can be set if you want extra-stable behavior so RPCs don't fail due to a connection that fails in this way.
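
For example (a usage sketch with a placeholder target, not copied from the repo):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial(
		"backend.example.com:50051", // placeholder target
		grpc.WithInsecure(),
		// Opt in to the stricter behavior: do not assign RPCs to this
		// connection until the server's SETTINGS frame has been received.
		grpc.WithWaitForHandshake(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```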

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018