detect and close dead TCP connection #3206

Closed
mechpen opened this issue Nov 22, 2019 · 3 comments

@mechpen
mechpen commented Nov 22, 2019

What version of gRPC are you using?

v1.23.0

What version of Go are you using (go version)?

1.13

What operating system (Linux, Windows, …) and version?

ubuntu 18.04

What did you do?

A "dead" TCP connection is one on which no packets are received from the TCP peer. This can happen when the peer's kernel panics, or when packets from the peer are dropped by iptables (e.g. in Kubernetes, when a node is removed, some CNI plugins may start dropping all packets from that node).

By default, grpc-go cannot detect "dead" TCP connections. All RPC calls return "DEADLINE_EXCEEDED":

rpc error: code = DeadlineExceeded desc = context deadline exceeded

This error continues for about 15 minutes, until kernel TCP retransmission times out and closes the connection.

One solution is to enable gRPC "keepalive" pings, but this is not enabled by default.

To reproduce the problem, run the following command on a gRPC client host:

iptables -I INPUT -s <server-ip> -p tcp --sport <server-port> -j DROP

What did you expect to see?

gRPC should enable keepalive by default to detect dead TCP connections.

What did you see instead?

@gotwarlost

To add to @mechpen's comments, we have seen this behavior in various situations, all on Kubernetes.

  • gRPC client to an Envoy proxy (using Istio) where the Envoy upstream process is dead. This could be related to how the Envoy proxy deals with broken upstream connections.

  • gRPC client to services running on Kubernetes masters, as used by the kiam project, when the server is abruptly killed (example of symptoms)

/cc @kyessenov @mandarjog @duderino

@menghanl
Contributor

As mentioned in the original post, keepalive is the solution here.

To enable it, see the doc and the example.
The parameters and default values can also be found at the godoc.

There's no plan to change the default behavior for keepalive. We don't want to change default behavior unless there's a strong reason to. Please try enabling it and see if it solves all the problems.
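For readers who want a starting point, here is a minimal sketch of enabling client-side keepalive in grpc-go. The target address and the interval values are assumptions chosen for illustration, not recommendations from the maintainers; `grpc.WithInsecure()` matches the v1.23.0 API mentioned above, and newer releases prefer `grpc.WithTransportCredentials(insecure.NewCredentials())`.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Illustrative values only; tune them for your environment.
	kp := keepalive.ClientParameters{
		Time:                30 * time.Second, // send a ping after this much inactivity
		Timeout:             10 * time.Second, // close the connection if the ping is not acked in time
		PermitWithoutStream: true,             // also ping when there are no active RPCs
	}

	// "localhost:50051" is a placeholder target.
	conn, err := grpc.Dial(
		"localhost:50051",
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(kp),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// Create service clients from conn as usual. With these settings a dead
	// peer should be noticed roughly Time + Timeout after traffic stops.
}
```

Note that the server must tolerate the chosen ping rate. By default a grpc-go server expects at least 5 minutes between client pings and closes the connection with a GOAWAY if the client pings more often, so a server receiving pings this frequently would also need a matching `keepalive.EnforcementPolicy` (for example a lower `MinTime`, plus `PermitWithoutStream: true`) passed via `grpc.KeepaliveEnforcementPolicy`.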

@mechpen
Author

mechpen commented Nov 22, 2019

Yes, enabling keepalive does fix the issue.

Do you recommend enabling keepalive in general? If so, could you please add this to your guides, so that more gRPC users enable keepalive in their applications?
