premature "transport is closing" when using keepalive #3171

Closed
pohly opened this issue Nov 11, 2019 · 4 comments
pohly commented Nov 11, 2019

What version of gRPC are you using?

Master = 51ac07f

What version of Go are you using (go version)?

go version go1.13.4 linux/amd64

What operating system (Linux, Windows, …) and version?

Linux 5.2.0-0.bpo.2-amd64 (Debian Buster)

What did you do?

When enabling keepalive on the client side and running that client against a server which has a very long-running operation (>200s), the client fails with "transport is closing". This can be avoided by not enabling keepalive on the client side.

The server ran without any special options. Adding KeepaliveParameters had no effect.

This can be reproduced by patching the helloworld example:

diff --git a/examples/helloworld/greeter_client/main.go b/examples/helloworld/greeter_client/main.go
index 0ca4cbaa..88f0f6bf 100644
--- a/examples/helloworld/greeter_client/main.go
+++ b/examples/helloworld/greeter_client/main.go
@@ -27,6 +27,7 @@ import (
 
 	"google.golang.org/grpc"
 	pb "google.golang.org/grpc/examples/helloworld/helloworld"
+	"google.golang.org/grpc/keepalive"
 )
 
 const (
@@ -36,7 +37,8 @@ const (
 
 func main() {
 	// Set up a connection to the server.
-	conn, err := grpc.Dial(address, grpc.WithInsecure(), grpc.WithBlock())
+	conn, err := grpc.Dial(address, grpc.WithInsecure(), grpc.WithBlock(),
+		grpc.WithKeepaliveParams(keepalive.ClientParameters{}))
 	if err != nil {
 		log.Fatalf("did not connect: %v", err)
 	}
@@ -48,7 +50,7 @@ func main() {
 	if len(os.Args) > 1 {
 		name = os.Args[1]
 	}
-	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
+	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Second)
 	defer cancel()
 	r, err := c.SayHello(ctx, &pb.HelloRequest{Name: name})
 	if err != nil {
diff --git a/examples/helloworld/greeter_server/main.go b/examples/helloworld/greeter_server/main.go
index eb2fa0ef..71a44d24 100644
--- a/examples/helloworld/greeter_server/main.go
+++ b/examples/helloworld/greeter_server/main.go
@@ -25,9 +25,11 @@ import (
 	"context"
 	"log"
 	"net"
+	"time"
 
 	"google.golang.org/grpc"
 	pb "google.golang.org/grpc/examples/helloworld/helloworld"
+	"google.golang.org/grpc/keepalive"
 )
 
 const (
@@ -42,6 +44,7 @@ type server struct {
 // SayHello implements helloworld.GreeterServer
 func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
 	log.Printf("Received: %v", in.GetName())
+	time.Sleep(100 * time.Second)
 	return &pb.HelloReply{Message: "Hello " + in.GetName()}, nil
 }
 
@@ -50,7 +53,7 @@ func main() {
 	if err != nil {
 		log.Fatalf("failed to listen: %v", err)
 	}
-	s := grpc.NewServer()
+	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{}))
 	pb.RegisterGreeterServer(s, &server{})
 	if err := s.Serve(lis); err != nil {
 		log.Fatalf("failed to serve: %v", err)
grpc-go/examples/helloworld$ go run ./greeter_server &
[1] 144035
grpc/grpc-go/examples/helloworld$ go run ./greeter_client
2019/11/11 11:40:35 Received: world
2019/11/11 11:41:15 could not greet: rpc error: code = Unavailable desc = transport is closing
exit status 1

What did you expect to see?

The client should have returned without error after 200s.

What did you see instead?

"transport is closing" error.

pohly commented Nov 11, 2019

https://godoc.org/google.golang.org/grpc/keepalive#ClientParameters warns "Make sure these parameters are set in coordination with the keepalive policy on the server, as incompatible settings can result in closing of connection."

Does that mean that merely the timeout parameters must match (and that the default ones are fine), or does it mean that the server also must enable https://godoc.org/google.golang.org/grpc#KeepaliveParams when the client does?

I think it is the former as the latter doesn't make a difference in practice (at least not for this problem), but the wording IMHO is ambiguous.
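
For context, grpc.KeepaliveParams and grpc.KeepaliveEnforcementPolicy are two separate server-side options: the former (keepalive.ServerParameters) controls the keepalive pings the server itself sends, while the latter (keepalive.EnforcementPolicy) controls which client pings the server tolerates. A minimal sketch, with values that mirror the documented defaults (for illustration only):

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	s := grpc.NewServer(
		// What the server itself does: send its own keepalive pings.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    2 * time.Hour,    // ping a client after this much idle time
			Timeout: 20 * time.Second, // wait this long for the ping ack before closing
		}),
		// What the server accepts from clients.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Minute, // minimum interval allowed between client pings
			PermitWithoutStream: false,           // reject keepalive pings when there are no active RPCs
		}),
	)
	_ = s
}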

easwars commented Nov 11, 2019

You need to set an EnforcementPolicy on the server side. The default EnforcementPolicy expects the client to send no more than one keepalive ping every 5 minutes, which is clearly violated in the example you have.

If you enable debugging on your server as mentioned here, you should see the following error:

transport: Got too many pings from the client, closing the connection
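
For reference, grpc-go's verbose transport logs can typically be turned on with environment variables, e.g.:

GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=99 go run ./greeter_server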

And once you set an enforcement policy on your server to match the expected rate of keepalive pings from your client, your stream should not be broken. HTH.

Thanks for the detailed steps for reproduction.

pohly commented Nov 12, 2019

I do indeed see transport: Got too many pings from the client, closing the connection, 30 seconds after the client has sent the echo request, which seems to be the sum of the default keepalive.ClientParameters.Time = 10s and keepalive.ClientParameters.Timeout = 20s.

This combination works:

// server:
s := grpc.NewServer(grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
	MinTime:             10 * time.Second,
	PermitWithoutStream: true,
}))

// client:
conn, err := grpc.Dial(address, grpc.WithInsecure(), grpc.WithBlock(),
	grpc.WithKeepaliveParams(keepalive.ClientParameters{}))

The default EnforcementPolicy expects the client to send no more than one keepalive ping every 5 minutes, which is clearly violated in the example you have.

I find it surprising that the defaults on client and server side don't match, i.e. that they lead to client behavior that the server doesn't accept. Perhaps it's worthwhile to enhance the warning in the ClientParameters documentation?

Other than that, this issue can be closed as the code works as intended.

pohly added a commit to pohly/csi-test that referenced this issue Nov 12, 2019
The gRPC defaults for a client do not work for a gRPC server that also
uses the defaults (grpc/grpc-go#3171) which
caused connection drops when a CSI driver is slow to respond and the
client starts sending keepalive pings too quickly
(kubernetes-csi#238).

The sanity package no longer uses gRPC keepalive. Instead, custom test
suites can enable that via some new configuration options if needed
and when it is certain that the CSI driver under test can handle it.
easwars commented Nov 12, 2019

The client and server keepalives are considered sufficiently different to deserve separate designs. You can find the corresponding proposals here:
Client
Server

And the reason you are seeing the transport being closed on the client side after 30 seconds is not the sum of the default keepalive.ClientParameters.Time = 10s and keepalive.ClientParameters.Timeout = 20s, but the algorithm detailed here.
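
For readers without the link, the gist is roughly this: the server counts keepalive pings that arrive sooner than EnforcementPolicy.MinTime after the previous ping, and closes the connection (GOAWAY with "too_many_pings") once a small number of such violations accumulate. A simplified model, not grpc-go's actual implementation, assuming two violations are tolerated (which is consistent with the ~30s observed above):

package main

import (
	"fmt"
	"time"
)

// timeUntilClose models the enforcement idea only: every ping that arrives
// sooner than minTime after the previous one counts as a violation, and the
// connection is closed once more than maxStrikes violations have accumulated.
func timeUntilClose(pingInterval, minTime time.Duration, maxStrikes int) time.Duration {
	strikes := 0
	last := time.Duration(0)
	for t := pingInterval; ; t += pingInterval {
		if t-last < minTime {
			strikes++
		}
		last = t
		if strikes > maxStrikes {
			return t // the GOAWAY ("too_many_pings") would be sent around here
		}
	}
}

func main() {
	// Client pings every 10s; the default server MinTime is 5 minutes.
	fmt.Println(timeUntilClose(10*time.Second, 5*time.Minute, 2)) // prints 30s
}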

@easwars easwars closed this as completed Nov 12, 2019
pohly added a commit to pohly/csi-test that referenced this issue Nov 14, 2019