Transport is closing and transient failure errors since upgrading to 1.18 #2663
Comments
Can you also get server-side logs to see if they contain more information? The handshake deadline exceeded error looks very suspicious. It would help if we knew why it happened.
One more thing to try: can you set
@menghanl I have experienced the same issue here. Here is a reproduction procedure.
Example program
client:
server:
Results
With the above programs, I run the server with v1.19.0 and the client with v1.17, 1.18, and 1.19.
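The client and server programs themselves were not captured in this thread. As a rough illustration only, a minimal client along the lines described (assuming the standard grpc-go helloworld example service; the address and timing values are placeholders, not taken from the original program) might look like:

```
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

func main() {
	// Placeholder address; the original reproduction used its own service.
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	client := pb.NewGreeterClient(conn)

	// Call the server in a loop and log every failure so that reconnect
	// behaviour (and any backoff delay) after the server restarts is visible.
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := client.SayHello(ctx, &pb.HelloRequest{Name: "ping"})
		cancel()
		if err != nil {
			log.Printf("RPC failed: %v", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```

Restarting the server while such a client loops should make any difference in reconnect delay between grpc-go versions visible in the client log.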
This causes a performance regression for servers that provide a single subconn, like Google Spanner.
If it's okay to try creating the transport immediately after a transient failure for the first retry, how about adding code like this here?
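The snippet the comment refers to was not captured here. As a generic illustration of the idea only (retrying the connection immediately after the first transient failure and backing off only on later attempts), a sketch might look like the following; the function and parameter names are hypothetical and are not grpc-go internals:

```
package reconnect

import "time"

// reconnectLoop is an illustrative sketch only. It retries dialing
// immediately after the first failure and applies the backoff delay only
// from the second consecutive failure onwards.
func reconnectLoop(dial func() error, backoff func(retries int) time.Duration) {
	for retries := 0; ; retries++ {
		if err := dial(); err == nil {
			return // connected; a real implementation would also reset its backoff state here
		}
		if retries > 0 {
			// Wait only from the second attempt onwards; the first retry
			// after a transient failure happens without delay.
			time.Sleep(backoff(retries))
		}
	}
}
```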
I have a question.
I encountered the same issue using the Google Cloud Bigtable Go client (gRPC based). The Bigtable gRPC server seems to have a connection max age of one hour, which means that every hour I literally stop handling my traffic for one second. I agree with @vvakame that this default value is too high for a lot of use cases and should be easily configurable in the other Google clients based on this repo.
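For reference, the server-side knob being described maps to grpc-go's keepalive.ServerParameters. A minimal sketch of a server that forces clients to reconnect after roughly an hour (the values and address are illustrative, not Bigtable's actual configuration) would be:

```
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Ask clients to reconnect after about an hour, similar to the
			// behaviour described above for the Bigtable frontend.
			MaxConnectionAge:      time.Hour,
			MaxConnectionAgeGrace: 30 * time.Second,
		}),
	)
	// Register services here before serving.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```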
There's a bug in the reconnect logic: the backoff time is not reset after creating a connection. The fix is in #2669. Can you guys try it and see if it fixes the problems you are seeing? Thanks!
Closing. Please comment if you are still experiencing problems.
Does the fix in #2669 still require
I hope not, since I intend to remove that option soon (this or next week). Please let me know if you are experiencing any problems with it unset.
It still seems to be an issue for us. I reported it to Google Cloud support, which pointed me to this thread and suggested the workaround was to set
We are running a Golang Google App Engine flexible app with Go v1.11 and grpc-go v1.20.0. With Spanner, we see context timeouts:
With Cloud Tasks, we see these RPC errors:
These errors are not occurring consistently, only intermittently and in bursts.
How long do these bursts last and how often do they occur? Does setting
Currently, we do not have a production environment with any meaningful load, but in a trial production run that we ran for 3 days, we had one burst of errors for ~20 minutes. Otherwise, we see it less than once a day, for a minute.
I turned on gRPC logging on one of our deployments and it appears I get the
The reconnects seem to be happening immediately and I haven't experienced any RPC failures with
We do configure
It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.
Go HTTP server settings should not affect grpc-go - we have our own server implementation.
It seems to happen in bunches (~3/sec) every few minutes (1-4 min). This is on a GAE flexible instance with basically no load:
@mikesun - one last thing I can think of: make sure you aren't running a grpc-go version between #2740 (82fdf27 - 21d ago) and #2862 (a1d4c28 - 10d ago). We had a bug in that range that could potentially cause problems like this. If it's not that, then most likely the problem is not on the client side. Logs from the server/proxy side would be needed for further debugging. Also, since your problems are not related to this issue, could you file a new issue if you suspect a problem with grpc-go?
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this upgrade is to eliminate the disconnections caused by grpc/grpc-go#1882. These failures manifest themselves as the following set of errors:

```
ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE
```

which then lead to tripped breakers and general badness. I suspect that there are several other good bug fixes in here, including some purported leaks and correctness fixes on shutdown. I have verified that with this upgrade I no longer see connections break in overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should be subsumed by the status check above. The `transport` subpackage has not been around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer mentions that it contains a bug (grpc/grpc-go#2663 (comment)).

This change also unfortunately updates the keepalive behavior to be more spec compliant (grpc/grpc-go#2642). This change mandates a minimum ping time of 10s to the client. Given grpc/grpc-go#2638, this means that the rpc test for keepalives now takes over 20s. I would be okay skipping it but leave that discussion for review.

Also updated the acceptance test to look out for an HTTP/2.0 header because grpc now does not send RPCs until after the HTTP handshake has completed (see grpc/grpc-go#2406).

Release note (bug fix): Upgrade grpc library to fix connection state management bug.
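For context on the keepalive change mentioned above, client-side pings and the matching server enforcement policy are configured through grpc-go's keepalive package. The values below are illustrative examples, not CockroachDB's settings; the client-side Time is clamped to a 10-second minimum, which is the behaviour the PR description refers to.

```
package keepaliveexample

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive dials with client-side keepalive pings. grpc-go clamps
// Time to a minimum of 10 seconds.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping when the connection has been idle this long
			Timeout:             10 * time.Second, // close the connection if the ping is not acknowledged
			PermitWithoutStream: true,
		}),
	)
}

// newServer rejects clients that ping more often than MinTime; such clients
// receive a GOAWAY with "too_many_pings".
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             20 * time.Second,
			PermitWithoutStream: true,
		}),
	)
}
```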
Please answer these questions before submitting your issue.
What version of gRPC are you using?
1.18.0
What version of Go are you using (`go version`)?
1.11.5
What operating system (Linux, Windows, …) and version?
Alpine Linux 3.9 running on Google Kubernetes Engine
What did you do?
Since upgrading, I've noticed a large increase in grpc errors leading to failed requests and service disruptions.
We don't use any connection settings on the clients, nor do we have any special settings on the server (keepalives, fail-fast, etc.). The only thing we have on the connections is mutual TLS. It's roughly configured like this:
Client:
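The actual client configuration was not included above; a minimal sketch of a mutual-TLS grpc-go client along those lines (certificate paths and the target address are placeholders) might look like:

```
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Client certificate/key and the CA that signed the server certificate.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatalf("load client key pair: %v", err)
	}
	caPEM, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		log.Fatalf("read CA cert: %v", err)
	}
	caPool := x509.NewCertPool()
	if !caPool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to add CA cert to pool")
	}

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      caPool,
	})
	conn, err := grpc.Dial("example-service:443", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Create service clients from conn here.
}
```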
Server:
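Likewise, a sketch of a matching mutual-TLS server, again with placeholder paths:

```
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		log.Fatalf("load server key pair: %v", err)
	}
	caPEM, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		log.Fatalf("read CA cert: %v", err)
	}
	caPool := x509.NewCertPool()
	if !caPool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to add CA cert to pool")
	}

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientAuth:   tls.RequireAndVerifyClientCert, // mutual TLS: require a client certificate
		ClientCAs:    caPool,
	})
	lis, err := net.Listen("tcp", ":8443")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	srv := grpc.NewServer(grpc.Creds(creds))
	// Register services here.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```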
This may be related to one of these two other issues:
#2653
#2636
Based on the feedback in one of the above linked issues, I set these ENV vars on one service that had a lot of failures:
GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info
I added (__GRPC CALL HERE__) at the end of the log lines below where the error was logged as a failed grpc call that we made.