-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vendor: upgrade grpc from 1.13.0 to 1.21.2 #39041
vendor: upgrade grpc from 1.13.0 to 1.21.2 #39041
Conversation
a0a9752
to
60225a3
Compare
Grr the following PR grpc/grpc-go#2642 breaks This test however under stress hit grpc/grpc-go#2818 which is only fixed in 1.22 which is unfortunately subject to grpc/grpc-go#2862. In summary there's not published release which is thought to be bug free. I'm tempted to pull master as it purports to contain bug fixes for all of these bugs and then to watch for the next release. Thoughts? |
I don't feel particularly great about pulling master. What other bugs might be lurking? What's the version released before 2642? |
1.18.1, I've verified that it passes, though seems somewhat flakey under stress on a different test. Digging in to that now. Master also seems to fail the test unreliably due to what appears to be requests waiting on writeQuota. I can't tell if it's working as expected but it feels like a bug. The connection should be closed. I'll dig a little deeper and potentially file a grpc bug. |
60225a3
to
d213a78
Compare
I saw this after 4 minutes of stress:
Digging. Otherwise this feels a bit better. Will also verify the disconnection behavior. |
931302d
to
7e0b75e
Compare
7e0b75e
to
6efb47a
Compare
There's no winning here. 1.18.1 contains grpc/grpc-go#2692 which causes the flakiness in TestStatsHandlerWithHeartbeats and also would just totally in general muck with our stats. I think I'm going to go back to trying to pick up 1.21.1 and adjust the tests appropriately. |
6efb47a
to
8bd17e4
Compare
Okay this is updated to use 1.21.1 and roughly RFAL. This bump brings us to really long gRPC keepalives. It will mean that detecting network partitions from the client to the server will take ~3x longer but that's probably okay. The more annoying thing is that the test now takes much longer. The test also needed to be updated because it used to rely on having enough quota in flow control to send without blocking. It passes now reliably but I added a second commit to skip it because of its crazy long duration. Another issue this brings up are that the stats in master (since 1.17) are totally broken (see grpc/grpc-go#2932). This process has been rather frustrating. I'm wondering if it's worth having a fork of grpc for little things like this. I can try to get this change backported but the gRPC maintainers seem to have a tendency to let PRs sit for a long time. |
7130a95
to
4817f3e
Compare
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this upgrade is to eliminate the disconnections caused by grpc/grpc-go#1882. These failures manifest themselves as the following set of errors: ``` ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE ``` Which then lead to tripped breakers and general badness. I suspect that there are several other good bug fixes in here, including some purported leaks and correctness fixes on shutdown. I have verified that with this upgrade I no longer see connections break in overload scenarios which reliably reproduced the situation in the above log. This commit removes one condition from grpcutil.IsClosedConnection which should be subsumed by the status check above. The `transport` subpackage has not been around for many releases. This does not upgrade to the current release 1.22.0 because the maintainer mentions that it contains a bug (grpc/grpc-go#2663 (comment)). This change also unfortunately updates the keepalive behavior to be more spec compliant (grpc/grpc-go#2642). This change mandates a minimum ping time of 10s to the client. Given grpc/grpc-go#2638 this means that the rpc test for keepalives now takes over 20s. I would be okay skipping it but leave that discussion for review. Also updated the acceptance test to look out for an HTTP/2.0 header because grpc now does not send RPCs until after the HTTP handshake has completed (see grpc/grpc-go#2406). Release note (bug fix): Upgrade grpc library to fix connection state management bug.
This test now takes 21 seconds which is too long. Release note: None
4817f3e
to
03aac64
Compare
Okay, good news, I got grpc/grpc-go#2932 merged, backported, and released as 1.21.2. I also defeated the linter. This is now truly RFAL. @jordanlewis if you want to pass this off to another reviewer feel free. |
@bdarnell or @asubiotto interested in having a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like it closes #31361, right?
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
bors r+ |
39041: vendor: upgrade grpc from 1.13.0 to 1.21.2 r=ajwerner a=ajwerner This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this upgrade is to eliminate the disconnections caused by grpc/grpc-go#1882. These failures manifest themselves as the following set of errors: ``` ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE ``` Which then lead to tripped breakers and general badness. I suspect that there are several other good bug fixes in here, including some purported leaks and correctness fixes on shutdown. I have verified that with this upgrade I no longer see connections break in overload scenarios which reliably reproduced the situation in the above log. This commit removes one condition from grpcutil.IsClosedConnection which should be subsumed by the status check above. The `transport` subpackage has not been around for many releases. This does not upgrade to the current release 1.22.0 because the maintainer mentions that it contains a bug (grpc/grpc-go#2663 (comment)). Release note (bug fix): Upgrade grpc library to fix connection state management bug. Co-authored-by: Andrew Werner <[email protected]>
Build succeeded |
A test failure has been observed with the following error: ``` 503 Service Unavailable, content-type: application/json, body: { "error": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"", "message": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"", "code": 14, "details": [ ] }, error: <nil> ``` That error makes some sense if the DefaultClass connection has failed for some reason. In the fullness of time we should get to the bottom of why these gRPC connections still close when overloaded. We had hoped that cockroachdb#39041 would be the end of that but unfortunately it still seems to happen sometimes. That being said, the situation resolves itself rapidly when the load stops. This PR adds a retry loop to make the system robust to these transient failures. It's also possible that we should run timeseries queries as SystemClass operations but I'll also leave that for another change. Release note: None
A test failure has been observed with the following error: ``` 503 Service Unavailable, content-type: application/json, body: { "error": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"", "message": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"", "code": 14, "details": [ ] }, error: <nil> ``` That error makes some sense if the DefaultClass connection has failed for some reason. In the fullness of time we should get to the bottom of why these gRPC connections still close when overloaded. We had hoped that cockroachdb#39041 would be the end of that but unfortunately it still seems to happen sometimes. That being said, the situation resolves itself rapidly when the load stops. This PR adds a retry loop to make the system robust to these transient failures. It's also possible that we should run timeseries queries as SystemClass operations but I'll also leave that for another change. Release note: None
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:
Which then lead to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.
I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.
This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The
transport
subpackage has not beenaround for many releases.
This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).
Release note (bug fix): Upgrade grpc library to fix connection state management
bug.