Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testing: add tests covering races in server settings handshake #1772

Open
wladh opened this issue Dec 29, 2017 · 9 comments · Fixed by #1779
Open

testing: add tests covering races in server settings handshake #1772

wladh opened this issue Dec 29, 2017 · 9 comments · Fixed by #1779
Labels
Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. P3 Type: Testing

Comments

@wladh
Copy link

wladh commented Dec 29, 2017

Please answer these questions before submitting your issue.

What version of gRPC are you using?

HEAD as of today (65c901e)

What version of Go are you using (go version)?

go version go1.9.2 darwin/amd64

What operating system (Linux, Windows, …) and version?

Linux, macOS

What did you do?

Connecting to a server, passing grpc.WithWaitForHandshake() option.

What did you expect to see?

Connection succeeding and staying up.

What did you see instead?

Connection comes up, and data gets exchanged, but after 20 seconds the connection gets terminated with:

WARNING: 2017/12/29 00:23:01 grpc: addrConn.transportMonitor didn't get server preface after waiting. Closing the new transport now.

(GRPC_GO_LOG_SEVERITY_LEVEL was set to info)

The connection seems to be closed in addrConn.transportMonitor after the connection deadline passes, because ac.backoffDeadline is not zero. The reason why it isn't zero, although the preface has been received (and actually RPC calls were done over that connection) is because the code that sets/resets it is racy.
We seem to reset it in onPrefaceReceipt closure (in addrConn.createTransport) which gets passed to transport.NewClientTransport, but that closure is called from a different goroutine (see http2Client.reader). It might happen that this closure gets called before we set ac.backoffDeadline at the end of the loop in addrConn.createTransport, so ac.backoffDeadline will remain set even though the connection has succeeded.
This race is made a lot more likely by setting WithWaitForHandshake because in that case we wait for done to be closed, which means that we wait for onPrefaceReceipt to be called before we proceed further and set ac.backoffDeadline.

The issue was introduced in #1648.

@MakMukhi
Copy link
Contributor

MakMukhi commented Dec 29, 2017 via email

@pawelkowalak
Copy link

pawelkowalak commented Jan 3, 2018

A new version has been tagged (1.9.0) with this issue. Maybe it should be added to Known issues in release notes :)

@RichiH
Copy link

RichiH commented Jan 3, 2018

@fhibler - it's fixed.
@MakMukhi / @vIRu - thanks!

@wladh
Copy link
Author

wladh commented Jan 3, 2018

@RichiH I think you misunderstood @vIRu’s message. He meant that the version that was just released (1.9.0) is affected by this issue, not that it’s fixed in that version.

@RichiH
Copy link

RichiH commented Jan 3, 2018

@wladh Ah, thanks for clearing that up.

@dfawley
Copy link
Member

dfawley commented Jan 4, 2018

@MakMukhi is there any way to attempt to stimulate this race in a regression test? Why didn't our tests ever run into this problem?

@dfawley
Copy link
Member

dfawley commented Jan 4, 2018

I'm going to reopen this to track adding regression test cases as discussed offline:

  1. Use WithWaitForHandshake() in a case where the remote server never sends the preface, and ensure RPCs are never assigned to that connection (i.e. its state never becomes Ready). Without WithWaitForHandshake(), RPCs should be assigned to it (i.e. its state does become Ready after Accepting the connection).
  2. Add a sleep after the server handshake is received in TestDialWaitsForServerSettings, longer than the minConnectTimeout, and make sure the connection is still valid.

EDIT: 1 doesn't really need to be a separate case that TestDialWaitsForServerSettings isn't already covering. That one ensures Dial doesn't return until the handshake happens, which is essential the same thing.

@dfawley
Copy link
Member

dfawley commented Jan 8, 2018

FYI: v1.9.1 has been released with the fix to this issue.

@dfawley dfawley added P2 and removed P1 labels Jan 18, 2018
@menghanl menghanl changed the title Spurious reconnection testing: add tests covering races in server settings handshake Jun 13, 2018
@stale
Copy link

stale bot commented Sep 6, 2019

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@stale stale bot added the stale label Sep 6, 2019
@dfawley dfawley removed the stale label Sep 6, 2019
@dfawley dfawley added P3 and removed P2 labels May 3, 2021
@eshitachandwani eshitachandwani added the Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. label Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. P3 Type: Testing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants