net: read from TCP socket with read deadline, under load, loses data on darwin/amd64 and Windows/amd64 #70395
Comments
I'll record notes on how to reproduce with my RPC library, rpc25519, while they are still fresh in mind, in case the above small reproducer does not suffice.
repo: https://github.com/glycerine/rpc25519
checkout at tag v1.1.42-alpha-repro-issue-70395 (should give a branch called repro-70395)
steps:
The number outside the square brackets gives the number of successful RPC calls before the hang and client death. The number inside the square brackets is the goroutine number, for correlation with stack traces. They are sorted in ascending order by call count, so the negatives (starved-out clients) come first. sample output
In the last log line above, two clients (serviced by server goroutines 36 and 99) have timed out so far. One client successfully completed 12948 roundtrips before hanging; the other got in only 3881 roundtrips before hanging. For others encountering this issue: my current workaround is to simply never use a timeout on a read, and to close the connection from another goroutine to terminate the read. Sub-optimal, but it works and alleviates the starvation of clients.
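A rough sketch of that workaround (the helper name and shape are assumed here, not the actual rpc25519 code): skip SetReadDeadline entirely and let a watcher goroutine close the connection when the caller gives up, which forces the blocked read to return.

```go
package workaround

import (
	"context"
	"io"
	"net"
)

// readWithCancel avoids read deadlines entirely. A watcher goroutine closes
// the connection when ctx is cancelled, which forces the blocked io.ReadFull
// to return (typically with net.ErrClosed). The connection is unusable
// afterwards, which is why this approach is sub-optimal.
func readWithCancel(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			conn.Close() // unblocks the read below
		case <-done:
		}
	}()
	return io.ReadFull(conn, buf)
}
```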
I can also see data loss on Windows/amd64, using the first, short reproducer. I do not see a problem with the second reproducer under Windows. This also suggests that they may be getting at distinct issues.
The short reproducer might(?) be getting at the same issue as #67748.
I did some further debugging. It convinced me that the two reproducers above are indeed getting at the same issue. I can observe a faulty read return data that is 8 bytes (commonly) or 12 bytes (less commonly) further down in the TCP receive buffer than it should be. See the three sample output runs at the end of the following code. It is a small variation of the first reproducer. I made two changes: a) I shrank the message buffer buff to 16 bytes; b) I wrote random bytes into buff[8:16], and these give us a fingerprint of the origin of the faulty read bytes. The fingerprint (and the matching goroutine number) lets us conclude that the faulty read returned data from 8 or 12 bytes further into the underlying TCP buffer than it should have. Run on go version go1.23.3 darwin/amd64:
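The modified reproducer is not reproduced here; as a rough sketch of the fingerprinting idea only (details assumed, not the exact repro code), the write side records the random tail bytes of each message so that unexpected bytes seen by the reader can be traced back to a specific message and offset in the stream:

```go
package fingerprint

import (
	"crypto/rand"
	"encoding/binary"
	"net"
)

// writeFingerprinted streams count 16-byte messages over conn: a sequence
// number in bytes [0:8] and random "fingerprint" bytes in [8:16]. The
// returned map records each fingerprint so that, when a faulty read turns
// up unexpected bytes, they can be matched to a specific message and offset
// in the TCP stream. Sketch only; details assumed, not the exact repro code.
func writeFingerprinted(conn net.Conn, count uint64) (map[uint64][8]byte, error) {
	fingerprints := make(map[uint64][8]byte)
	buff := make([]byte, 16)
	for seq := uint64(1); seq <= count; seq++ {
		binary.BigEndian.PutUint64(buff[:8], seq)
		var fp [8]byte
		if _, err := rand.Read(fp[:]); err != nil {
			return nil, err
		}
		copy(buff[8:16], fp[:])
		fingerprints[seq] = fp
		if _, err := conn.Write(buff); err != nil {
			return nil, err
		}
	}
	return fingerprints, nil
}
```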
This was my bug. In my readFull(), when a timeout occurs, a partial read could also have occurred; I wasn't accounting for that situation. Thanks to Steven Hartland for pointing this out on the golang-nuts list.
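A minimal sketch of the kind of fix involved (assuming a readFull helper over a net.Conn; not the actual rpc25519 code): the byte count of a partial read must be kept even when the read returns a timeout error, and the caller must resume from that offset.

```go
package rpcfix

import "net"

// readFull reads into buf from conn and reports how many bytes were actually
// read, so that a read-deadline (timeout) error never silently discards a
// partial read.
func readFull(conn net.Conn, buf []byte) (n int, err error) {
	for n < len(buf) && err == nil {
		var m int
		m, err = conn.Read(buf[n:])
		n += m // count partial reads, even when err != nil
	}
	// On a timeout, buf[:n] already holds valid stream data; the caller must
	// resume reading at offset n rather than restarting the message, or those
	// n bytes are effectively lost (the symptom reported in this issue).
	return n, err
}
```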
Thanks for following up. A good example of community debugging.
Go version
go 1.23.3 on darwin/amd64 (edit: and Windows/amd64; see comment below)
Output of go env in your module/workspace:
What did you do?
On macOS Sonoma 14.0, amd64 (Intel) architecture, I observe occasional TCP data loss when reading from a net.Conn with a read deadline.

I recorded packets and can see their delivery to the client (at the OS level) in Wireshark, but the Go client would not receive that data. The expected data was the response to an RPC call, so my client would at times time out waiting for the call response. This caused me to investigate. I've attempted to boil it down to a minimal reproducer, shown below. This was only observable under load; I needed to run at least 12 clients on my 4-core Mac to start getting losses. A larger value of GOMAXPROCS (say 10 * cores) makes it happen faster.

If need be, I can also provide a reproduction in the original RPC library at a particular historical point, but the commands to reproduce and the interpretation are more complex. I'm hoping the below will suffice; let me know if it does not.

I did not observe the same behavior on Linux, and I did not detect it under quic-go. So it seems to be specific to macOS and TCP sockets.
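The minimal reproducer code itself is not reproduced here; as a rough sketch of the pattern it exercises (single client shown, names and details assumed, whereas the real test ran a dozen or more concurrent clients under load): a server streams incrementing 8-byte counters and a client reads them under a short read deadline, retrying on timeout. Note that the inner loop keeps any partial read that arrives along with a timeout, which turns out to be the crucial detail later in the thread.

```go
package main

import (
	"encoding/binary"
	"errors"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	// Server: stream an incrementing 8-byte big-endian counter forever.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		buf := make([]byte, 8)
		for i := uint64(1); ; i++ {
			binary.BigEndian.PutUint64(buf, i)
			if _, err := c.Write(buf); err != nil {
				return
			}
		}
	}()

	// Client: read each counter under a short read deadline, retrying on
	// timeout, and verify the sequence never skips or reads back as 0.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 8)
	for want := uint64(1); ; want++ {
		off := 0
		for off < len(buf) {
			conn.SetReadDeadline(time.Now().Add(10 * time.Millisecond))
			n, err := conn.Read(buf[off:])
			off += n // a timed-out Read may still return a partial read; keep it
			if err != nil {
				var ne net.Error
				if errors.As(err, &ne) && ne.Timeout() {
					continue
				}
				log.Fatal(err)
			}
		}
		if got := binary.BigEndian.Uint64(buf); got != want {
			log.Fatalf("data loss: want %d, got %d", want, got)
		}
	}
}
```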
What did you see happen?
See above. It takes a variable amount of time to manifest. Sometimes 5 seconds, sometimes over a minute. Example output 4 from above:
What did you expect to see?
A continuous read of the expected integers (never 0) in the first 8 bytes, returned without error.