net/http: TestServerConnState is flaky #32329
Comments
Maybe the test is wrong, but Dragonfly network tests in general are flaky (#29583 (comment)), so I'm not super eager to debug this until I've seen it fail on another OS.
Just as an FYI, this also flakes under gVisor (https://github.com/google/gvisor), but there the failure is pretty consistent: the states for connections 5 and 6 are swapped. This might be entirely due to differences between the gVisor network stack and the Linux one, but given that this test is flaky on other OSes as well, it probably depends on some Linux implementation detail. I will try to debug why this happens under gVisor, since we do want to be Linux-compliant.

--- FAIL: TestServerConnState (0.52s)
In the case of gVisor I believe I understand what's going on: connection 6 sometimes gets delivered and accepted before connection 5 (the strace logs below show accept4() returning connection 6 before 5). In gVisor this can happen because each connection runs in its own goroutine, and goroutine scheduling can reorder delivery, so even though the handshake for connection 5 completed first, it was delivered after 6. I believe this reordering is not possible on the Linux loopback, which is probably why the test is consistent there. The other failures may have a similar cause, with connections delivered and accepted in a different order; other OSes may differ slightly in how connections are completed and delivered, and an strace plus tcpdump of the test would help debug that. A better way to deflake this would be to break it up into separate tests, so that the order in which connections are accepted no longer matters.
I1209 21:51:32.865461 1 x:0] send tcp 127.0.0.1:26122 -> 127.0.0.1:27015 len:0 id:0000 flags:0x02 ( S ) seqnum: 4281293669 ack: 0 win: 65408 xsum:0x0 options: {MSS:65495 WS:7 TS:true TSVal:1110474215 TSEcr:0 SACKPermitted:true}
Change https://golang.org/cl/210618 mentions this issue:
Change https://golang.org/cl/210717 mentions this issue:
This approach attempts to ensure that the log for each connection is complete before the next sequence of states begins.

Updates #32329

Change-Id: I25150d3ceab6568af56a40d2b14b5f544dc87f61
Reviewed-on: https://go-review.googlesource.com/c/go/+/210717
Run-TryBot: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
@hbhasker, can you confirm whether this is still flaky under gVisor? (We can watch the Go builders for the other platforms.)
Will run the modified test and update.
Confirming that the test is not flaky under gVisor w/ the new changes. Thanks!
root@9fed391e8304:/go# dmesg |
Closing on the theory that the failures on the other platforms have a similar root cause as the one on gVisor. (We can open a new issue if it recurs.) |
dragonfly-amd64: https://build.golang.org/log/c79adae2958dc28dd62ca163328fd6dc526ea37f

CC @bradfitz