-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: refactor flaky net-error-twice #4062
Conversation
Stress test without this fix: Stress test with this fix: |
The test was not reliably creating the error event on Windows 2012. This makes the test more robust by using setImmediate() to delay the server writing to the socket to give time for socket.destroy() to take effect. Fixes: nodejs#4057
84990f2
to
6b38a67
Compare
The stress test is going OK, but the CI still shows a failure for this test. I'd hate to have to delay the |
I added a CI: https://ci.nodejs.org/job/node-test-pull-request/880/ Stress test: https://ci.nodejs.org/job/node-stress-single-test/87/nodes=win2012r2/ |
// Windows-specific handling. See: | ||
// * https://github.com/nodejs/node/pull/4062 | ||
// * https://github.com/nodejs/node/issues/4057 | ||
setTimeout(() => { conn.write(buf); }, 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the delay necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnoordhuis Concise-but-naive answer: Without the delay, the test is flaky on Windows 2012 r2 on the CI infrastructure.
Less concise and marginally less naive: In this test, conn.write(buf)
is supposed to get an error (either EPIPE
or ECONNRESET
depending on timing). Unfortunately, on Windows 2012 r2 only, it appears that the call to UPDATE: Actually, I think the problem is that the client connection and server connection callbacks fire in an unpredictable order. On Windows 2012 r2 only, the client.destroy()
often returns before the socket is set to emit the error.conn.write()
would often not trigger an error and the result was assert.equal(errs.length, 1)
firing (because errs.length
is 0). I tried setImmediate()
and that may have improved things but it did not resolve the flakiness. Using the timeout resolved the flakiness. (I did not experiment with a timeout of less than 100 milliseconds.)
As for the cause of this on Windows 2012 r2, I don't know and would certainly be receptive to some guidance on figuring it out (or someone simply doing the work and identifying the source of the issue). Operating system limitation? Bug in libuv? Bug in socket.destroy
? I don't know. I don't have easy access to a Windows 2012 r2 machine. (I could try to run one via VirtualBox, but my laptop resources are kind of limited right now and my weak excuse is that it would probably fill up my disk.)
Bump again, but this time with an @ mention: @nodejs/testing This continues to fail a lot on Windows and I'd like to get this in as a fix. If this is not going to land, then a different PR can be submitted to at least mark the test as flaky. But I'd rather just see this land... I suppose the best path would be to identify why this is happening on Windows. As I mention in a previous comment, I don't know if this is an OS issue, libuv issue, |
@nodejs/platform-windows Can you investigate why the timeout fixes the test on Windows? |
|
This reproduces every time in a fast single-core VM. Slowing down the VM (CPU cap) solves the problem. Only on Windows 2012. This might be a problem with node and not the test. |
Maybe I was overthinking this. Is it possible that all that's happening is that sometimes the client connection callback is firing before the server connection callback and sometimes vice versa? So sometimes the socket is destroyed (which is what the test expects) and sometimes not? It seems that all we'd have to do to fix this is wrap the @joaocgreis If we change the |
Alas, |
OK, so back to the "why the timeout" question:
|
I now have access to a Windows machine and can iterate/test more efficiently. Ordering of callbacks is not the issue. Bummer. Starting to look/feel more and more like an issue with Node itself, as @joaocgreis suggested... |
Even just introducing a single |
Also interesting: If I remove |
This is getting weird: It appears that the culprit is adding the Any ideas as to why adding the cc @bengl since I roped him in on this in between sessions at Node Interactive... |
I think I found a solution that doesn't rely on a timeout, in #4342. Basically it waits until the server connection handler fires to destroy the client socket and write to the server socket. Doing it this way results in a single ECONNRESET every time on win2012r2. |
I'd still like to know why adding |
removing lts watch tag as nothing landed in here |
The test was not reliably creating the error event on Windows 2012.
This makes the test more robust by using setTimeout() to delay the
server writing to the socket to give time for socket.destroy() to take
effect.
Fixes: #4057