Happy eyeballs implementation times out prematurely #54359
I'm unable to reproduce: $ NODE_DEBUG=net node -e 'fetch ("https://ds.sydney.test-ipv6.com/ip/?callback=?&testdomain=test-ipv6.com&testname=test_ds");'
NET 86651: pipe false undefined
NET 86651: connect: find host ds.sydney.test-ipv6.com
NET 86651: connect: dns options { family: undefined, hints: 32 }
NET 86651: connect: autodetecting
NET 86651: _read - n 16384 isConnecting? true hasHandle? true
NET 86651: _read wait for connection
NET 86651: connect/multiple: will try the following addresses [
{ address: '45.79.238.217', family: 4 },
{ address: '2400:8907::f03c:94ff:fed0:3fc3', family: 6 }
]
NET 86651: connect/multiple: attempting to connect to 45.79.238.217:443 (addressType: 4)
NET 86651: connect/multiple: setting the attempt timeout to 250 ms
NET 86651: connect/multiple: connection attempt to 45.79.238.217:443 completed with status 0
NET 86651: afterConnect
NET 86651: _read - n 16384 isConnecting? false hasHandle? true
NET 86651: Socket._handle.readStart
NET 86651: _read - n 16384 isConnecting? false hasHandle? true

Note: I only have IPv4.
Try a server further away so you can't complete the IPv4 connection in 250ms. You can also try saturating your connection with a speed test to artificially increase your latency. Some possible alternative servers to test with: https://test-ipv6.com/mirrors.html.en_US
I created a test server that should reproduce this reliably by artificially delaying IPv4 connections. Try the following:
cc @ShogunPanda
@notr1ch You can use --network-family-autoselection-attempt-timeout or net.setDefaultAutoSelectFamilyAttemptTimeout to adjust the default value.
Yes, increasing the timeout or using
Unfortunately we can't do that due to some limitation in our internal architecture. |
No modern language runtime should abandon a connection after 250ms and fail in this fashion. A quick search of past closed issues and Google shows many cases of users running into this bug in real-world situations. If you can't implement proper happy eyeballs then the default timeout needs adjusting to match real-world network conditions - I would suggest a minimum of 4-5 seconds to allow for a TCP SYN to be retried to account for packet loss.
I think that is not going to work. Five seconds before failing a connection would stall your process, which would not be acceptable for people with problematic routing. In Node we aim to provide good defaults for the vast majority of people. Excluding some early-stage problems in my first implementation of this feature, I think we are now in good shape. Moreover, since all you need is a simple API call when booting your process, I think we are fine the way we are now.
Defaults matter. I opened this issue because it's happening in our production environment on a 70ms RTT API with datacenter-to-datacenter connectivity after we upgraded to Node v20. The internet isn't perfect and packet loss happens; a single lost packet should not cause Node to fail the connection because the connect timeout is shorter than TCP's retry mechanisms. Having to ask people using Node 20+ to add command line parameters to make connections function normally suggests the defaults are not good.

The default also causes issues for many poorly connected countries: China in particular has very high latency through the GFW, and countries like Australia have high RTT to endpoints hosted in North America and Europe. Entire classes of connections will run into this default - satellite, in-flight Wi-Fi, old-generation cellular networks, etc. will all regularly have over 250ms of overhead and hit this bug. Not everyone in the world is on gigabit fiber connecting to us-east-1, and not every user of Node is a developer who will know to look through GitHub issues to find a command line parameter to fix it.
I also don't understand how the process would stall with a 5-second timeout: as Node is event-based, there should be no blocking socket operations that stall the process. And since Node already uses a long timeout on the final candidate, process stalling would already be an issue with the current implementation.

Regardless of the above arguments, I still consider the behavior described a bug. Node intentionally uses a longer timeout on the final candidate in an attempt to ensure the connection succeeds even if all other candidates have failed. It just so happens that the final candidate is an address family that isn't reachable at all in an IPv4-only environment, so the intended longer timeout never even gets attempted.

If a better default timeout can't be implemented, I would suggest applying the long timeout to the final candidate from each address family instead of just the final candidate overall. This would ensure both IPv4 and IPv6 endpoints can be reached in the event the 250ms timeout is too short for an initial candidate to be viable.
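The per-family suggestion above can be sketched as a small scheduling helper. This is hypothetical code, not Node's implementation; the function name and timeout values are made up for illustration:

```javascript
// Hypothetical sketch of the proposal: give the last candidate of *each*
// address family the long "final" timeout, instead of only the very last
// candidate overall. Not Node's actual implementation.
function attemptTimeouts(candidates, shortMs = 250, finalMs = 10000) {
  const lastIndexOfFamily = new Map();
  candidates.forEach((c, i) => lastIndexOfFamily.set(c.family, i));
  return candidates.map((c, i) =>
    lastIndexOfFamily.get(c.family) === i ? finalMs : shortMs
  );
}

// With [A, AAAA] ordering, the single IPv4 candidate is the last of its
// family, so it gets the long timeout and an IPv4-only host can still connect.
console.log(attemptTimeouts([{ family: 4 }, { family: 6 }]));
// [ 10000, 10000 ]
```

With more candidates per family, only the last of each family gets the long timeout, so the common multi-candidate case keeps the short timeout for early attempts.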
I agree that defaults matter.
I totally agree on that. That's why we use a default timeout which is 2.5× higher than the minimum default that the reference standard suggests.
Neither do I. I live in the middle of nowhere in Italy, definitely far away from any meaningful PoP.
That's probably a good point. Shall we include a better error when all attempts fail? Are you willing to send a PR?
System resources are finite. If each connection takes up to 5 seconds, you might run out of available sockets.
You can provide a PR which will increase the default attempt timeout. I wouldn't go higher than a second. That way we can reach a broader audience and see what people think about it.
I forgot: consider that increasing this timeout will be a semver-major change, so it will not get into Node before 23.0.0.
While it's not a problem for me personally to do this, it's more that I shouldn't need to in the first place. Opening a TCP connection should pretty much "just work" in any programming language, whether it's to a far away server or over a network that has some packet loss. I feel this is a kind of "gotcha" that developers really shouldn't have to worry or even know about.
While 100ms is the minimum recommended delay between attempts, it is not intended to be a timeout. The current standard recommends 250ms between connection attempts, which Node appears to follow; however, Node treats this value as a timeout on the attempt as a whole, when it is intended to be a delay between starting attempts. Specifically, Node does not adhere to this part of the standard:
By closing the first attempt early, it reduces the chances of a successful connection compared to a non-happy-eyeballs path which goes against the entire idea of the standard.
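The delay-versus-timeout distinction can be made concrete with a small sketch. This is a hypothetical helper illustrating the RFC 8305 semantics described above, not Node's code; the function name is made up:

```javascript
// Hypothetical helper illustrating RFC 8305 semantics: the 250 ms value
// staggers attempt *start times*; earlier attempts keep running and may
// still win the race. No attempt gets an individual 250 ms abort deadline.
function planAttempts(addresses, attemptDelayMs = 250) {
  return addresses.map((addr, i) => ({
    addr,
    startAfterMs: i * attemptDelayMs, // when to *start* this attempt
    abortAfterMs: null,               // no per-attempt timeout; only an
                                      // overall connection deadline applies
  }));
}

console.log(planAttempts(['45.79.238.217', '2400:8907::f03c:94ff:fed0:3fc3']));
```

Under these semantics, a slow-but-working IPv4 handshake started at t=0 is still alive at t=250ms when the IPv6 attempt begins, instead of being destroyed.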
Node uses a 10 second connection timeout for hosts without mixed address families, so 5 seconds per attempt per address family seems reasonable to me. The full 5 seconds would also be a worst-case situation where the IPv4 host is actually down or otherwise unreachable; it would be expected that most hosts would respond within 1000ms depending on network conditions. I'll try to get some more eyes on this issue for additional thoughts and feedback.
Well, while I agree in principle, there have not been many reports like this, so I guess the current setup is fine for most people.
Yes, unfortunately this is a known technical limitation of Node and its implementation. The reason is that there is no way for us to cancel a connection that succeeds after another one has already gone through, so we chose to have only one attempt running at any time.
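The serial strategy described here can be sketched as follows (hypothetical code, not lib/net.js itself; the function names are made up):

```javascript
// Hypothetical sketch of serial connection attempts: only one attempt is
// in flight at a time, so a late success never needs to be canceled. If
// every attempt fails, the errors surface as a single AggregateError,
// matching the "ETIMEDOUT AggregateError" reported in this issue.
async function connectSerially(addresses, tryConnect) {
  const errors = [];
  for (const addr of addresses) {
    try {
      // The next attempt starts only after this one fails.
      return await tryConnect(addr);
    } catch (err) {
      errors.push(err);
    }
  }
  throw new AggregateError(errors, 'All connection attempts failed');
}
```

This also shows why the per-attempt timeout matters so much in the serial model: an attempt must be failed before the next candidate is ever tried, unlike the parallel racing model in the RFC.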
The problem is that for a high-volume system, a 5-second timeout would vastly limit the number of sockets you can open per second.
Absolutely. I'm more than happy to change my opinion if more people agree.
This has been discussed elsewhere before, e.g. #52216 and https://github.com/orgs/nodejs/discussions/48028. I very much agree that Node should provide sensible defaults; I can't see how the current defaults are sensible. As a possibly-helpful datapoint, curl and Python are both able to fetch from it. I don't know enough about curl/Python/other runtimes to know what they're doing differently, but it's clear that better default behavior is possible, and I hope that it can work its way into Node, even if it has to wait for a future major version.
It took a while for me to find out the root cause, and then the workaround, of all my ETIMEDOUT errors. My code connects to many different hosts around the world owned by different networks, and some of them would fail with no discernible pattern. I am not familiar with node internals, and I understand there are technical limitations behind fixing this properly, but IMO it still violates the Principle of Least Astonishment.
Many of my customers got hit by the same issue.
This issue popped up for me as an "ETIMEDOUT AggregateError" while doing local dev work on a Next.js project. After 2 days of digging through various open issues on various repos - people suggesting DNS replacement, not using IPv6, all sorts of funky stuff - and finally learning more about IPv6/IPv4 network address resolution, I was able to identify that the issue was Node itself.

I live in a country in the top 10 for network speeds, but far away from the US. Pinging a US-based IP takes 200-300ms. Setting a higher timeout fixed it for me.

Adding a comment here for search visibility, to maybe prevent someone else from losing their mind over why fetch suddenly does not work.
Literally just make it 300ms. Bell curves are a thing: this would solve the problem for almost everyone experiencing this issue, e.g. everyone in Australia connecting to US East.

$ ping 172.217.164.110
I will not object to any change in the default value up to a reasonable default.

Are you willing to send a PR?
I had a look through lib/net.js and your discussion about the implementation. You said you can't stop a socket once it's started, but then isn't there a risk that it accidentally connects in the current implementation? internalConnectMultipleTimeout has

I propose entirely removing the close-on-timeout, allowing parallel connections and stopping sockets in this way after the first has connected. Increasing the timeout is a decent bandaid fix for now, but I'm concerned it will just delay actually implementing this correctly!

Thanks
There is no risk in the current implementation as, differently from the RFC specification, the connections are started serially.
Version
v20.12.2
Platform
Subsystem
No response
What steps will reproduce the bug?
In an IPv4-only environment:
autoSelectFamily defaults to on in Node.js v20.
When trying to establish a connection, this fails to account for a host that has no IPv6 connectivity attempting to connect to a dual-stack host with both A and AAAA records. The DNS server returns responses in the order A, AAAA. Node tries to connect to the A address with only a 250ms timeout, insufficient for many real-world cases (cellular/satellite links, poorly connected ISPs, far-away servers, packet loss, etc.). This times out, so Node proceeds to the last candidate, which is supposed to have a longer timeout; however, the last candidate is an AAAA address and the host has no IPv6 connectivity, so it immediately fails, causing the overall connection to fail.
How often does it reproduce? Is there a required condition?
Reproduced 100% under the correct test conditions.
What is the expected behavior? Why is that the expected behavior?
Connection should be established successfully.
What do you see instead?
Connection times out with ETIMEDOUT AggregateError.
Additional information