From c9de1665054229bdfd40884cd0b893744ec8ef7e Mon Sep 17 00:00:00 2001 From: Marten Seemann Date: Thu, 8 Jun 2023 19:39:54 +0300 Subject: [PATCH] changelog: improve description of smart dialing (#2342) --- CHANGELOG.md | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 8d9b35356c..2306f0ff89 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -14,15 +14,38 @@ ## 🔦 Highlights ### Smart Dialing -* When connecting to a peer we now do [happy eyeballs](https://www.rfc-editor.org/rfc/rfc8305) like dial prioritisation to prefer QUIC addresses over TCP addresses. We dial the QUIC address first and wait 250ms to dial the TCP address of the peer. -* In our experiments we've seen little impact on latencies up to 80th percentile. 90th and 95th percentile latencies are impacted. For details see discussion on the [PR](https://github.com/libp2p/go-libp2p/pull/2260#issuecomment-1528848170). -* For details of the address ranking logic see godoc for `swarm.DefaultDialRanker`. -* To disable smart dialing and keep the old behaviour use the -`libp2p.NoDelayNetworkDialRanker` option. + +This release introduces smart dialing logic. Currently, libp2p dials all addresses of a remote peer in parallel, and +aborts all outstanding dials as soon as the first one succeeds. +Dialing many addresses in parallel creates a lot of churn on the client side, and unnecessary load on the network and +on the server side, and is heavily discouraged by the networking community (see [RFC 8305](https://www.rfc-editor.org/rfc/rfc8305) for example). + +When connecting to a peer we first determine the order to dial its addresses. This ranking logic considers a number of corner cases +described in detail in the documentation of the swarm package (`swarm.DefaultDialRanker`). +At a high level, this is what happens: +* If a peer offers a WebTransport and a QUIC address (on the same IP:port), the QUIC address is preferred. +* If a peer has a QUIC and a TCP address, the QUIC address is dialed first. Only if the connection attempt doesn't succeed within 250ms, a TCP connection is started. + +Our measurements on the IPFS network show that for >90% of established libp2p connections, the first connection attempt succeeds, +leading a dramatic decrease in the number of aborted connection attempts. + +We also added new metrics to the swarm Grafana dashboard, showing: +* The number of connection attempts it took to establish a connection +* The delay introduced by the ranking logic + +This feature should be safe to enable for nodes running in data centers and for most nodes in home networks. +However, there are some (mostly home and corporate networks) that block all UDP traffic. If enabled, the current implementation +of the smart dialing logic will lead to a regression, since it preferes QUIC addresses over TCP addresses. Nodes would still be +able to connect, but connection establishment of the TCP connection would be delayed by 250ms. + +In a future release (see #1605 for details), we will introduce a feature called blackhole detection. By observing the outcome of +QUIC connection attempts, we can determine if UDP traffic is blocked (namely, if all QUIC connection attempts fail), and stop +dialing QUIC in this case altogether. Once this detection logic is in place, smart dialing will be enabled by default. ### More Metrics! Since the last release, we've added metrics for: * [Holepunching](https://github.com/libp2p/go-libp2p/pull/2246) +* Smart Dialing (see above) ### WebTransport * [#2251](https://github.com/libp2p/go-libp2p/pull/2251): Infer public WebTransport address from `quic-v1` addresses if both transports are using the same port for both quic-v1 and WebTransport addresses.