Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changelog: improve description of smart dialing #2342

Merged
merged 2 commits into from
Jun 8, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 28 additions & 5 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,15 +14,38 @@
## 🔦 Highlights <!-- omit in toc -->

### Smart Dialing <!-- omit in toc -->
* When connecting to a peer we now do [happy eyeballs](https://www.rfc-editor.org/rfc/rfc8305) like dial prioritisation to prefer QUIC addresses over TCP addresses. We dial the QUIC address first and wait 250ms to dial the TCP address of the peer.
* In our experiments we've seen little impact on latencies up to 80th percentile. 90th and 95th percentile latencies are impacted. For details see discussion on the [PR](https://github.com/libp2p/go-libp2p/pull/2260#issuecomment-1528848170).
* For details of the address ranking logic see godoc for `swarm.DefaultDialRanker`.
* To disable smart dialing and keep the old behaviour use the
`libp2p.NoDelayNetworkDialRanker` option.

This release introduces smart dialing logic. Currently, libp2p dials all addresses of a remote peer in parallel, and
aborts all outstanding dials as soon as the first one succeeds.
Dialing many addresses in parallel creates a lot of churn on the client side, and unnecessary load on the network and
on the server side, and is heavily discouraged by the networking community (see [RFC 8305](https://www.rfc-editor.org/rfc/rfc8305) for example).

When connecting to a peer we first determine the order to dial its addresses. This ranking logic considers a number of corner cases
described in detail in the documentation of the swarm package (`swarm.DefaultDialRanker`).
At a high level, this is what happens:
* If a peer offers a WebTransport and a QUIC address (on the same IP:port), the QUIC address is preferred.
* If a peer has a QUIC and a TCP address, the QUIC address is dialed first. Only if the connection attempt doesn't succeed within 250ms, a TCP connection is started.

Our measurements on the IPFS network show that for >90% of established libp2p connections, the first connection attempt succeeds,
leading a dramatic decrease in the number of aborted connection attempts.
Comment on lines +29 to +30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put numbers to this to inform how dramatic this was? For example, I think the "Cancel fraction" we had in #2260 (comment) was useful.

I'm thinking about what we'd say in a presentation or all hands. Is this reducing the number of "wasted" connections by 50%, 75%, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the numbers again for master branch. This is reducing the number of "wasted" connections by 50%.
With no smart dialing 60% of all connection attempts are cancelled. With smart dialing turned on 30% of all connection attempts are cancelled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can report numbers here, or on the next release when we have black hole detection (or both). I expect the numbers to be significantly better when we have that, as there are quite a few dual stack hosts in the network.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will report the exact numbers when we turn smart dialing on by default.


We also added new metrics to the swarm Grafana dashboard, showing:
* The number of connection attempts it took to establish a connection
* The delay introduced by the ranking logic

This feature should be safe to enable for nodes running in data centers and for most nodes in home networks.
However, there are some (mostly home and corporate networks) that block all UDP traffic. If enabled, the current implementation
of the smart dialing logic will lead to a regression, since it preferes QUIC addresses over TCP addresses. Nodes would still be
able to connect, but connection establishment of the TCP connection would be delayed by 250ms.

In a future release (see #1605 for details), we will introduce a feature called blackhole detection. By observing the outcome of
QUIC connection attempts, we can determine if UDP traffic is blocked (namely, if all QUIC connection attempts fail), and stop
dialing QUIC in this case altogether. Once this detection logic is in place, smart dialing will be enabled by default.

### More Metrics! <!-- omit in toc -->
Since the last release, we've added metrics for:
* [Holepunching](https://github.com/libp2p/go-libp2p/pull/2246)
* Smart Dialing (see above)

### WebTransport <!-- omit in toc -->
* [#2251](https://github.com/libp2p/go-libp2p/pull/2251): Infer public WebTransport address from `quic-v1` addresses if both transports are using the same port for both quic-v1 and WebTransport addresses.
Expand Down