swarm: implement smart dialing logic #2260

Merged
merged 25 commits into from
Jun 4, 2023

Conversation

sukunrt
Member

@sukunrt sukunrt commented Apr 24, 2023

We consider private, public ip4, public ip6, and relay addresses as separate groups.

Within each group, if a quic address is present, we delay the tcp addresses:
private: 30 ms delay.
public ip4: 300 ms delay.
public ip6: 300 ms delay.
relay: 300 ms delay.

If a quic-v1 address is present, we don't dial a quic or webtransport address on the same (ip, port) combination.
If a tcp address is present, we don't dial a ws or wss address on the same (ip, port) combination.
If both direct and relay addresses are present, all relay addresses are delayed by an additional 500 ms. So if there's a quic relay address and a tcp relay address, the quic relay is delayed by 500 ms and the tcp relay by 800 ms.

All delays are set to 0 for a holepunch request.
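
The delay rules above can be sketched as a small Go function. This is only an illustration under stated assumptions, not the code in this PR: the `group`, `addr`, `tcpDelayMs`, and `dialDelays` names are hypothetical, addresses are reduced to a name plus two flags, and the (ip, port) de-duplication rules are left out.

```go
package main

import "fmt"

// group is the address class used for ranking (hypothetical names).
type group int

const (
	private group = iota
	publicIP4
	publicIP6
	relay
)

// addr is a simplified stand-in for a multiaddr.
type addr struct {
	name  string
	group group
	quic  bool // true for QUIC, false for TCP
}

// tcpDelayMs is the delay applied to TCP when QUIC exists in the same group.
func tcpDelayMs(g group) int {
	if g == private {
		return 30
	}
	return 300
}

// dialDelays returns the dial delay in milliseconds for each address name.
func dialDelays(addrs []addr, holePunch bool) map[string]int {
	hasQUIC := map[group]bool{}
	hasDirect := false
	for _, a := range addrs {
		if a.quic {
			hasQUIC[a.group] = true
		}
		if a.group != relay {
			hasDirect = true
		}
	}
	delays := make(map[string]int, len(addrs))
	for _, a := range addrs {
		d := 0
		if !holePunch { // holepunch requests dial everything immediately
			if !a.quic && hasQUIC[a.group] {
				d = tcpDelayMs(a.group)
			}
			if a.group == relay && hasDirect {
				d += 500 // relays wait longer when a direct address exists
			}
		}
		delays[a.name] = d
	}
	return delays
}

func main() {
	addrs := []addr{
		{"direct-quic", publicIP4, true},
		{"direct-tcp", publicIP4, false},
		{"relay-quic", relay, true},
		{"relay-tcp", relay, false},
	}
	fmt.Println(dialDelays(addrs, false))
}
```

For the address set in `main`, the relay entries come out at 500 ms (quic) and 800 ms (tcp), matching the example in the description.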

closes: #1785

@sukunrt sukunrt marked this pull request as draft April 24, 2023 14:39
@sukunrt sukunrt force-pushed the smart-dialing branch 4 times, most recently from ea7f3b4 to 67dfaba Compare April 25, 2023 13:19
@sukunrt
Member Author

sukunrt commented Apr 25, 2023

Some results from a 1 hour simultaneous run of kubo on the same machine

Total dial cancellations:
old: 4100
new: 1700
Screenshot 2023-04-25 at 7 53 49 PM

kubo2 is old
kubo is new

this is the prometheus query
sum by (job, transport) (increase(libp2p_swarm_dial_errors_total{error="canceled"}[$__rate_interval]))

@sukunrt sukunrt marked this pull request as ready for review April 25, 2023 14:26
@marten-seemann
Contributor

Total dial cancellations:
old: 4100
new: 1700

Impressive numbers! Two questions:

  1. Do you have any idea what the reason for the remaining cancelations is?
  2. Do you have any numbers on connection establishment latency? How much are we adding?

@sukunrt
Member Author

sukunrt commented Apr 25, 2023

Do you have any idea what the reason for the remaining cancelations is?

For some reason there are many quic-draft29 cancellations. Nodes are just reporting a lot of quic addresses and not as many quic-v1 addresses. Still debugging what is causing this.

Do you have any numbers on connection establishment latency? How much are we adding?

I'll have to measure this. The handshake latency metric currently measures the latency from the time of dialing, so I'll have to instrument this number.

@marten-seemann
Contributor

Do you have any idea what the reason for the remaining cancelations is?

For some reason there are many quic-draft29 cancellations. Nodes are just reporting a lot of quic addresses and not as many quic-v1 addresses. Still debugging what is causing this.

Are you dialing quic-v1 and quic-draft29 in parallel? If we have a v1 address, we should never dial draft-29.

Contributor

@marten-seemann marten-seemann left a comment

A few thoughts:

  1. Should we prioritize WebTransport over TCP (in cases where we don't have QUIC)?
  2. Do I understand correctly that we're dialing IPv6 and IPv4 QUIC addresses in parallel?
  3. What happens if a node gives us multiple QUIC IP addresses (of the same address family). Should we just randomly pick one and dial it?

Collaborator

@MarcoPolo MarcoPolo left a comment

A couple of nits, but this looks great!

@@ -342,3 +358,206 @@ func TestDialWorkerLoopConcurrentFailureStress(t *testing.T) {
close(reqch)
worker.wg.Wait()
}

func TestDialWorkerLoopRanking(t *testing.T) {
Collaborator

I always appreciate more tests in this part of the codebase, thanks!

A feature request from me would be to have some sort of generative test here. See testing/quick for the tool. If we could randomly generate test cases and verify that they do what we expect, I'd be much more confident in rolling this out and making future changes.
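
As a rough sketch of what such a generative test could look like, the snippet below checks two properties of a toy ranker with `quick.Check`. The `rankDelays` and `checkRanker` functions are hypothetical stand-ins and not the ranker or the test added in this PR; in a real test file the `quick.Check` call would sit inside a `TestXxx` function.

```go
package main

import (
	"fmt"
	"testing/quick"
)

// rankDelays is a toy ranker: isQUIC[i] says whether address i is QUIC.
// TCP addresses are delayed by 300 ms only when a QUIC address exists.
func rankDelays(isQUIC []bool) []int {
	anyQUIC := false
	for _, q := range isQUIC {
		if q {
			anyQUIC = true
		}
	}
	delays := make([]int, len(isQUIC))
	for i, q := range isQUIC {
		if !q && anyQUIC {
			delays[i] = 300
		}
	}
	return delays
}

// checkRanker runs the properties against randomly generated inputs.
func checkRanker() error {
	property := func(isQUIC []bool) bool {
		delays := rankDelays(isQUIC)
		if len(delays) != len(isQUIC) {
			return false
		}
		immediate := len(isQUIC) == 0
		for i, d := range delays {
			if isQUIC[i] && d != 0 {
				return false // QUIC must never be delayed
			}
			if d == 0 {
				immediate = true
			}
		}
		return immediate // some address is always dialed right away
	}
	return quick.Check(property, nil)
}

func main() {
	fmt.Println("property check error:", checkRanker())
}
```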

Member Author

Added one randomized test using testing/quick. Is this what you had in mind?

@BigLep
Contributor

BigLep commented Apr 28, 2023

Thanks for the work here and for the numbers.

To put the number of cancellations in context, how many total connections were established during this same 1 hour window?

@sukunrt
Member Author

sukunrt commented Apr 29, 2023

@marten-seemann

Should we prioritize WebTransport over TCP (in cases where we don't have QUIC)?

This is the strategy in the current PR

Do I understand correctly that we're dialing IPv6 and IPv4 QUIC addresses in parallel?

Yes, I now think we should change this and dial all ipv4 addresses 300ms after ipv6.
The PR dials in parallel because my isp doesn't support ipv6 and so I didn't understand how to model that. Running kubo on a cloud environment helped here.

What happens if a node gives us multiple QUIC IP addresses (of the same address family). Should we just randomly pick one and dial it?

Excellent idea. I did some experiments and found that if a peer shares an address on port 4001 and another address on a different port, the 4001 address is more likely to be the correct one. So the strategy I've used is to sort the addresses by port number, since nodes are likely to dial out from a much higher port than the one they choose to listen on.
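
A minimal sketch of the port-ordering idea, assuming addresses are reduced to an IP string and a port. The `candidate` type and `sortByPort` function are hypothetical names for illustration, not identifiers from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a simplified address: only the IP and port matter here.
type candidate struct {
	IP   string
	Port int
}

// sortByPort returns a copy ordered by ascending port, so a low listen
// port such as 4001 is tried before a high ephemeral port.
func sortByPort(addrs []candidate) []candidate {
	out := append([]candidate(nil), addrs...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Port < out[j].Port })
	return out
}

func main() {
	addrs := []candidate{{"203.0.113.7", 51234}, {"203.0.113.7", 4001}}
	fmt.Println(sortByPort(addrs))
}
```

The stable sort keeps the original relative order of addresses that share a port, and the input slice is left untouched.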

Some more numbers:

kubo on a t2micro aws instance with both ipv4 and ipv6 support.

happy eyeballs (public == private | quic > tcp | ipv6 > ipv4 ):

This strategy is essentially what @marten-seemann suggests, the only difference being that we prioritise ipv6 over ipv4.

Here we first dial quic addresses and then tcp addresses.
Within a transport group we rank ipv6 over ipv4.
The first address of the group is dialed immediately and all the rest are dialed after 300ms.
The tcp group is dialed 300ms after the last quic dial.
ex: quic1, quic2, quic3, tcp1, tcp2, tcp3
quic1: 0, quic2: 300, quic3: 300, tcp1: 600, tcp2: 900, tcp3: 900
Public and private addresses are dialed in parallel using the same logic.
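
The schedule in the example can be reproduced with a short function. This is a sketch of the rules as described in this comment, with a hypothetical `schedule` helper that takes the already-ranked transport groups and a step in milliseconds.

```go
package main

import "fmt"

// schedule assigns a delay (ms) to every address. Within a group the first
// address starts at the group's start time and the rest start one step
// later; the next group starts one step after the previous group's last dial.
func schedule(groups [][]string, stepMs int) map[string]int {
	delays := map[string]int{}
	start := 0
	for _, g := range groups {
		if len(g) == 0 {
			continue
		}
		last := start
		for i, a := range g {
			d := start
			if i > 0 {
				d = start + stepMs
			}
			delays[a] = d
			last = d
		}
		start = last + stepMs
	}
	return delays
}

func main() {
	groups := [][]string{
		{"quic1", "quic2", "quic3"},
		{"tcp1", "tcp2", "tcp3"},
	}
	fmt.Println(schedule(groups, 300))
}
```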

PR: (ip4 == ip6 == private | quic > tcp )
strategy of the pr. all tcp addresses delayed by 300ms

master: no delay

single-dial (public == private | quic > tcp | ipv6 > ipv4):
same as happy eyeballs, but we dial one address at a time and wait 300ms for a result.

All latency numbers are in milliseconds

Successes is the number of successful outgoing dials which resulted in a connection

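| Strategy | Cancellations | Successes | Cancel Fraction | Latency (50p) | Latency (80p) | Latency (90p) | Latency (95p) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| master | 1950 | 1600 | 0.54 | 90 | 200 | 240 | 310 |
| happy eyeballs | 510 | 1550 | 0.24 | 94 | 219 | 360 | 650 |
| PR | 1050 | 1997 | 0.34 | 93 | 200 | 270 | 538 |
| single-dial | 520 | 1450 | 0.26 | 95 | 212 | 340 | 600 |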
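
The comment does not define Cancel Fraction. The reported values look consistent with cancellations / (cancellations + successes), truncated to two decimals, but that reading is an inference from the numbers and not something stated here. A small check under that assumption, with a hypothetical `cancelPercent` helper that works in whole percent to avoid floating-point comparisons:

```go
package main

import "fmt"

// cancelPercent returns cancellations as a whole-number percentage of
// all finished dials (cancellations + successes), truncated.
// The formula is an assumption inferred from the table.
func cancelPercent(cancellations, successes int) int {
	total := cancellations + successes
	if total == 0 {
		return 0
	}
	return cancellations * 100 / total
}

func main() {
	rows := []struct {
		name                     string
		cancellations, successes int
	}{
		{"master", 1950, 1600},
		{"happy eyeballs", 510, 1550},
		{"PR", 1050, 1997},
		{"single-dial", 520, 1450},
	}
	for _, r := range rows {
		fmt.Printf("%s: %d%%\n", r.name, cancelPercent(r.cancellations, r.successes))
	}
}
```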

I'm still debugging why happy eyeballs latency numbers are worse than single-dial latency numbers.

@BigLep

To put the number of cancellations in context, how many total connections were established during this same 1 hour window?

I'm sorry I somehow deleted prometheus data for that run 🤦‍♂️
But the above numbers are more representative and reproducible. The previous numbers were obtained on my dev machine with an ISP that doesn't support ipv6, and I think simultaneous kubo runs aren't very comparable.

@sukunrt
Member Author

sukunrt commented Apr 29, 2023

@marten-seemann

Do you have any idea what the reason for the remaining cancelations is?

Some cancellations are because the user is cancelling the dials. No successful connection is made.
Some cancellations are because tcp dial succeeds and we cancel the quic dial. Some cancellations are because we had multiple quic dials and had to cancel one of them.

In the graphs below, the first run is master(no delay), the second run is happy-eyeballs, the third run is this pr strategy where all quic addresses are dialed together.

quic- means we cancelled a quic dial and there was no successful connection
quic-tcp means we cancelled a quic dial and there was a successful tcp connection.

Screenshot 2023-04-30 at 12 20 22 AM

Here you can see, there's not much impact on cancellations where there was no successful connection.

Screenshot 2023-04-30 at 12 22 16 AM

Here we can see that tcp-quic(tcp cancelled, quic succeeded) is reduced considerably for both strategies as expected

Screenshot 2023-04-30 at 12 23 22 AM

The happy eyeballs strategy(middle one) considerably reduces quic-quic and quicv1-quicv1 cancellations

Screenshot 2023-04-30 at 12 24 21 AM

None of the strategies has much of an impact when the successful connection was over tcp, as expected.

@p-shahi p-shahi mentioned this pull request May 1, 2023
27 tasks
@sukunrt sukunrt force-pushed the smart-dialing branch 3 times, most recently from f0aa41d to 9e44071 Compare May 7, 2023 12:22
Contributor

@marten-seemann marten-seemann left a comment

I still need to actually understand what dial worker loop is doing. I have to admit I'm pretty lost...

// Clock is a clock that can create timers that trigger at some
// instant rather than some duration
type Clock interface {
Contributor

Do we need to introduce a new interface here? We're using https://github.com/benbjohnson/clock elsewhere in the code base, would it be possible to just reuse that one?

Member Author

We don't need a new interface.
In this specific case I'm using InstantTimer, which is not provided by benbjohnson/clock, but I can use standard Timers. I didn't use benbjohnson/clock because a negative timer was not fired immediately, and I thought we were going to use our own implementation going forward.

I see that benbjohnson/clock#50 is merged. So I don't have any objections to using benbjohnson/clock.

@MarcoPolo what do you think?

Collaborator

@MarcoPolo MarcoPolo May 22, 2023

If you are setting a timer based on an instant in time rather than some duration you should use this clock (which is the case with this diff). The benbjohnson clock will be flaky for this use case because you have two Goroutines that are both trying to use the value returned from Now().

Here's a simple example: that library has an internal ticker and you have your timer handler logic. Your handler wants to reset the timer for 1 minute from the moment it was called (now), and after the library has finished notifying all timers, it'll advance the clock (let's call the advanced clock time the future). If your handler goroutine calls reset before the ticker finishes calling all timers and advancing the clock, you're fine because the timer is registered for now+1min. But if the ticker has already advanced to the future, you're out of luck because you've just registered the timer for the future+1min.

This isn't a problem with the benbjohnson clock, it's actually a problem with trying to mock the timer interface since this only accepts a duration not a time. Which is why this Clock interface lets you define timers that trigger at some point in time rather than by some duration in the future.

Does that make sense? If so, I think we should include this logic in the codebase as a comment for when this comes up again in the future, since it's not super obvious.

Another added bonus is that this mock clock can be implemented in about 100 LoC :)
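
To make the point concrete, here is a minimal sketch of a mock clock whose timers are armed for an absolute instant. It is not the implementation in this PR: the `MockClock`, `NewMockClock`, `AdvanceBy`, and `Reset` shapes below are assumptions for illustration, and only the `InstantTimer` name comes from the discussion above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// InstantTimer fires once the mock clock reaches a fixed instant.
type InstantTimer struct {
	c     *MockClock
	when  time.Time
	fired bool
	ch    chan time.Time
}

// Ch delivers the clock time at which the timer fired.
func (t *InstantTimer) Ch() <-chan time.Time { return t.ch }

// Reset re-arms the timer for a new absolute instant.
func (t *InstantTimer) Reset(when time.Time) {
	t.c.mu.Lock()
	defer t.c.mu.Unlock()
	select { // drop a stale, unread firing
	case <-t.ch:
	default:
	}
	t.when, t.fired = when, false
	t.c.fireDue()
}

// MockClock is a manually advanced clock that starts at the Unix epoch.
type MockClock struct {
	mu     sync.Mutex
	now    time.Time
	timers []*InstantTimer
}

func NewMockClock() *MockClock { return &MockClock{now: time.Unix(0, 0)} }

func (c *MockClock) Now() time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.now
}

// InstantTimer creates a timer that fires at when, or at once if when has passed.
func (c *MockClock) InstantTimer(when time.Time) *InstantTimer {
	c.mu.Lock()
	defer c.mu.Unlock()
	t := &InstantTimer{c: c, when: when, ch: make(chan time.Time, 1)}
	c.timers = append(c.timers, t)
	c.fireDue()
	return t
}

// AdvanceBy moves the clock forward and fires every timer that became due.
func (c *MockClock) AdvanceBy(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.now = c.now.Add(d)
	c.fireDue()
}

// fireDue fires armed timers whose instant is not after now. Caller holds mu.
func (c *MockClock) fireDue() {
	for _, t := range c.timers {
		if !t.fired && !t.when.After(c.now) {
			t.fired = true
			t.ch <- c.now // buffered; at most one pending value per timer
		}
	}
}

func main() {
	c := NewMockClock()
	start := c.Now()
	t := c.InstantTimer(start.Add(time.Minute))
	c.AdvanceBy(2 * time.Minute) // the clock overshoots the first deadline
	<-t.Ch()
	// Re-arm against the intended schedule, not the already-advanced Now().
	t.Reset(start.Add(2 * time.Minute))
	select {
	case <-t.Ch():
		fmt.Println("second deadline fired without drift")
	default:
		fmt.Println("second deadline missed")
	}
}
```

Because the handler passes an instant rather than a duration, it does not matter whether the clock was advanced before or after the re-arm: the timer fires as soon as the clock reaches that instant.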

Member Author

Thanks @MarcoPolo. I didn't realise this case would be flaky. We should add this comment.

@sukunrt sukunrt requested a review from marten-seemann May 12, 2023 12:48
@marten-seemann
Contributor

The fix was released in benbjohnson/[email protected] (release).

Continuing the discussion here: timer.Reset is not handled in v1.3.4 I've raised benbjohnson/clock#55

We can keep the current implementation for now. I'll change it to use benbjohnson/clock when it's merged.

Sounds good to me.

@MarcoPolo
Collaborator

The fix was released in benbjohnson/[email protected] (release).

Continuing the discussion here: timer.Reset is not handled in v1.3.4 I've raised benbjohnson/clock#55
We can keep the current implementation for now. I'll change it to use benbjohnson/clock when it's merged.

Sounds good to me.

Making sure that you both saw my comment here: https://github.com/libp2p/go-libp2p/pull/2260/files/241fd6a912e8ec50e9dadd16e092b4de22885a42#r1201284744

@MarcoPolo
Collaborator

Before merge:

  • Document this change in the CHANGELOG.md file (finally I didn't forget about this).
  • Document how to disable this and why you would want to.

@sukunrt
Member Author

sukunrt commented May 23, 2023

Thanks @MarcoPolo
I've added an entry.
I made the default dial ranker and the no-delay ranker public so that we can point to their godoc for the logic.

Contributor

@marten-seemann marten-seemann left a comment

This looks pretty good. A few suggestions for the metrics.

"refId": "C"
}
],
"title": "Dial Ranking Delay",
Contributor

Can we put the 2 new dashboards in a new row?

Help: "Number of addresses dialed per peer",
// to count histograms with integral values accurately the bucket needs to be
// very narrow around the integer value
Buckets: []float64{0, 0.99, 1, 1.99, 2, 2.99, 3, 3.99, 4, 4.99, 5},
Contributor

Not sure if a histogram is the right abstraction here. We can probably also improve the graph here:
image

What about using a counter here (with label 1, 2, 3, 4, 5, more), and incrementing the respective counter directly?

We could then display this as a pie chart, which would allow us to easily see that X% of connections succeed on the first attempt, Y% on the second one, and so on. That would be more meaningful than percentiles, wouldn't it?
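
A rough sketch of the labeled-counter idea using only the standard library. The `dialCounter` map is a stand-in for a Prometheus counter vector with a single label, and `dialsLabel`, `observe`, and `share` are hypothetical names; none of this is the metrics code in the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// dialsLabel maps the number of addresses dialed for a peer to a label:
// the exact count up to 5, and "more" beyond that.
func dialsLabel(n int) string {
	if n > 5 {
		return "more"
	}
	return strconv.Itoa(n)
}

// dialCounter is a stdlib stand-in for a counter keyed by one label.
type dialCounter map[string]int

// observe increments the counter for the label matching n.
func (c dialCounter) observe(n int) { c[dialsLabel(n)]++ }

// share returns the percentage of observations carrying the given label,
// which is the value a pie chart slice would show.
func (c dialCounter) share(label string) float64 {
	total := 0
	for _, v := range c {
		total += v
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(c[label]) / float64(total)
}

func main() {
	c := dialCounter{}
	for _, n := range []int{1, 1, 1, 2, 7} {
		c.observe(n)
	}
	fmt.Printf("first attempt: %.0f%%\n", c.share("1"))
}
```

Compared with the narrow histogram buckets, each label is incremented directly, so the per-label share needs no bucket arithmetic.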

Member Author

This sounds like a good idea. I'll try it.

Contributor

@marten-seemann marten-seemann left a comment

This is great!
image

Thanks @sukunrt!

@marten-seemann marten-seemann merged commit 6f27081 into libp2p:master Jun 4, 2023
@BigLep
Contributor

BigLep commented Jun 7, 2023

A few things from looking at https://github.com/libp2p/go-libp2p/blob/master/CHANGELOG.md#smart-dialing-

  1. I don't think we're really selling the positive impact. We speak to how there's no/low negative impact, but can we also summarize the positive impact?
  2. There are snapshots of various dashboards in this PR. Are those shareable links?
  3. What's the methodology that we're using for our metric collection in this PR? If I understand correctly, we've spun up a Kubo node with this version of go-libp2p. What is the usage pattern of that Kubo node? What peers is it dialing? Are we triggering anything on that Kubo node to force dialing of other nodes?
  4. The table in swarm: implement smart dialing logic #2260 (comment) was useful earlier. Do we have the latest numbers of the cancellation rate and latency impact of old code vs. new code? (If that's in a dashboard, that's great).

If we don't want to embed that kind of info in the changelog itself, we could give a summary here and link to that comment.

@marten-seemann
Contributor

Thanks Steve, I agree. I've made some changes in #2342.

sukunrt added a commit that referenced this pull request Jun 12, 2023
sukunrt added a commit that referenced this pull request Jun 12, 2023
marten-seemann pushed a commit that referenced this pull request Jun 15, 2023
gts2030 pushed a commit to superblock-dev/go-libp2p that referenced this pull request May 23, 2024
Successfully merging this pull request may close these issues.

proposal: use Happy Eyeballs-like logic for dialing peers
4 participants