-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
swarm: implement smart dialing logic #2260
Conversation
ea7f3b4
to
67dfaba
Compare
Impressive numbers! Two questions:
|
For some reason there are many quic-draft29 cancellations. Nodes are just reporting a lot of quic addresses and not as many quic-v1 addresses. Still debugging what is causing this.
I'll have to measure this. The handshake latency metric currently measures the latency from the time of dialing, so I'll have to instrument this number. |
Are you dialing quic-v1 and quic-draft29 in parallel? If we have a v1 address, we should never dial draft-29. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts:
- Should we prioritize WebTransport over TCP (in cases where we don't have QUIC)?
- Do I understand correctly that we're dialing IPv6 and IPv4 QUIC addresses in parallel?
- What happens if a node gives us multiple QUIC IP addresses (of the same address family). Should we just randomly pick one and dial it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits, but this looks great!
p2p/net/swarm/dial_worker_test.go
Outdated
@@ -342,3 +358,206 @@ func TestDialWorkerLoopConcurrentFailureStress(t *testing.T) { | |||
close(reqch) | |||
worker.wg.Wait() | |||
} | |||
|
|||
func TestDialWorkerLoopRanking(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always appreciate more tests in this part of the codebase, thanks!
A feature request for me would be to have some sort of generative test here. See testing/quick
for the tool. If we could randomly generate test cases and verify that do what we expect, I'd be much more confident in rolling this out and making future changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added one randomized test using testing/quick. Is this what you had in mind?
Thanks for the work here and for the numbers. To put the number of cancellations in context, how many total connections were established during this same 1 hour window? |
This is the strategy in the current PR
Yes, I now think we should change this and dial all ipv4 addresses 300ms after ipv6.
Excellent idea. I did some experiments and found that if a peer shares a 4001 port address and another port address, the 4001 address is more likely to be the correct one. So the strategy I've used is to sort the addresses by port number, it is likely that nodes will dial out of a much higher port than the one they choose to listen on. Some more numbers: kubo on a t2micro aws instance with both ipv4 and ipv6 support. happy eyeballs (public == private | quic > tcp | ipv6 > ipv4 ): This strategy is essentially what @marten-seemann suggests only difference being that we prioritise ipv6 over ipv4 here we first use quic addresses and then use tcp addresses PR: (ip4 == ip6 == private | quic > tcp ) master: no delay single-dial (public == private | quic > tcp | ipv6 > ipv4): All latency numbers are in milliseconds Successes is the number of successful outgoing dials which resulted in a connection
I'm still debugging why happy eyeballs latency numbers are worse than single-dial latency numbers.
I'm sorry I somehow deleted prometheus data for that run 🤦♂️ |
Some cancellations are because the user is cancelling the dials. No successful connection is made. In the graphs below, the first run is master(no delay), the second run is happy-eyeballs, the third run is this pr strategy where all quic addresses are dialed together. quic- means we cancelled a quic dial and there was no successful connection Here you can see, there's not much impact on cancellations where there was no successful connection. Here we can see that tcp-quic(tcp cancelled, quic succeeded) is reduced considerably for both strategies as expected The happy eyeballs strategy(middle one) considerably reduces quic-quic and quicv1-quicv1 cancellations None of the strategies have much of an impact in case the successful connection was over tcp. as expected. |
f0aa41d
to
9e44071
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to actually understand what dial worker loop is doing. I have to admit I'm pretty lost...
|
||
// Clock is a clock that can create timers that trigger at some | ||
// instant rather than some duration | ||
type Clock interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to introduce a new interface here? We're using https://github.com/benbjohnson/clock elsewhere in the code base, would it be possible to just reuse that one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need a new interface.
In this specific case I'm using InstantTimer
which is not provided by benbjohnson/clock, but I can use standard Timers. I didn't use benbjohnson/clock because of the negative timer not being fired immediately and I thought we were going to use our own implementation going forward.
I see that benbjohnson/clock#50 is merged. So I don't have any objections to using benbjohnson/clock.
@MarcoPolo what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix was released in https://github.com/benbjohnson/clock/releases/tag/v1.3.4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are setting a timer based on an instant in time rather than some duration you should use this clock (which is the case with this diff). The benbjohnson clock will be flaky for this use case because you have two Goroutines that are both trying to use the value returned from Now()
.
Here's a simple example: that library has an internal ticker and you have your timer handler logic. Your handler wants to reset the timer for 1 minute from the moment it was called (now), and after the library has finished notifying all timers, it'll advance the clock (let's call the advanced clock time the future). If the your handler goroutine calls reset before the ticker finishes calling all timers and advancing the clock, you're fine because now the timer is registered for now+1min. But if the ticker advanced to the future you're out of luck because you've just registered the timer for the future+1min.
This isn't a problem with the benbjohnson clock, it's actually a problem with trying to mock the timer interface since this only accepts a duration not a time. Which is why this Clock
interface lets you define timers that trigger at some point in time rather than by some duration in the future.
Does that make sense? If so I think we should include this logic in the codebase as a comment when this comes up again in the future, since it's not super obvious.
Another added bonus is that this mock clock can be implemented in about 100 LoC :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoPolo. I didn't realise this case would be flaky. We should add this comment.
Sounds good to me. |
clean up redundant address filtering committed
Making sure that you both saw my comment here: https://github.com/libp2p/go-libp2p/pull/2260/files/241fd6a912e8ec50e9dadd16e092b4de22885a42#r1201284744 |
Before merge:
|
Thanks @MarcoPolo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. A few suggestions for the metrics.
"refId": "C" | ||
} | ||
], | ||
"title": "Dial Ranking Delay", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the 2 new dashboards in a new row?
p2p/net/swarm/swarm_metrics.go
Outdated
Help: "Number of addresses dialed per peer", | ||
// to count histograms with integral values accurately the bucket needs to be | ||
// very narrow around the integer value | ||
Buckets: []float64{0, 0.99, 1, 1.99, 2, 2.99, 3, 3.99, 4, 4.99, 5}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if a histogram is the right abstraction here. We can probably also improve the graph here:
What about using a counter here (with label 1, 2, 3, 4, 5, more), and incrementing the respective counter directly?
We could then display this a pie chart, which would allows to easily see that X% of connections succeed on the first attempt, Y% on the second one, and so on. That would be more meaningful than percentiles, wouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a good idea. I'll try it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sukunrt!
A few things from looking at https://github.com/libp2p/go-libp2p/blob/master/CHANGELOG.md#smart-dialing-
If we don't want to embed that kind of info in the changelog itself, we could give a summary here and link to that comment. |
Thanks Steve, I agree. I've made some changes in #2342. |
originally explained by @MarcoPolo here: #2260 (comment)
originally explained by @MarcoPolo here: #2260 (comment)
originally explained by @MarcoPolo here: #2260 (comment)
originally explained by @MarcoPolo here: libp2p/go-libp2p#2260 (comment)
we consider private, public ip4, public ip6, relay separately.
In each group, if a quic address is present, we delay tcp addresses.
private: 30 ms delay.
public ip4: 300 ms delay.
public ip6: 300 ms delay.
relay: 300 ms delay.
If a quic-v1 address is present we don't dial quic or webtransport address on the same (ip,port) combination.
If a tcp address is present we don't dial ws or wss address on the same (ip, port) combination.
If both direct and relay addresses are present, all relay addresses are delayed by an additional 500ms. So if there's a quic relay and a tcp relay address, quic relay will be delayed by 500ms and tcp relay will be delayed by 800 ms.
All delays are set to 0 for a holepunch request.
closes: #1785