-
Notifications
You must be signed in to change notification settings - Fork 37
Conversation
aarshkshah1992
commented
May 5, 2020
•
edited
Loading
edited
- A dial address is FD consuming when:
- For non-relay addresses, the transport is TCP or WebSocket.
- For relay addresses, the transport of the Relay Server is TCP or Websocket.
- The limiter ONLY consumes FD's for non-relay addresses to avoid double counting.
- For dialling:
- We partition the addresses into FD consuming and Non-FD consuming addresses.
- We then sort each of two sets in descending order of priority such that private addresses have the highest priority followed by non-relay public addresses followed by relay addresses.
- We first attempt to dial all the sorted Non-FD consuming addresses and ONLY after/if we exhaust all of them without getting a connection, we attempt to dial the sorted FD consuming addresses.
swarm_dial.go
Outdated
var errNonFd *DialError | ||
var dialErr *DialError | ||
|
||
connC, errNonFd = s.dialAddrs(ctx, p, addrsToChanFnc(nonFdAddrs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just get rid of the channel on this interface to simplify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This is done.
swarm_dial.go
Outdated
var errFd *DialError | ||
connC, errFd = s.dialAddrs(ctx, p, addrsToChanFnc(fdAddrs)) | ||
if errFd != nil { | ||
dialErr = &DialError{Peer: p, DialErrors: append(errNonFd.DialErrors, errFd.DialErrors...), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should drop additional errors if we're over the maxDialDialErrors
limit (not sure what I did with that variable name). We might want to just add a "combine" method to the DialError
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done.
swarm_dial.go
Outdated
rankAddrsFnc := func(addrs []ma.Multiaddr) []ma.Multiaddr { | ||
var localAddrs []ma.Multiaddr | ||
var relayAddrs []ma.Multiaddr | ||
var others []ma.Multiaddr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd pre-allocate and over-allocate (i.e., with length len(addrs)
) all of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: as the rules here grow more complicated, it may be simpler to use sort.Slice()
.
(can be changed later if necessary)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stebalien Using sort.Slice now
and got rid of the allocs.
swarm_dial.go
Outdated
} | ||
} | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I take back what I said before. Let's just simplify this:
- If the transport is a "proxy" transport, don't consume a file descriptor token.
- Otherwise, just call
addrutil.IsFDCostlyTransport
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stebalien Please see above comment.
For dialing in the limiter, we do NOT consume fd's for proxy addrs.
For classifying an address as FD or Non-FD consuming so we can dial all the latter ones before attempting the former, we look at the address of the relay server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. You're right, we do need to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I'd still simplify this (given that we only really support circuit addresses anyways):
- Modify
addrutil.IsFDCostlyTransport
to ignore circuit addresses. - When sorting between relay connections when dialing, split on the
P_CIRCUIT
protocol and calladdrutil.IsFDCostlyTransport
on the relay part.
That way:
- We don't need to look at the swarm's transports. You're right in that we should probably expose this information on the transport itself, but we don't do that yet anyways.
- Can just go "full special case" for circuit addresses.
Does it sound like that will simplify this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aarshkshah1992 thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stebalien I'd replied earlier at #212 (comment).
Reproducing it here:
"We shouldn't leak how our dial uses the addrutil.IsFDCostlyTransport utility into the
shared utility itself. I've documented the code better for now and have filed an issue(#214 ) to simplify/fix this later.
".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this code needs to be simpler:
- We're checking to see if the address is a "proxy" address, but we only actually support circuit addresses. We should just split on the circuit address and be done with it.
- We're hard coding rules for specific transport protocols when we should be able to just say "if protocol X appears in the target multiaddr (before any proxy protocols), the address must consume a file descriptor". That is, we know that the tcp and unix protocols both must consume a file descriptor per connection while udp, memory, etc. protocols don't.
Basically, this code is trying to be general purpose and abstract over transports, but we can't actually do that so it's hard coding a bunch of stuff. Sitting halfway in between is just confusing.
@Stebalien Have addressed your review with some open comments. Please take a look. |
dial_error.go
Outdated
} | ||
|
||
return cbd | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to do this with two allocations (slice + error), two appends, some some slicing, and no for loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stebalien Turns out we need only one alloc. This is done.
swarm_dial.go
Outdated
} | ||
} | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. You're right, we do need to do that.
swarm_dial.go
Outdated
} | ||
} | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I'd still simplify this (given that we only really support circuit addresses anyways):
- Modify
addrutil.IsFDCostlyTransport
to ignore circuit addresses. - When sorting between relay connections when dialing, split on the
P_CIRCUIT
protocol and calladdrutil.IsFDCostlyTransport
on the relay part.
That way:
- We don't need to look at the swarm's transports. You're right in that we should probably expose this information on the transport itself, but we don't do that yet anyways.
- Can just go "full special case" for circuit addresses.
Does it sound like that will simplify this?
@Stebalien We shouldn't leak how our dial uses the I've documented the code better for now and have filed an issue(#214 ) to simplify/fix this later. Have made the other changes. Please take a look when you can. |
swarm_dial.go
Outdated
var errNonFd *DialError | ||
var dialErr *DialError | ||
|
||
connC, errNonFd = s.dialAddrs(ctx, p, nonFdAddrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea. QUIC is a UDP-based protocol. UDP is unreliable, it has no connection establishment flow like TCP, the feedback from lack of connectivity/port closed is not immediate like with TCP. If we dialled an IP that was routable but a bad/closed port, I suspect we'd have to wait for the full connection timeout to notice (5 seconds?), whereas with TCP we'd notice during the TCP handshake (much sooner). Same applies if firewalls/routers are dropping UDP packets.
What does this buy us? If the ultimate goal is to select a succeeding a QUIC connection over a succeeding TCP connection, sequentialising the dials is going to cause more pain than gain IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TCP handshake is a layer below TCP reliability. It's all IP at that point:
- The feedback for a UDP packet sent to a closed port is the same as the feedback for TCP: an ICMP unreachable packet. Unfortunately, many home firewalls will just drop packets so you'll get no feedback either way.
- If a SYN gets dropped, it will be resent eventually (2s?). I assume QUIC does the same.
The real question is whether or not the ICMP unreachable packet this is exposed to the reader. @marten-seemann, any ideas on this?
What does this buy us?
It:
- Avoids tying up slots in the file descriptor dial limiter needed by peers that only support TCP.
- Avoids spraying SYN packets (NATs don't like that).
However, it's not critical. I'm fine landing a version of this patch that simply prioritizes UDP connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TCP handshake is a layer below TCP reliability. It's all IP at that point
That is correct.
If a SYN gets dropped, it will be resent eventually (2s?). I assume QUIC does the same.
This is the order of magnitute that the TCP RFC recommends, but most implementations use a much shorter value. Similarly, the QUIC draft was changed after discussing with the TCP folks to use the same value, but not many implementations are willing to take that performance hit. quic-go retransmits two copies of the Initial packet after 200ms (with exponential backoff after that).
The real question is whether or not the ICMP unreachable packet this is exposed to the reader. @marten-seemann, any ideas on this?
For UDP, the kernel will only deliver ICMP packets if you're using a connected socket.
You can't really rely on ICMP packets anyway, since they're frequently dropped (or mis-routed). Furthermore, as ICMP packets are not authenticated, it would be inadvisable to take any action, except for the very early stages of the handshake.
swarm_dial.go
Outdated
// try to get a connection to any addr | ||
connC, dialErr := s.dialAddrs(ctx, p, goodAddrsChan) | ||
// sorts addresses in descending order of preference for dialing | ||
// Private UDP > Private TCP > Public UDP > Public TCP > UDP Relay server > TCP Relay server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd actually say public UDP goes before private TCP as TCP dials can get blocked on the fd limiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Have made the change.
swarm_dial.go
Outdated
} | ||
} | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aarshkshah1992 thoughts on this?
@Stebalien Have addressed your review and we now rank Public UDP addresses above private TCP addresses. Have also replied to your comment about simplifying the FD consuming address func at: Please take a look when you can. |
1f9d2a7
to
da0e646
Compare
ping @Stebalien for one last look. |
swarm_dial.go
Outdated
var fdConsumingTptProtos = map[int]struct{}{ | ||
ma.P_WS: struct{}{}, | ||
ma.P_TCP: struct{}{}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd strongly prefer to just check if the address contains tcp
and/or unix
(i.e., a fd consuming transports) before any proxy addresses. Otherwise, we're going to miss transports (e.g., WSS).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I grok what you mean here and in #212 (comment) and it makes sense. I've made the changes.
swarm_dial.go
Outdated
} | ||
} | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this code needs to be simpler:
- We're checking to see if the address is a "proxy" address, but we only actually support circuit addresses. We should just split on the circuit address and be done with it.
- We're hard coding rules for specific transport protocols when we should be able to just say "if protocol X appears in the target multiaddr (before any proxy protocols), the address must consume a file descriptor". That is, we know that the tcp and unix protocols both must consume a file descriptor per connection while udp, memory, etc. protocols don't.
Basically, this code is trying to be general purpose and abstract over transports, but we can't actually do that so it's hard coding a bunch of stuff. Sitting halfway in between is just confusing.
return 4 | ||
} | ||
return 6 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Numeric scores are hard to read and easy to get wrong. Honestly, now that I see this, I'd go back to sorting these addresses into separate slices then concatenating them like you were doing before (sorry for going back and forth on this).
Note: This is fine as-is, just a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I think that way is easier to read. This is done.
@Stebalien I've addressed all your concerns and it looks simpler/cleaner now. Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small change. Otherwise, LGTM!
swarm_dial.go
Outdated
var othersFd []ma.Multiaddr // public fd consuming | ||
|
||
for _, a := range addrs { | ||
if manet.IsPrivateAddr(a) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will include circuit addrs. We should probably check for that first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This is done.