This repository has been archived by the owner on May 26, 2022. It is now read-only.

Rank Dial addresses #212

Merged
merged 9 commits into from
May 20, 2020

@aarshkshah1992 (Collaborator) commented May 5, 2020

  • A dial address is FD consuming when:
    • For non-relay addresses, the transport is TCP or WebSocket.
    • For relay addresses, the transport of the relay server is TCP or WebSocket.
  • The limiter ONLY consumes FDs for non-relay addresses, to avoid double counting.
  • For dialling:
    • We partition the addresses into FD consuming and non-FD consuming addresses.
    • We then sort each of the two sets in descending order of priority: private addresses first, then non-relay public addresses, then relay addresses.
  • We first attempt to dial all the sorted non-FD consuming addresses, and ONLY if we exhaust all of them without getting a connection do we attempt to dial the sorted FD consuming addresses.
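
A minimal sketch of the two-phase dial described above, using plain strings instead of ma.Multiaddr so that it runs on its own. The names consumesFD, partition, dialTwoPhase, dialFunc, and errUnreachable are hypothetical and not part of the swarm. The sketch dials sequentially and leaves out the per-group priority sort, so it only illustrates the "non-FD first, FD as fallback" ordering.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errUnreachable is a placeholder error used by the fake dialer in main.
var errUnreachable = errors.New("unreachable")

// consumesFD is a hypothetical, string-based classifier. Only the part of the
// address before /p2p-circuit is inspected, so a relay address is classified
// by the transport of the relay server.
func consumesFD(addr string) bool {
	head := strings.SplitN(addr, "/p2p-circuit", 2)[0]
	for _, part := range strings.Split(head, "/") {
		if part == "tcp" || part == "ws" {
			return true
		}
	}
	return false
}

// partition splits addrs into non-FD consuming and FD consuming groups,
// preserving the input order inside each group.
func partition(addrs []string) (nonFD, fd []string) {
	for _, a := range addrs {
		if consumesFD(a) {
			fd = append(fd, a)
		} else {
			nonFD = append(nonFD, a)
		}
	}
	return nonFD, fd
}

// dialFunc is a hypothetical dialer that returns nil when the dial succeeds.
type dialFunc func(addr string) error

// dialTwoPhase tries every non-FD consuming address first and falls back to
// the FD consuming ones only if none of them produced a connection.
func dialTwoPhase(addrs []string, dial dialFunc) (string, error) {
	nonFD, fd := partition(addrs)
	failed := 0
	for _, group := range [][]string{nonFD, fd} {
		for _, a := range group {
			if err := dial(a); err == nil {
				return a, nil
			}
			failed++
		}
	}
	return "", fmt.Errorf("all %d dials failed", failed)
}

func main() {
	addrs := []string{
		"/ip4/203.0.113.7/tcp/4001",
		"/ip4/203.0.113.7/udp/4001/quic",
	}
	// Pretend that only the TCP address is reachable.
	dial := func(a string) error {
		if strings.Contains(a, "/tcp/") {
			return nil
		}
		return errUnreachable
	}
	conn, err := dialTwoPhase(addrs, dial)
	fmt.Println(conn, err)
}
```

In this example the QUIC address is attempted first even though it is listed second, and the TCP address is only dialled after that attempt fails.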

swarm_dial.go Outdated
var errNonFd *DialError
var dialErr *DialError

connC, errNonFd = s.dialAddrs(ctx, p, addrsToChanFnc(nonFdAddrs))
Member

We can just get rid of the channel on this interface to simplify it.

Collaborator Author

Agreed. This is done.

swarm_dial.go Outdated
var errFd *DialError
connC, errFd = s.dialAddrs(ctx, p, addrsToChanFnc(fdAddrs))
if errFd != nil {
dialErr = &DialError{Peer: p, DialErrors: append(errNonFd.DialErrors, errFd.DialErrors...),
Member

We should drop additional errors if we're over the maxDialDialErrors limit (not sure what I did with that variable name). We might want to just add a "combine" method to the DialError type.

Collaborator Author

This is done.

swarm_dial.go Outdated
rankAddrsFnc := func(addrs []ma.Multiaddr) []ma.Multiaddr {
var localAddrs []ma.Multiaddr
var relayAddrs []ma.Multiaddr
var others []ma.Multiaddr
Member

I'd pre-allocate and over-allocate (i.e., with length len(addrs)) all of these.

@Stebalien (Member), May 6, 2020

Note: as the rules here grow more complicated, it may be simpler to use sort.Slice().

(can be changed later if necessary)

Collaborator Author

@Stebalien Using sort.Slice now and got rid of the allocs.
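
For illustration, a sketch of what a sort-based ranking could look like at this point in the review. It uses sort.SliceStable rather than sort.Slice so that addresses of the same class keep their input order, which is a choice made for this example. The classify helper and the class constants are hypothetical, string-based stand-ins for checks such as manet.IsPrivateAddr, and only a few IPv4 private prefixes are recognised.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Address classes in descending dial priority. The names are illustrative.
const (
	classPrivate = iota
	classPublic
	classRelay
)

// classify is a hypothetical, simplified classifier working on the textual
// form of a multiaddr.
func classify(addr string) int {
	switch {
	case strings.Contains(addr, "/p2p-circuit"):
		return classRelay
	case strings.HasPrefix(addr, "/ip4/10."),
		strings.HasPrefix(addr, "/ip4/192.168."),
		strings.HasPrefix(addr, "/ip4/127."):
		return classPrivate
	default:
		return classPublic
	}
}

// rankAddrs sorts addrs in place: private first, then public, then relay.
// No extra slices are allocated for the buckets.
func rankAddrs(addrs []string) {
	sort.SliceStable(addrs, func(i, j int) bool {
		return classify(addrs[i]) < classify(addrs[j])
	})
}

func main() {
	addrs := []string{
		"/ip4/198.51.100.2/tcp/4001/p2p/QmRelay/p2p-circuit",
		"/ip4/203.0.113.7/tcp/4001",
		"/ip4/192.168.1.9/tcp/4001",
	}
	rankAddrs(addrs)
	for _, a := range addrs {
		fmt.Println(a)
	}
}
```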

swarm_dial.go Outdated
}
}
return false
}
Member

Hm. I take back what I said before. Let's just simplify this:

  1. If the transport is a "proxy" transport, don't consume a file descriptor token.
  2. Otherwise, just call addrutil.IsFDCostlyTransport.

Collaborator Author

@Stebalien Please see above comment.

For dialing in the limiter, we do NOT consume FDs for proxy addrs.
For classifying an address as FD or non-FD consuming (so we can dial all of the latter before attempting the former), we look at the address of the relay server.

Member

Ah, I see. You're right, we do need to do that.

Member

In that case, I'd still simplify this (given that we only really support circuit addresses anyways):

  1. Modify addrutil.IsFDCostlyTransport to ignore circuit addresses.
  2. When sorting between relay connections when dialing, split on the P_CIRCUIT protocol and call addrutil.IsFDCostlyTransport on the relay part.

That way:

  • We don't need to look at the swarm's transports. You're right in that we should probably expose this information on the transport itself, but we don't do that yet anyways.
  • Can just go "full special case" for circuit addresses.

Does it sound like that will simplify this?

Member

@aarshkshah1992 thoughts on this?

Collaborator Author

@Stebalien I'd replied earlier at #212 (comment).

Reproducing it here:

"We shouldn't leak how our dial uses the addrutil.IsFDCostlyTransport utility into the shared utility itself. I've documented the code better for now and have filed an issue (#214) to simplify/fix this later."

Member

I still think this code needs to be simpler:

  • We're checking to see if the address is a "proxy" address, but we only actually support circuit addresses. We should just split on the circuit address and be done with it.
  • We're hard coding rules for specific transport protocols when we should be able to just say "if protocol X appears in the target multiaddr (before any proxy protocols), the address must consume a file descriptor". That is, we know that the tcp and unix protocols both must consume a file descriptor per connection while udp, memory, etc. protocols don't.

Basically, this code is trying to be general purpose and abstract over transports, but we can't actually do that so it's hard coding a bunch of stuff. Sitting halfway in between is just confusing.
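
A rough sketch of the simpler rule suggested here: split on the circuit component and check whether tcp or unix appears before it. It works on the textual form of a multiaddr so that it is self-contained; real code would presumably walk the multiaddr's protocol components instead of matching strings. The names fdProtocols and isFDConsuming are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// fdProtocols lists the protocol names assumed to need one file descriptor
// per connection. Transports layered on top of them (ws, wss) are covered
// automatically because their addresses also contain tcp.
var fdProtocols = map[string]bool{"tcp": true, "unix": true}

// isFDConsuming reports whether tcp or unix appears before the first
// /p2p-circuit component. For a relay address that means only the relay
// server's part is inspected. String matching is a simplification: a path or
// host component that happens to be named "tcp" would also match.
func isFDConsuming(addr string) bool {
	head := strings.SplitN(addr, "/p2p-circuit", 2)[0]
	for _, part := range strings.Split(head, "/") {
		if fdProtocols[part] {
			return true
		}
	}
	return false
}

func main() {
	for _, a := range []string{
		"/ip4/203.0.113.7/tcp/443/wss",
		"/ip4/203.0.113.7/udp/4001/quic",
		"/ip4/198.51.100.2/udp/4001/quic/p2p/QmRelay/p2p-circuit/ip4/203.0.113.7/tcp/4001",
	} {
		fmt.Println(a, isFDConsuming(a))
	}
}
```

Because the rule keys on the underlying protocol rather than on a list of transports, a WSS address is classified correctly without being named explicitly.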

@aarshkshah1992 (Collaborator Author)

@Stebalien Have addressed your review with some open comments. Please take a look.

dial_error.go Outdated
}

return cbd
}
Member

We should be able to do this with two allocations (slice + error), two appends, some slicing, and no for loops.

Collaborator Author

@Stebalien Turns out we need only one alloc. This is done.
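
A sketch of a loop-free "combine" along the lines discussed in this thread, with a cap on the number of recorded errors. The DialError type below is a simplified stand-in with hypothetical fields, and maxDialErrors is an illustrative value, not the constant used by the swarm. This version uses two allocations (slice and result); the merged code may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// maxDialErrors is a hypothetical cap on how many errors are retained.
const maxDialErrors = 4

// DialError is a simplified stand-in for the swarm's dial error type.
type DialError struct {
	Peer    string
	Errors  []error
	Skipped int // errors dropped because the cap was reached
}

// combine merges a and b without loops: one slice allocation, two appends,
// and one re-slice to enforce the cap. Dropped errors are counted in Skipped.
func combine(a, b *DialError) *DialError {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	merged := make([]error, 0, len(a.Errors)+len(b.Errors))
	merged = append(merged, a.Errors...)
	merged = append(merged, b.Errors...)
	skipped := a.Skipped + b.Skipped
	if len(merged) > maxDialErrors {
		skipped += len(merged) - maxDialErrors
		merged = merged[:maxDialErrors]
	}
	return &DialError{Peer: a.Peer, Errors: merged, Skipped: skipped}
}

func main() {
	e := errors.New("dial failed")
	nonFD := &DialError{Peer: "QmPeer", Errors: []error{e, e, e}}
	fd := &DialError{Peer: "QmPeer", Errors: []error{e, e, e}}
	c := combine(nonFD, fd)
	fmt.Println(len(c.Errors), c.Skipped)
}
```

The nil checks also cover the case where only one of the two dial phases produced an error.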

@aarshkshah1992 (Collaborator Author) commented May 11, 2020

@Stebalien We shouldn't leak how our dial uses the addrutil.IsFDCostlyTransport utility into the shared utility itself.

I've documented the code better for now and have filed an issue (#214) to simplify/fix this later.

Have made the other changes. Please take a look when you can.

swarm_dial.go Outdated
var errNonFd *DialError
var dialErr *DialError

connC, errNonFd = s.dialAddrs(ctx, p, nonFdAddrs)
@raulk (Member), May 12, 2020

I don't think this is a good idea. QUIC is a UDP-based protocol. UDP is unreliable, it has no connection establishment flow like TCP, the feedback from lack of connectivity/port closed is not immediate like with TCP. If we dialled an IP that was routable but a bad/closed port, I suspect we'd have to wait for the full connection timeout to notice (5 seconds?), whereas with TCP we'd notice during the TCP handshake (much sooner). Same applies if firewalls/routers are dropping UDP packets.

What does this buy us? If the ultimate goal is to select a succeeding QUIC connection over a succeeding TCP connection, sequentialising the dials is going to cause more pain than gain IMO.

Member

The TCP handshake is a layer below TCP reliability. It's all IP at that point:

  • The feedback for a UDP packet sent to a closed port is the same as the feedback for TCP: an ICMP unreachable packet. Unfortunately, many home firewalls will just drop packets so you'll get no feedback either way.
  • If a SYN gets dropped, it will be resent eventually (2s?). I assume QUIC does the same.

The real question is whether or not the ICMP unreachable packet is exposed to the reader. @marten-seemann, any ideas on this?

"What does this buy us?"

It:

  • Avoids tying up slots in the file descriptor dial limiter needed by peers that only support TCP.
  • Avoids spraying SYN packets (NATs don't like that).

However, it's not critical. I'm fine landing a version of this patch that simply prioritizes UDP connections.

Contributor

"The TCP handshake is a layer below TCP reliability. It's all IP at that point"

That is correct.

"If a SYN gets dropped, it will be resent eventually (2s?). I assume QUIC does the same."

This is the order of magnitude that the TCP RFC recommends, but most implementations use a much shorter value. Similarly, the QUIC draft was changed after discussing with the TCP folks to use the same value, but not many implementations are willing to take that performance hit. quic-go retransmits two copies of the Initial packet after 200ms (with exponential backoff after that).

"The real question is whether or not the ICMP unreachable packet is exposed to the reader. @marten-seemann, any ideas on this?"

For UDP, the kernel will only deliver ICMP packets if you're using a connected socket.

You can't really rely on ICMP packets anyway, since they're frequently dropped (or mis-routed). Furthermore, as ICMP packets are not authenticated, it would be inadvisable to take any action, except for the very early stages of the handshake.

@aarshkshah1992 changed the title from "[WIP] Swarm Dial Priorities" to "Rank Dial addresses" on May 14, 2020
@aarshkshah1992 (Collaborator Author)

@steb @raulk We no longer wait for UDP addresses to finish before dialling TCP addresses. Please take a look at the PR when you can.

swarm_dial.go Outdated
// try to get a connection to any addr
connC, dialErr := s.dialAddrs(ctx, p, goodAddrsChan)
// sorts addresses in descending order of preference for dialing
// Private UDP > Private TCP > Public UDP > Public TCP > UDP Relay server > TCP Relay server
Member

I'd actually say public UDP goes before private TCP as TCP dials can get blocked on the fd limiter.

Collaborator Author

That makes sense. Have made the change.

@aarshkshah1992 (Collaborator Author) commented May 15, 2020

@Stebalien Have addressed your review and we now rank Public UDP addresses above private TCP addresses. Have also replied to your comment about simplifying the FD consuming address func at:

#212 (comment)

Please take a look when you can.

@aarshkshah1992 force-pushed the feat/dial-priorities branch from 1f9d2a7 to da0e646 on May 18, 2020 12:25
@aarshkshah1992 (Collaborator Author)

ping @Stebalien for one last look.

swarm_dial.go Outdated
var fdConsumingTptProtos = map[int]struct{}{
ma.P_WS: struct{}{},
ma.P_TCP: struct{}{},
}
Member

I'd strongly prefer to just check if the address contains tcp and/or unix (i.e., an fd-consuming transport) before any proxy addresses. Otherwise, we're going to miss transports (e.g., WSS).

@aarshkshah1992 (Collaborator Author), May 19, 2020

I think I grok what you mean here and in #212 (comment) and it makes sense. I've made the changes.

return 4
}
return 6
}
Member

nit: Numeric scores are hard to read and easy to get wrong. Honestly, now that I see this, I'd go back to sorting these addresses into separate slices then concatenating them like you were doing before (sorry for going back and forth on this).

Note: This is fine as-is, just a bit confusing.

Collaborator Author

I agree. I think that way is easier to read. This is done.

@aarshkshah1992 (Collaborator Author)

@Stebalien I've addressed all your concerns and it looks simpler/cleaner now. Please take a look.

@Stebalien (Member) left a comment

One small change. Otherwise, LGTM!

swarm_dial.go Outdated
var othersFd []ma.Multiaddr // public fd consuming

for _, a := range addrs {
if manet.IsPrivateAddr(a) {
Member

I think this will include circuit addrs. We should probably check for that first.

Collaborator Author

Good catch. This is done.
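
A sketch of the bucket-based ranking the review converged on, with the relay check placed before the private-address check as requested here. The order used is my reading of the thread (private UDP, public UDP, private FD, public FD, UDP relay, FD relay) and may not match the merged code exactly. The helpers isRelay, isPrivate, and usesFD are hypothetical, string-based stand-ins for the multiaddr checks, and isPrivate only knows a few IPv4 prefixes.

```go
package main

import (
	"fmt"
	"strings"
)

func isRelay(addr string) bool { return strings.Contains(addr, "/p2p-circuit") }

// isPrivate is a simplified stand-in for manet.IsPrivateAddr.
func isPrivate(addr string) bool {
	return strings.HasPrefix(addr, "/ip4/10.") ||
		strings.HasPrefix(addr, "/ip4/192.168.") ||
		strings.HasPrefix(addr, "/ip4/127.")
}

// usesFD checks for tcp or unix before any /p2p-circuit component.
func usesFD(addr string) bool {
	head := strings.SplitN(addr, "/p2p-circuit", 2)[0]
	for _, part := range strings.Split(head, "/") {
		if part == "tcp" || part == "unix" {
			return true
		}
	}
	return false
}

// rankAddrs sorts addresses into named buckets and concatenates them in
// descending order of preference. Relay addresses are matched first so that a
// relay reachable on a private IP is not ranked as a private address.
func rankAddrs(addrs []string) []string {
	var privUDP, pubUDP, privFD, pubFD, relayUDP, relayFD []string
	for _, a := range addrs {
		switch {
		case isRelay(a) && usesFD(a):
			relayFD = append(relayFD, a)
		case isRelay(a):
			relayUDP = append(relayUDP, a)
		case isPrivate(a) && usesFD(a):
			privFD = append(privFD, a)
		case isPrivate(a):
			privUDP = append(privUDP, a)
		case usesFD(a):
			pubFD = append(pubFD, a)
		default:
			pubUDP = append(pubUDP, a)
		}
	}
	ranked := make([]string, 0, len(addrs))
	for _, b := range [][]string{privUDP, pubUDP, privFD, pubFD, relayUDP, relayFD} {
		ranked = append(ranked, b...)
	}
	return ranked
}

func main() {
	for _, a := range rankAddrs([]string{
		"/ip4/203.0.113.7/tcp/4001",
		"/ip4/192.168.1.5/tcp/4001/p2p/QmRelay/p2p-circuit",
		"/ip4/10.0.0.3/udp/4001/quic",
	}) {
		fmt.Println(a)
	}
}
```

Named buckets make the final order readable in one place (the slice literal in rankAddrs), which is the readability point raised against numeric scores.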

@aarshkshah1992 aarshkshah1992 merged commit fbfe382 into master May 20, 2020
@aarshkshah1992 aarshkshah1992 deleted the feat/dial-priorities branch May 20, 2020 09:04
@Stebalien Stebalien mentioned this pull request May 26, 2020
77 tasks