Skip to content
This repository has been archived by the owner on Sep 9, 2022. It is now read-only.

Hard Limit the number of hop stream goroutines #74

Merged
merged 2 commits into from
May 7, 2019
Merged

Conversation

vyzo
Copy link
Contributor

@vyzo vyzo commented May 5, 2019

This adds a hard limit to the number of hop goroutines, so that relays don't get overloaded.

Note that the live hop tracking has been removed for two reasons:

  • it was not by any way exported to the outside world, just taking memory
  • there is lock contention

Note 2: DO NOT MERGE AS IS; I will rebase/squash before merging to clean up history.

@vyzo vyzo requested review from Stebalien and whyrusleeping May 5, 2019 08:46
@ghost ghost assigned vyzo May 5, 2019
@ghost ghost added the status/in-progress In progress label May 5, 2019
@vyzo
Copy link
Contributor Author

vyzo commented May 5, 2019

An alternative to hard resetting is to add a new error code for overloaded relays.
But if we do this, we must stop lingering awaiting EOF in handleError; perhaps we should just reset after sending the error.

@vyzo
Copy link
Contributor Author

vyzo commented May 5, 2019

Resetting the stream won't work, the error will not be propagated further.
But we can reduce the EOF timeout to something much more reasonable than the current 1 minute (to say 5s).
Fortunately this is a variable in go-libp2p-net, so the consumer can set it.

@vyzo
Copy link
Contributor Author

vyzo commented May 5, 2019

Implemented the RELAY_OVERLOADED error code, so that we don't do hard resets without notifying the other side.
The relay daemon can set the EOFTimeout to something shorter in order to reduce the goroutine linger time.

@vyzo
Copy link
Contributor Author

vyzo commented May 5, 2019

I reverted to resetting the stream when the hop limit is exceeded -- being nice has deleterious effects in the number of lingering goroutines.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like something we need although I'm a bit worried we should be using per-peer limits instead. At the moment, ~500 peers could fully mesh-connect through the relay to kill it.

relay.go Outdated
lhCount uint64
lhLk sync.Mutex
// atomic counters
sCount int32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could we give these full names? (streamCount?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

relay.go Outdated
@@ -29,6 +30,9 @@ var (
RelayAcceptTimeout = 10 * time.Second
HopConnectTimeout = 30 * time.Second
StopHandshakeTimeout = 1 * time.Minute

HopStreamBuffer = 4096
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultra nit: HopStreamBufferSize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pb/relay.proto Outdated
@@ -21,6 +21,7 @@ message CircuitRelay {
STOP_DST_MULTIADDR_INVALID = 351;
STOP_RELAY_REFUSED = 390;
MALFORMED_MESSAGE = 400;
RELAY_OVERLOADED = 500;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to use this or should we just leave it at a reset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will drop it in the rebase/squash.

relay.go Outdated
@@ -29,6 +30,9 @@ var (
RelayAcceptTimeout = 10 * time.Second
HopConnectTimeout = 30 * time.Second
StopHandshakeTimeout = 1 * time.Minute

HopStreamBuffer = 4096
HopStreamLimit = 1 << 18 // 256K hops for 512K goroutines
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we checked this against our current numbers? This seems kind of low, actually. For 20k peers, this'll give us less than 10 streams per peer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can easily double it -- in fact the mplex relay where I am testing this is running with double the count (it overrides in the daemon init).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try with quadruple the count (hop limit at 1M) and evaluate memory usage with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we are tight on memory with 2M goroutines active.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubled the default, and actual daemons can set it higher if they have the resources.

@vyzo
Copy link
Contributor Author

vyzo commented May 7, 2019

Sounds like something we need although I'm a bit worried we should be using per-peer limits instead. At the moment, ~500 peers could fully mesh-connect through the relay to kill it.

Per-peer limits are a little complex to implement, and need a lock (which I would like to avoid).
We have not observed fully connected meshes in the relays whatsoever; what has been observed is a long tail distribution where most peers have a small number of streams (they are connecting to the peers behind the relay) and then a bunch have a large number of streams (they are accepting connections).

@Stebalien
Copy link
Member

Per-peer limits are a little complex to implement, and need a lock (which I would like to avoid).
We have not observed fully connected meshes in the relays whatsoever; what has been observed is a long tail distribution where most peers have a small number of streams (they are connecting to the peers behind the relay) and then a bunch have a large number of streams (they are accepting connections).

I'm primarily worried about someone attacking a relay this way but this certainly isn't the only way.

@vyzo vyzo force-pushed the feat/hop-limit branch from 9c610be to ee0d6be Compare May 7, 2019 18:01
@vyzo
Copy link
Contributor Author

vyzo commented May 7, 2019

rebased/squashed to just 2 commits and dropped the RELAY_OVERLOADED change in pb.

@vyzo vyzo merged commit 24bc85b into master May 7, 2019
@ghost ghost removed the status/in-progress In progress label May 7, 2019
@vyzo vyzo deleted the feat/hop-limit branch May 7, 2019 18:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants