Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[server]: Start Peers Asynchronously #1658

Merged
merged 4 commits into from
Aug 16, 2018

Conversation

cfromknecht
Copy link
Contributor

@cfromknecht cfromknecht commented Jul 31, 2018

This PR adds asynchronous starting of peers,
in order to avoid potential DOS vectors. Currently,
we block with the server's mutex while peers exchange
Init messages and perform other setup. Thus, a remote
peer that does not reply with an init message will
cause server to block for 15s per attempt.

We also modify the startup behavior to spawn
peerTerminationWatchers before starting the
peer itself, ensuring that a peer is properly
cleaned up if the initialization fails. Currently,
failing to start a peer does not execute the bulk
of the teardown logic, since it is not spawned
until after a successful Start occurs.

The final commit is purely just a code move to place
the relevant methods in closer proximity, and
organize them roughly in the expected execution
order.

Prerequesities:

@cfromknecht cfromknecht self-assigned this Jul 31, 2018
@cfromknecht cfromknecht added p2p Code related to the peer-to-peer behaviour safety General label for issues/PRs related to the safety of using the software dos/hardening Related to the resilience of LND against denial of service or other related attacks server needs review PR needs review by regular contributors labels Jul 31, 2018
@Roasbeef
Copy link
Member

Running this on the faucet now!

@cfromknecht cfromknecht added this to the 0.5 milestone Jul 31, 2018
@cfromknecht cfromknecht added the P2 should be fixed if one has time label Jul 31, 2018
@cfromknecht
Copy link
Contributor Author

Added commit to bump peer write timeout to 50s, and rebased on master

halseth
halseth previously approved these changes Aug 1, 2018
Copy link
Contributor

@halseth halseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

server.go Outdated
@@ -1719,10 +1719,10 @@ func (s *server) findPeerByPubStr(pubStr string) (*peer, error) {
// the cleanup routine to exit early.
//
// NOTE: This MUST be launched as a goroutine.
func (s *server) peerTerminationWatcher(p *peer) {
func (s *server) peerTerminationWatcher(p *peer, ready chan struct{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready should be explained in the godoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!


// Otherwise, signal to the peerTerminationWatcher that the peer startup
// was successful, and to begin watching the peer's wait group.
close(ready)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for doing this here, and not as the last thing in this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be done later, as at this point even a Disconnect would unblock the select in WaitForDisconnect. I chose to put it here for clarity, and to stop selecting as soon as possible to avoid futex inflation

@cfromknecht cfromknecht force-pushed the async-peer-start branch 3 times, most recently from 90f5fb2 to b90365f Compare August 2, 2018 00:35
halseth
halseth previously approved these changes Aug 2, 2018
wpaulino
wpaulino previously approved these changes Aug 2, 2018
Copy link
Contributor

@wpaulino wpaulino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ⚡️

@cfromknecht cfromknecht added ready to merge LGTM'd, may need rebase to address conflicts, anyone can merge and removed needs review PR needs review by regular contributors labels Aug 3, 2018
@cfromknecht cfromknecht dismissed stale reviews from wpaulino and halseth via 2965f67 August 3, 2018 03:32
@cfromknecht cfromknecht force-pushed the async-peer-start branch 2 times, most recently from 2965f67 to 329a69c Compare August 3, 2018 10:24
@cfromknecht
Copy link
Contributor Author

Been thinking about the ramifications of this PR, and I'd prefer to have #1551 (and transitively, #1668) before this lands in master. Marking this as blocked in the meantime

@Roasbeef
Copy link
Member

Can be rebased now that the two dependent PR's have been merged.

This commit adds additional synchronization logic to
WaitForDisconnect, such that it can be spawned before
Start has been executed by the server. Without
modification, the current version will return
immediately since no goroutines will have been
spawned.

To solve this, we modify WaitForDisconnect to block until:
 1) the peer is disconnected,
 2) the peer is successfully started,
before watching the waitgroup.

In the first case, the waitgroup will block until all
(if any) spawned goroutines have exited. Otherwise, if
the Start is successful, we can switch to watching the
waitgroup, knowing that waitgroup counter is positive.
This commit adds asynchronous starting of peers,
in order to avoid potential DOS vectors. Currently,
we block with the server's mutex while peers exchange
Init messages and perform other setup. Thus, a remote
peer that does not reply with an init message will
cause server to block for 15s per attempt.

We also modify the startup behavior to spawn
peerTerminationWatchers before starting the
peer itself, ensuring that a peer is properly
cleaned up if the initialization fails. Currently,
failing to start a peer does not execute the bulk
of the teardown logic, since it is not spawned
until after a successful Start occurs.
Sometimes when performing an initial sync, the remote
node isn't able to pull messages off the wire because
of long running tasks and queues are saturated. With
a shorter write timeout, we will give up trying to send
messages and teardown the connection, even though the
peer is still active.
@cfromknecht
Copy link
Contributor Author

rebased and 💚

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚜

// call to Start returns no error. Otherwise, if the peer fails to start,
// calling Disconnect will signal the quit channel and the method will not
// block, since no goroutines were spawned.
func (p *peer) WaitForDisconnect(ready chan struct{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dos/hardening Related to the resilience of LND against denial of service or other related attacks p2p Code related to the peer-to-peer behaviour P2 should be fixed if one has time ready to merge LGTM'd, may need rebase to address conflicts, anyone can merge safety General label for issues/PRs related to the safety of using the software server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants