Fix P2P Deadlocks #957

Merged: TheBlueMatt merged 3 commits into lightningdevkit:main from 2021-06-p2p-no-deadlock on Jun 23, 2021

Conversation

TheBlueMatt (Collaborator)

Based on #948, this drops the practically-impossible requirements set by the peer handler; see the first commit for more details. As a side effect, this should somewhat fix the immediate issue in #951, while also fixing much rarer races.

TheBlueMatt changed the title from "2021 06 p2p no deadlock" to "Fix P2P Deadlocks" on Jun 17, 2021
codecov bot commented on Jun 17, 2021

Codecov Report

Merging #957 (4cc6682) into main (2f6205b) will increase coverage by 0.03%.
The diff coverage is 84.61%.

❗ Current head 4cc6682 differs from pull request most recent head 05157b1. Consider uploading reports for the commit 05157b1 to get more accurate results

@@            Coverage Diff             @@
##             main     #957      +/-   ##
==========================================
+ Coverage   90.62%   90.66%   +0.03%     
==========================================
  Files          60       60              
  Lines       30409    30407       -2     
==========================================
+ Hits        27558    27568      +10     
+ Misses       2851     2839      -12     
Impacted Files                        Coverage Δ
lightning/src/ln/peer_handler.rs      46.50% <80.00%> (+1.05%) ⬆️
lightning-net-tokio/src/lib.rs        77.35% <87.50%> (+1.43%) ⬆️
lightning/src/ln/functional_tests.rs  97.16% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2f6205b...05157b1.

devrandom (Member)

the first commit message says:

but only so far as the socket descriptor is different from any later socket descriptor (ie until the file descriptor is re-used)

and SocketDescriptor says:

/// You probably want to just extend an int and put a file descriptor in a struct and implement

but we should actually encourage the user to use a unique u64, like the lightning-net-tokio implementation of the descriptor does; i.e., discourage the use of a potentially non-unique ID such as an fd.

We might even go as far as having PeerManager provide opaque unique IDs that must be stored in the descriptors, but maybe that's overkill.
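
For illustration, a minimal sketch of that unique-ID approach (this mirrors the idea behind lightning-net-tokio's descriptor, not its actual code; UniqueDescriptor and NEXT_CONNECTION_ID are hypothetical names):

use std::sync::atomic::{AtomicU64, Ordering};

// Each connection takes an ID from a process-global counter, so no two
// descriptors ever compare equal, even if the OS reuses the underlying fd.
static NEXT_CONNECTION_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Clone, PartialEq, Eq, Hash)]
struct UniqueDescriptor {
    id: u64, // globally unique for the life of the process
    // (a real implementation would also hold a handle for writing to the socket)
}

impl UniqueDescriptor {
    fn new() -> Self {
        UniqueDescriptor { id: NEXT_CONNECTION_ID.fetch_add(1, Ordering::AcqRel) }
    }
}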

TheBlueMatt (Collaborator, Author)

but we should actually encourage the user to use a unique u64, like the lightning-net-tokio implementation of the descriptor does. i.e. discourage the use of a potentially non-unique ID such as an fd.

Hmm, I mean if you're writing it in C you can just use shutdown instead of close, but maybe that should be made clear in the comment instead of just saying "don't use close"?

There's definitely nothing wrong with using an fd, and it makes writing to this API super-duper trivial in C-like languages; the only reason we don't use the fd in tokio is that it's kinda awkward to actually fetch the fd from the TcpStream.
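
For concreteness, a hedged Rust sketch of that fd-based approach (not code from this PR; it uses the libc crate, assumes a non-blocking Unix socket, and elides error handling and the resume_read plumbing; the trait signatures are the ones quoted in the diff later in this thread):

use lightning::ln::peer_handler::SocketDescriptor;
use std::os::unix::io::RawFd;

// The descriptor *is* the raw fd, so Eq/Hash fall out of integer equality.
// Sound only while the fd cannot be reused before
// PeerManager::socket_disconnected runs, hence shutdown() and not close().
#[derive(Clone, PartialEq, Eq, Hash)]
struct FdDescriptor(RawFd);

impl SocketDescriptor for FdDescriptor {
    fn send_data(&mut self, data: &[u8], _resume_read: bool) -> usize {
        // Report how many bytes the kernel accepted; returning less than
        // data.len() obligates the caller to later call
        // PeerManager::write_buffer_space_avail.
        let ret = unsafe {
            libc::send(self.0, data.as_ptr() as *const libc::c_void, data.len(), 0)
        };
        if ret < 0 { 0 } else { ret as usize }
    }
    fn disconnect_socket(&mut self) {
        // shutdown() wakes any pending reader but keeps the fd number
        // allocated, preserving descriptor uniqueness until
        // PeerManager::socket_disconnected has been called.
        unsafe { libc::shutdown(self.0, libc::SHUT_RDWR); }
    }
}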

TheBlueMatt force-pushed the 2021-06-p2p-no-deadlock branch from 455a733 to 9e49062 on June 18, 2021
TheBlueMatt (Collaborator, Author)

Rebased on upstream and added a commit to not recommend file descriptor use directly while still calling out the requirements if you do.

TheBlueMatt added this to the 0.0.99 milestone on Jun 20, 2021
From the first commit message:

The only practical way to meet this requirement is to block
disconnect_socket until any pending events are fully processed,
leading to this trivial deadlock:

 * Thread 1: select() woken up due to a read event
 * Thread 2: Event processing causes a disconnect_socket call to
             fire while the PeerManager lock is held.
 * Thread 2: disconnect_socket blocks until the read event in
             thread 1 completes.
 * Thread 1: bytes are read from the socket and
             PeerManager::read_event is called, waiting on the lock
             still held by thread 2.

There isn't a trivial way to address this deadlock without simply
making the final read_event call return immediately, which we do
here. This also implies that users can freely call event methods
after disconnect_socket, but only so far as the socket descriptor
is different from any later socket descriptor (ie until the file
descriptor is re-used).
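
To make the cycle concrete, here is a minimal standalone sketch (not LDK code: the two mutexes merely stand in for the internal PeerManager lock and the in-flight read event). Running it hangs, by construction, in exactly the pattern described above:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let peer_manager = Arc::new(Mutex::new(())); // stand-in: PeerManager's lock
    let read_event = Arc::new(Mutex::new(()));   // stand-in: the in-flight read

    let pm = Arc::clone(&peer_manager);
    let re = Arc::clone(&read_event);
    // Thread 1: select() woke up, a read event is in flight.
    let t1 = thread::spawn(move || {
        let _in_flight = re.lock().unwrap();      // read begins
        thread::sleep(Duration::from_millis(50)); // bytes trickle in...
        let _pm = pm.lock().unwrap();             // read_event: blocks on thread 2
    });
    // Thread 2: event processing holds the PeerManager lock, then
    // disconnect_socket blocks until the read event completes.
    let t2 = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));    // let thread 1 start reading
        let _pm = peer_manager.lock().unwrap();      // event processing holds the lock
        let _wait = read_event.lock().unwrap();      // disconnect_socket: blocks on thread 1
    });
    t1.join().unwrap(); // never returns: a classic ABBA deadlock
    t2.join().unwrap();
}
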
TheBlueMatt force-pushed the 2021-06-p2p-no-deadlock branch from 9e49062 to c5322f7 on June 21, 2021
TheBlueMatt (Collaborator, Author)

Rebased now that #948 landed.

jkczyz (Contributor) left a comment:

Nice clean-up on the documentation! Left a bunch of nits on that commit. As the documentation grows in size, it would be advantageous to add section headings to break it up for the user, making it easier to scan.

/// be careful to ensure you don't have races whereby you might register a new connection with an
/// fd which is the same as a previous one which has yet to be removed via
/// PeerManager::socket_disconnected().
/// If applicable in your language, you probably want to just extend an int and put a file
jkczyz (Contributor):

s/probably want to just/can

TheBlueMatt (Collaborator, Author):

This doc was rewritten in a later commit.

Comment on lines 181 to 182
/// If applicable in your language, you probably want to just extend an int and put a file
/// descriptor in a struct and implement this trait on it. Note, of course, that if you do so and
jkczyz (Contributor):

s/and put ... and implement/by wrapping ... and implementing

TheBlueMatt (Collaborator, Author):

This doc was rewritten in a later commit.

TheBlueMatt force-pushed the 2021-06-p2p-no-deadlock branch from 4cc6682 to 8e2c6be on June 23, 2021
From the follow-up commit message:

See the previous commit for more information. There are various typo and
grammatical fixes here, as well as concrete updates to correctness.
TheBlueMatt force-pushed the 2021-06-p2p-no-deadlock branch from 8e2c6be to 05157b1 on June 23, 2021
TheBlueMatt (Collaborator, Author)

Squashed with no diff:

$ git diff-tree -U1 4cc66823 05157b17
$

Diff from Val's ACK (all pretty trivial grammar and wording changes, so will merge after CI passes):

$ git diff-tree -U1 c5322f7 05157b17
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index d438ea4e..5546227e 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -162,3 +162,3 @@ pub struct MessageHandler<CM: Deref, RM: Deref> where
 	/// A message handler which handles messages specific to channels. Usually this is just a
-	/// [`ChannelManager`] object or a [`ErroringMessageHandler`].
+	/// [`ChannelManager`] object or an [`ErroringMessageHandler`].
 	///
@@ -181,7 +181,7 @@ pub struct MessageHandler<CM: Deref, RM: Deref> where
 /// Two descriptors may compare equal (by [`cmp::Eq`] and [`hash::Hash`]) as long as the original
-/// has been disconnected, the [`PeerManager`] informed of the disconnection (either by it having
-/// triggered the disconnection or a call to [`PeerManager::socket_disconnected`]), and no further
-/// calls to the [`PeerManager`] related to the original socket occur. This allows you to use a
-/// file descriptor for your SocketDescriptor directly, however for simplicity you may wish to
-/// simply use another value which is guaranteed to be globally unique instead.
+/// has been disconnected, the [`PeerManager`] has been informed of the disconnection (either by it
+/// having triggered the disconnection or a call to [`PeerManager::socket_disconnected`]), and no
+/// further calls to the [`PeerManager`] related to the original socket occur. This allows you to
+/// use a file descriptor for your SocketDescriptor directly, however for simplicity you may wish
+/// to simply use another value which is guaranteed to be globally unique instead.
 pub trait SocketDescriptor : cmp::Eq + hash::Hash + Clone {
@@ -193,5 +193,5 @@ pub trait SocketDescriptor : cmp::Eq + hash::Hash + Clone {
 	///
-	/// If the returned size is smaller than data.len(), a
+	/// If the returned size is smaller than `data.len()`, a
 	/// [`PeerManager::write_buffer_space_avail`] call must be made the next time more data can be
-	/// written. Additionally, until a send_data event completes fully, no further
+	/// written. Additionally, until a `send_data` event completes fully, no further
 	/// [`PeerManager::read_event`] calls should be made for the same peer! Because this is to
@@ -202,7 +202,9 @@ pub trait SocketDescriptor : cmp::Eq + hash::Hash + Clone {
 	/// (indicating that read events should be paused to prevent DoS in the send buffer),
-	/// resume_read may be set indicating that read events on this descriptor should resume. A
-	/// resume_read of false does *not* imply on its own that further read events should be paused.
+	/// `resume_read` may be set indicating that read events on this descriptor should resume. A
+	/// `resume_read` of false carries no meaning, and should not cause any action.
 	fn send_data(&mut self, data: &[u8], resume_read: bool) -> usize;
 	/// Disconnect the socket pointed to by this SocketDescriptor.
-	/// No [`PeerManager::socket_disconnected`] call need be generated as a result of this call.
+	///
+	/// You do *not* need to call [`PeerManager::socket_disconnected`] with this socket after this
+	/// call (doing so is a noop).
 	fn disconnect_socket(&mut self);
@@ -321,3 +323,3 @@ pub type SimpleRefPeerManager<'a, 'b, 'c, 'd, 'e, 'f, 'g, SD, M, T, F, C, L> = P
 /// A PeerManager manages a set of peers, described by their [`SocketDescriptor`] and marshalls
-/// socket events into messages which it passes on to its MessageHandlers.
+/// socket events into messages which it passes on to its [`MessageHandler`].
 ///
@@ -328,5 +330,5 @@ pub type SimpleRefPeerManager<'a, 'b, 'c, 'd, 'e, 'f, 'g, SD, M, T, F, C, L> = P
 /// [`ChannelMessageHandler`], likely doing message processing in-line. Thus, the primary form of
-/// parallelism in Rust-Lightning is parallelism in calls to [`read_event`]. Note, however, that
-/// calls to any [`PeerManager`] functions related to the same connection must occur only in
-/// serial, making new calls only after previous ones have returned.
+/// parallelism in Rust-Lightning is in calls to [`read_event`]. Note, however, that calls to any
+/// [`PeerManager`] functions related to the same connection must occur only in serial, making new
+/// calls only after previous ones have returned.
 ///
@@ -337,2 +339,4 @@ pub type SimpleRefPeerManager<'a, 'b, 'c, 'd, 'e, 'f, 'g, SD, M, T, F, C, L> = P
 /// you're using lightning-net-tokio.
+///
+/// [`read_event`]: PeerManager::read_event
 pub struct PeerManager<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> where
@@ -626,7 +630,7 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> PeerManager<D
 	///
-	/// Will most likely call [`send_data`] on the descriptor passed in (or the descriptor handed
-	/// into new_*\_connection) before returning. Thus, be very careful with reentrancy issues! The
-	/// invariants around calling [`write_buffer_space_avail`] in case a write did not fully
-	/// complete must still hold - be ready to call `[write_buffer_space_avail`] again if a write
-	/// call generated here isn't sufficient!
+	/// May call [`send_data`] on the descriptor passed in (or an equal descriptor) before
+	/// returning. Thus, be very careful with reentrancy issues! The invariants around calling
+	/// [`write_buffer_space_avail`] in case a write did not fully complete must still hold - be
+	/// ready to call `[write_buffer_space_avail`] again if a write call generated here isn't
+	/// sufficient!
 	///
@@ -656,7 +660,8 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> PeerManager<D
 	/// Will *not* call back into [`send_data`] on any descriptors to avoid reentrancy complexity.
-	/// Thus, however, you almost certainly want to call [`process_events`] after any read_event
-	/// to generate [`send_data`] calls to handle responses.
+	/// Thus, however, you should call [`process_events`] after any `read_event` to generate
+	/// [`send_data`] calls to handle responses.
 	///
-	/// If Ok(true) is returned, further read_events should not be triggered until a [`send_data`]
-	/// call on this descriptor has resume_read set (preventing DoS issues in the send buffer).
+	/// If `Ok(true)` is returned, further read_events should not be triggered until a
+	/// [`send_data`] call on this descriptor has `resume_read` set (preventing DoS issues in the
+	/// send buffer).
 	///
@@ -1111,9 +1116,9 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> PeerManager<D
 	/// response messages as well as messages generated by calls to handler functions directly (eg
-	/// functions like [`ChannelManager::process_pending_htlc_forward`] or [`send_payment`]).
+	/// functions like [`ChannelManager::process_pending_htlc_forwards`] or [`send_payment`]).
 	///
-	/// Will most likely call [`send_data`] on descriptors previously provided to
-	/// new_*\_connection. Thus, be very careful with reentrancy issues!
+	/// May call [`send_data`] on [`SocketDescriptor`]s. Thus, be very careful with reentrancy
+	/// issues!
 	///
 	/// [`send_payment`]: crate::ln::channelmanager::ChannelManager::send_payment
-	/// [`ChannelManager::process_pending_htlc_forward`]: crate::ln::channelmanager::ChannelManager::process_pending_htlc_forward
+	/// [`ChannelManager::process_pending_htlc_forwards`]: crate::ln::channelmanager::ChannelManager::process_pending_htlc_forwards
 	/// [`send_data`]: SocketDescriptor::send_data
@@ -1357,3 +1362,3 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> PeerManager<D
 	///
-	/// Set no_connection_possible to true to prevent any further connection with this peer,
+	/// Set `no_connection_possible` to true to prevent any further connection with this peer,
 	/// force-closing any channels we have with it.
@@ -1375,6 +1380,9 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> PeerManager<D
 	/// This function should be called roughly once every 30 seconds.
-	/// It will send pings to each peer and disconnect those which did not respond to the last round of pings.
+	/// It will send pings to each peer and disconnect those which did not respond to the last
+	/// round of pings.
 	///
-	/// Will most likely call [`send_data`] all descriptors previously provided to
-	/// new_*\_connection. Thus, be very careful with reentrancy issues!
+	/// May call [`send_data`] on all [`SocketDescriptor`]s. Thus, be very careful with reentrancy
+	/// issues!
+	///
+	/// [`send_data`]: SocketDescriptor::send_data
 	pub fn timer_tick_occurred(&self) {
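
Taken together, the invariants in the rewritten docs suggest a simple per-socket driving loop: feed bytes to read_event, pause reads on Ok(true), and call process_events afterward to flush responses. A hedged sketch follows (PeerManagerLike is a stand-in trait mirroring only the two PeerManager methods used, with a simplified error type; it is not part of LDK's API):

// Stand-in for the PeerManager surface exercised here; not an LDK trait.
trait PeerManagerLike<D> {
    /// Mirrors PeerManager::read_event: Ok(true) means pause further reads.
    fn read_event(&self, descriptor: &mut D, data: &[u8]) -> Result<bool, ()>;
    /// Mirrors PeerManager::process_events.
    fn process_events(&self);
}

/// Called whenever `descriptor`'s socket becomes readable. Returns false if
/// the connection should be torn down.
fn on_readable<D, PM: PeerManagerLike<D>>(
    pm: &PM, descriptor: &mut D, buf: &[u8], reads_paused: &mut bool,
) -> bool {
    if *reads_paused {
        // Per the docs: no read_event calls until a send_data call for this
        // descriptor has resume_read set.
        return true;
    }
    match pm.read_event(descriptor, buf) {
        // Ok(true): stop reading to apply backpressure (send-buffer DoS protection).
        Ok(pause) => *reads_paused = pause,
        // The peer misbehaved; disconnect it.
        Err(()) => return false,
    }
    // read_event never calls send_data itself, so responses are only
    // generated by this explicit process_events call.
    pm.process_events();
    true
}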

TheBlueMatt merged commit 073afbb into lightningdevkit:main on Jun 23, 2021