Fix bug failing CS-RAA resend order on pending commitment signatures #3149
Conversation
Codecov Report

@@            Coverage Diff             @@
##              main     #3149     +/-  ##
==========================================
+ Coverage    89.81%    90.60%    +0.79%
==========================================
  Files          121       121
  Lines        99314    105075     +5761
  Branches     99314    105075     +5761
==========================================
+ Hits         89195     95205     +6010
+ Misses        7514      7313      -201
+ Partials      2605      2557       -48
Force-pushed from fe7afcf to 536b6b4.
Thanks! Good test, too, though I need to go look at the test coverage changes. I assume this was hit in prod? We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.
lightning/src/ln/channel.rs (outdated diff context)
@@ -5475,10 +5476,6 @@ impl<SP: Deref> Channel<SP> where
		&self.context.channel_id(), if update_fee.is_some() { " update_fee," } else { "" },
		update_add_htlcs.len(), update_fulfill_htlcs.len(), update_fail_htlcs.len(), update_fail_malformed_htlcs.len());
	let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) {
		if self.context.signer_pending_commitment_update {
			log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update");
			self.context.signer_pending_commitment_update = false;
Hmmmm, I'm a bit skeptical of this. get_last_commitment_update_for_send is called in:

- monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

- channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

In any case, we should almost certainly add a test for the two above cases.
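To make the hazard in the first bullet concrete, here is a toy, self-contained simulation. All names are hypothetical stand-ins rather than LDK's actual Channel code, and, as the reply below notes, the monitor-restore variant turned out not to trigger in practice (the pending flag is never set before the monitor update completes); the sketch only illustrates why leaving the flag set across the restore path would be risky.

```rust
// Toy model (hypothetical names, not LDK internals) of the hypothesized race:
// leaving signer_pending_commitment_update set after the monitor-restore path has
// already sent the CS would make a later signer_maybe_unblocked send it again.

#[derive(Default)]
struct ChannelSim {
    signer_pending_commitment_update: bool,
    commitment_signed_sends: u32,
}

impl ChannelSim {
    // (a)/(b): an update needs signing and the async signer has not completed yet.
    fn update_needs_signature(&mut self) {
        self.signer_pending_commitment_update = true;
    }

    // (c): monitor persistence completes; we regenerate and send the CS.
    // If the flag is not cleared here, step (d) re-sends the same CS.
    fn monitor_updating_restored(&mut self, clear_flag: bool) {
        self.commitment_signed_sends += 1;
        if clear_flag {
            self.signer_pending_commitment_update = false;
        }
    }

    // (d): the user signals that the signer is no longer blocked.
    fn signer_maybe_unblocked(&mut self) {
        if self.signer_pending_commitment_update {
            self.commitment_signed_sends += 1;
            self.signer_pending_commitment_update = false;
        }
    }
}

fn main() {
    // Flag left set across the restore path: the same CS goes out twice and the
    // peer may force-close on us.
    let mut chan = ChannelSim::default();
    chan.update_needs_signature();
    chan.monitor_updating_restored(false);
    chan.signer_maybe_unblocked();
    assert_eq!(chan.commitment_signed_sends, 2);

    // Flag cleared when the CS is actually sent: only one send.
    let mut chan = ChannelSim::default();
    chan.update_needs_signature();
    chan.monitor_updating_restored(true);
    chan.signer_maybe_unblocked();
    assert_eq!(chan.commitment_signed_sends, 1);
}
```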
> monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.

I added a test for this, and I was expecting it to fail and need changes, but it passed. I realized I think it won't double-send because before the monitor update completes we never call the signer method, so the pending flag is never set. Even though it works, I think I'll change it back: I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.

> channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.

Great catch! Yeah, this fails with my change. Changing it back!
> I assume this was hit in prod?

Yep, we've had a couple of related issues here in the past.

> We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.

SGTM!
Force-pushed from 536b6b4 to 119b703.
fn do_test_async_commitment_signature_ordering(monitor_update_failure: bool,
	enable_signer_before_monitor_completion: bool, enable_signer_before_reestablish: bool) {
After the latest additional tests this helper is a bit unwieldy; I'm not sure of the best way to address it. I could copy this test 4 times and get rid of the conditionals, but that adds a lot of filler...
Y'know, actually I realized that for what these are testing it isn't necessary to include them in this test (they don't care about the CS-RAA ordering, just that the CS signer-pending flag is cleared at the right time). I'll just move them to their own separate tests tomorrow.
Okay, fixed this: moved them into the peer disconnection test.
Force-pushed from ca37547 to 221ef18.
LGTM
	funding_signed,
	channel_ready,
	order: self.context.resend_order.clone(),
Nit: clone() is not needed on order
Suggested change:
-	order: self.context.resend_order.clone(),
+	order: self.context.resend_order,
Hmm, I think it actually does require clone, since RAACommitmentOrder doesn't implement Copy?
OK. RAACommitmentOrder is a primitive enum, but it indeed does not have the Copy trait. I tried it without and it worked, but to be safe for future changes, either the clone is needed or the enum should be marked explicitly Copy. It's fine with the clone.
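For background on why the clone is needed, here is a minimal, self-contained sketch. The type names are hypothetical stand-ins (not LDK's actual definitions of RAACommitmentOrder or the reestablish response struct): moving an enum field out of a struct that sits behind a shared reference only compiles if the enum is Copy; otherwise it must be cloned.

```rust
// Hypothetical stand-in for an enum that derives Clone but not Copy.
#[derive(Clone, Debug, PartialEq)]
enum RaaCommitmentOrder {
    CommitmentFirst,
    RevokeAndAckFirst,
}

// Hypothetical stand-ins for the channel context and the response being built.
struct Context {
    resend_order: RaaCommitmentOrder,
}

struct ReestablishResponses {
    order: RaaCommitmentOrder,
}

fn build_responses(ctx: &Context) -> ReestablishResponses {
    ReestablishResponses {
        // `order: ctx.resend_order` would not compile here: it tries to move the
        // field out from behind a shared reference. Either clone it, or derive
        // Copy on the enum so the field is copied implicitly.
        order: ctx.resend_order.clone(),
    }
}

fn main() {
    let ctx = Context { resend_order: RaaCommitmentOrder::RevokeAndAckFirst };
    let resp = build_responses(&ctx);
    assert_eq!(resp.order, RaaCommitmentOrder::RevokeAndAckFirst);
}
```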
LGTM, feel free to squash the fixups.
let bs_second_commitment_signed = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id());
check_added_monitors!(nodes[1], 1);

// The rest of this is boilerplate for resolving the previous state.
At least some of this may be replaceable with one of the inscrutable commitment_signed_dance variants.
I feared this day would come...

I tried my best, but didn't get very far. It seems this is a little different from the normal exchange, so I'm not sure how much more time I should spend trying:

    nodes0          nodes1
    uah  ->
    cs   ->
    raa  ->
         <-  raa
         <-  cs
    cs   ->
    raa  ->
         <-  raa
I think the first few messages (after the uah) could be, but if you fought with it a bit I'm not gonna nitpick.
Force-pushed from 221ef18 to 315193d.
Merging as the only changes since @optout21's "LGTM" are addressing his feedback.
Across disconnects we may end up in a situation where we need to send a commitment_signed and then a revoke_and_ack. We need to make sure that if the signer is pending for the CS but not the RAA, we don't screw up the order by sending the RAA first. We defer sending the RAA by setting the flag signer_pending_revoke_and_ack, which will lead to us generating an RAA upon signer_unblocked.

We test this both for the case where we send messages after a channel reestablish, and for the case where a channel is restored after a monitor update is persisted asynchronously.