-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reload pending outbound payments from ChannelMonitors on startup #1104
Reload pending outbound payments from ChannelMonitors on startup #1104
Conversation
e97930f
to
a0fe042
Compare
Codecov Report
@@ Coverage Diff @@
## main #1104 +/- ##
==========================================
+ Coverage 90.40% 90.47% +0.06%
==========================================
Files 68 68
Lines 34796 35176 +380
==========================================
+ Hits 31458 31826 +368
- Misses 3338 3350 +12
Continue to review full report at Codecov.
|
This is now based on #1108, which should go first. |
Depends on #1109 as well now. |
a0fe042
to
f821c93
Compare
Now includes at least a naive test of payment data reloading, which works 🎉 |
6e60c4f
to
a7cea3d
Compare
a7cea3d
to
7d23578
Compare
Rebased on upstream! |
7d23578
to
df8b291
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly wondering how we handle failures that were previously fulfilled
df8b291
to
42a854c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about any confusion around terminology. I don't have as a firm of grasp as I'd like to around some of this.
lightning/src/ln/channel.rs
Outdated
pub(super) struct RAAUpdates { | ||
pub commitment_update: Option<msgs::CommitmentUpdate>, | ||
pub to_forward_htlcs: Vec<(PendingHTLCInfo, u64)>, | ||
pub failed_htlcs: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>, | ||
pub finalized_claim_htlcs: Vec<HTLCSource>, | ||
pub monitor_update: ChannelMonitorUpdate, | ||
pub holding_cell_failed_htlcs: Vec<(HTLCSource, PaymentHash)>, | ||
} | ||
|
||
/// The return value of `monitor_updating_resotred` | ||
pub(super) struct MonitorRestoreUpdates { | ||
pub raa: Option<msgs::RevokeAndACK>, | ||
pub commitment_update: Option<msgs::CommitmentUpdate>, | ||
pub order: RAACommitmentOrder, | ||
pub forwards: Vec<(PendingHTLCInfo, u64)>, | ||
pub failures: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>, | ||
pub finalized_claims: Vec<HTLCSource>, | ||
pub funding_broadcastable: Option<Transaction>, | ||
pub funding_locked: Option<msgs::FundingLocked>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where it makes sense, could we use consistent field names between these two structs?
lightning/src/ln/channel.rs
Outdated
@@ -2899,8 +2927,14 @@ impl<Signer: Sign> Channel<Signer> { | |||
} | |||
self.monitor_pending_forwards.append(&mut to_forward_infos); | |||
self.monitor_pending_failures.append(&mut revoked_htlcs); | |||
self.monitor_pending_finalized_fulfills.append(&mut finalized_claim_htlcs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of terminology, is there a difference between a "fulfill" and a "claim"? Does it depend on the context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It mostly just depends on when the code was written :p
commitment_update: Some(commitment_update), | ||
finalized_claim_htlcs, | ||
to_forward_htlcs: to_forward_infos, | ||
failed_htlcs: revoked_htlcs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question wrt "failed" and "revoked". Just want to make sure I understand if it is context dependent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer, there really isn't.
lightning/src/ln/channelmanager.rs
Outdated
if let hash_map::Entry::Occupied(payment) = &payment_entry { | ||
if !payment.get().is_retryable() { | ||
return Err(APIError::RouteError { | ||
err: "Payment already completed" | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me why this shouldn't be if payment.get().is_complete() {
as indicated by the error message. Do we want to hit this for legacy payments, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yes, we probably should hit this for legacy payments. I mean its unreachable for legacy payments, I believe (retry_payment
refuses to call send_payment_internal
at all), but nothing wrong with an extra check here.
lightning/src/ln/channelmanager.rs
Outdated
/// When a pending payment completes, we continue tracking it until all pendings HTLCs have | ||
/// been resolved. This ensures we don't look up pending payments in ChannelMonitors on restart | ||
/// and add a pending payment that was already completed. | ||
Fulfilled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terminology question related to my earlier comment: is there a difference between "completes" and "fulfilled" as it pertains to the first sentence in the docs and the variant name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I just missed the docs in the rename - once we get one HTLC fulfillment we consider the payment fulfilled.
Pushed a fix for the lockorder comment, leaving other stuff until we settle on naming. |
2aef42c
to
ca6ae4d
Compare
ca6ae4d
to
a050236
Compare
Ok, I believe I've addressed all feedback and renamed things to be more clear. |
a050236
to
f47acdb
Compare
// We only rebuild the pending payments map if we were most recently serialized by | ||
// 0.0.102+ | ||
for (_, monitor) in args.channel_monitors { | ||
if by_id.get(&monitor.get_funding_txo().0.to_channel_id()).is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't see why we don't load pending outbound payments for open channels as well. Couldn't we still have a situation where the ChannelManager
doesn't get persisted after initiating an outbound payment, but the ChannelMonitor
still knows about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The channel monitor knowing about a payment that the ChannelManager does not implies by definition that the ChannelMonitor has some newer state than the ChannelManager. It's possible I missed some edge case but it feels pretty straightforward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right 🤦♀️
Fwiw, when I add this diff, it gets hit a fair amount of times and all tests pass:
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index af8a65b9..5d7ee3a5 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -5939,7 +5939,23 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
}
}
}
- }
+ } else {
+ for (htlc_source, htlc) in monitor.get_pending_outbound_htlcs() {
+ if let HTLCSource::OutboundRoute { payment_id, session_priv, path, payment_secret, .. } = htlc_source {
+ let path_amt = path.last().unwrap().fee_msat;
+ let mut session_priv_bytes = [0; 32];
+ session_priv_bytes[..].copy_from_slice(&session_priv[..]);
+ match pending_outbound_payments.as_mut().unwrap().entry(payment_id) {
+ hash_map::Entry::Occupied(mut entry) => {
+ assert!(!entry.get_mut().insert(session_priv_bytes, path_amt));
+ },
+ hash_map::Entry::Vacant(entry) => {
+ panic!("no entry for this payment");
+ }
+ }
+ }
+ }
+}
}
}
which is reassuring as well
f47acdb
to
4687624
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think I'm ACK mod outstanding feedback. Might give it one more glance in the morning. Thanks for the additional test coverage!
// We only rebuild the pending payments map if we were most recently serialized by | ||
// 0.0.102+ | ||
for (_, monitor) in args.channel_monitors { | ||
if by_id.get(&monitor.get_funding_txo().0.to_channel_id()).is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right 🤦♀️
Fwiw, when I add this diff, it gets hit a fair amount of times and all tests pass:
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index af8a65b9..5d7ee3a5 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -5939,7 +5939,23 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
}
}
}
- }
+ } else {
+ for (htlc_source, htlc) in monitor.get_pending_outbound_htlcs() {
+ if let HTLCSource::OutboundRoute { payment_id, session_priv, path, payment_secret, .. } = htlc_source {
+ let path_amt = path.last().unwrap().fee_msat;
+ let mut session_priv_bytes = [0; 32];
+ session_priv_bytes[..].copy_from_slice(&session_priv[..]);
+ match pending_outbound_payments.as_mut().unwrap().entry(payment_id) {
+ hash_map::Entry::Occupied(mut entry) => {
+ assert!(!entry.get_mut().insert(session_priv_bytes, path_amt));
+ },
+ hash_map::Entry::Vacant(entry) => {
+ panic!("no entry for this payment");
+ }
+ }
+ }
+ }
+}
}
}
which is reassuring as well
lightning/src/ln/payment_tests.rs
Outdated
@@ -266,3 +274,410 @@ fn no_pending_leak_on_initial_send_failure() { | |||
|
|||
assert!(!nodes[0].node.has_pending_payments()); | |||
} | |||
|
|||
#[test] | |||
fn retry_with_no_persist() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the new version of this test passes with the code that had the find_map
bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Yep, added even more coverage :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also pretty close on an ACK after remaining feedback is addressed.
@valentinewallace Nice work on finding the test gaps. I was staring at retry_with_no_persist
trying to figure out if the bug was being exercised. Mutation testing FTW lol.
I wonder if there's any improvements we can make to our testing strategy to make reviewing a bit easier. Some of the tests are pretty lengthy/verbose but I'm not sure if they can be broken up into smaller tests and still exercise the behavior.
@@ -3097,11 +3108,16 @@ impl<Signer: Sign> Channel<Signer> { | |||
/// which failed. The messages which were generated from that call which generated the | |||
/// monitor update failure must *not* have been sent to the remote end, and must instead | |||
/// have been dropped. They will be regenerated when monitor_updating_restored is called. | |||
pub fn monitor_update_failed(&mut self, resend_raa: bool, resend_commitment: bool, mut pending_forwards: Vec<(PendingHTLCInfo, u64)>, mut pending_fails: Vec<(HTLCSource, PaymentHash, HTLCFailReason)>) { | |||
pub fn monitor_update_failed(&mut self, resend_raa: bool, resend_commitment: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move these parameters down a line, aligned with the other parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a small function that doesn't seem to look any better to my eye? Feels very "rustfmt wants your code to use as many lines as possible" to me :).
// Now reload nodes[1]... | ||
persister = test_utils::TestPersister::new(); | ||
let keys_manager = &chanmon_cfgs[1].keys_manager; | ||
new_chain_monitor = test_utils::TestChainMonitor::new(Some(nodes[1].chain_source), nodes[1].tx_broadcaster.clone(), nodes[1].logger, node_cfgs[1].fee_estimator, &persister, keys_manager); | ||
nodes[1].chain_monitor = &new_chain_monitor; | ||
let mut chan_0_monitor_read = &chan_0_monitor_serialized.0[..]; | ||
let (_, mut chan_0_monitor) = <(BlockHash, ChannelMonitor<EnforcingSigner>)>::read( | ||
&mut chan_0_monitor_read, keys_manager).unwrap(); | ||
assert!(chan_0_monitor_read.is_empty()); | ||
|
||
let (_, nodes_1_deserialized_tmp) = { | ||
let mut channel_monitors = HashMap::new(); | ||
channel_monitors.insert(chan_0_monitor.get_funding_txo().0, &mut chan_0_monitor); | ||
<(BlockHash, ChannelManager<EnforcingSigner, &test_utils::TestChainMonitor, &test_utils::TestBroadcaster, &test_utils::TestKeysInterface, &test_utils::TestFeeEstimator, &test_utils::TestLogger>)> | ||
::read(&mut io::Cursor::new(&chan_manager_serialized.0[..]), ChannelManagerReadArgs { | ||
default_config: Default::default(), | ||
keys_manager, | ||
fee_estimator: node_cfgs[1].fee_estimator, | ||
chain_monitor: nodes[1].chain_monitor, | ||
tx_broadcaster: nodes[1].tx_broadcaster.clone(), | ||
logger: nodes[1].logger, | ||
channel_monitors, | ||
}).unwrap() | ||
}; | ||
nodes_1_deserialized = nodes_1_deserialized_tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty verbose and looks repeated from retry_with_no_persist
. Wondering if we could make a utility function for reloading so the tests are easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, its repeated in quite a few tests. I believe we could may be macro it, but it requires a few variables declared at top of function scope. I'll look at it as a followup, its not a new problem.
4687624
to
4348833
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
b67f15c
to
d9a85b3
Compare
Pushed yet more test coverage, I believe all outstanding comments have been addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK after squash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK d9a85b3
This substantially improves readability at the callsite and in the function.
This improves readability at the callsite and in the function.
This allows us to read a `HashMap` that has values which may be skipped if they are some backwards-compatibility type. We also take this opportunity to fail deserialization if keys are duplicated.
When an HTLC has been failed, we track it up until the point there exists no broadcastable commitment transaction which has the HTLC present, at which point Channel returns the HTLCSource back to the ChannelManager, which fails the HTLC backwards appropriately. When an HTLC is fulfilled, however, we fulfill on the backwards path immediately. This is great for claiming upstream HTLCs, but when we want to track pending payments, we need to ensure we can check with ChannelMonitor data to rebuild pending payments. In order to do so, we need an event similar to the HTLC failure event, but for fulfills instead. Specifically, if we force-close a channel, we remove its off-chain `Channel` object entirely, at which point, on reload, we may notice HTLC(s) which are not present in our pending payments map (as they may have received a payment preimage, but not fully committed to it). Thus, we'd conclude we still have a retryable payment, which is untrue. This commit does so, informing the ChannelManager via a new return element where appropriate of the HTLCSource corresponding to the failed HTLC.
In the next commit, we will reload lost pending payments from ChannelMonitors during restart. However, in order to avoid re-adding pending payments which have already been fulfilled, we must ensure that we do not fully remove pending payments until all HTLCs for the payment have been fully removed from their ChannelMonitors. We do so here, introducing a new PendingOutboundPayment variant called `Completed` which only tracks the set of pending HTLCs.
If we go to send a payment, add the HTLC(s) to the channel(s), commit the ChannelMonitor updates to disk, and then crash, we'll come back up with no pending payments but HTLC(s) ready to be claim/failed. This makes it rather impractical to write a payment sender/retryer, as you cannot guarantee atomicity - you cannot guarantee you'll have retry data persisted even if the HTLC(s) are actually pending. Because ChannelMonitors are *the* atomically-persisted data in LDK, we lean on their current HTLC data to figure out what HTLC(s) are a part of an outbound payment, rebuilding the pending payments list on reload.
test_dup_htlc_onchain_fails_on_reload is now more of a payment_test than a functional_test, testing for handling of pending payments.
Peers probably shouldn't do this, but if they want to give us free money, we should take it and not generate any spurious events.
d9a85b3
to
7af5d12
Compare
Squashed without diff, will land after CI 🎉
|
When we add payment retries, we need some kind of on-disk consistency story. I think a compelling one is described, roughly at [1] - make the retryer persist first, before calling into the ChannelManager, then when reloading we either have a pending payment there or its pending nowhere. Once the send_payment method returns, we'll either have a pending payment persisted in both the retryer and a ChannelMonitor, or in neither. Note that we may not have anything in a ChannelManager until much later, preventing us from retrying the payment if we crash.
In such a case, we'll force-close the relevant channel (its ChannelMonitor has run ahead of the manager), but the actual HTLC will still be pending, just not present in ChannelManager at all. Ideally, we'd then be able to retry it over a whole new path.
This PR addresses this situation by having ChannelManager examine the available ChannelMonitors at startup, rebuilding pending payment entries on the basis of what's in its ChannelMonitors.
Note that we aren't particularly concerned with the payment resolution end of things - if ChannelManager thinks a payment is still pending when it has been fully resolved, we'll eventually consider it sent and generate a fresh PaymentSent when the ChannelMonitor resolves the current state.
This should fix #1102, but generally needs lots more tests (its not tested at all, aside from a hacky thing at the end to compare the loaded-from-disk and the rebuilt set of payments, which passes most tests).
[1] #1059 (comment)