Stop relying on ChannelMonitor persistence after manager read #3322

Merged

Conversation

TheBlueMatt
Collaborator

When we discover we've only partially claimed an MPP HTLC during
ChannelManager reading, we need to add the payment preimage to
all other ChannelMonitors that were a part of the payment.

We previously did this with a direct call on the ChannelMonitor,
requiring users write the full ChannelMonitor to disk to ensure
that updated information made it.

This adds quite a bit of delay during initial startup - fully
resilvering each ChannelMonitor just to handle this one case is
incredibly excessive.

Instead, we rewrite the MPP claim replay logic to use only (new) data included in ChannelMonitors, which has the nice side-effect of teeing up future ChannelManager-non-persistence features, as well as making our PaymentClaimed event generation much more robust.
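
As a rough illustration of the new flow — the types and helper below are simplified stand-ins, not the actual LDK API — replaying partially-claimed MPP payments from monitor data at startup looks roughly like this:

```rust
// Simplified stand-ins for LDK types; a sketch of the idea only, not the real
// `ChannelManager`/`ChannelMonitor` interfaces.
use std::collections::{HashMap, HashSet};

type PaymentHash = [u8; 32];
type ChannelId = [u8; 32];

/// Hypothetical per-payment claim data stored durably in each claiming monitor.
struct StoredClaim {
    /// Every channel that carried a part of this MPP payment.
    mpp_part_channels: Vec<ChannelId>,
}

/// On startup, find MPP parts whose monitors never learned the preimage: those
/// are the (channel, payment) pairs that need a replayed `ChannelMonitorUpdate`
/// rather than a full re-persist of every monitor.
fn claims_to_replay(
    stored: &HashMap<ChannelId, HashMap<PaymentHash, StoredClaim>>,
) -> HashSet<(ChannelId, PaymentHash)> {
    let mut missing = HashSet::new();
    for claims in stored.values() {
        for (hash, claim) in claims {
            for part_chan in &claim.mpp_part_channels {
                let part_has_preimage = stored
                    .get(part_chan)
                    .map_or(false, |c| c.contains_key(hash));
                if !part_has_preimage {
                    missing.insert((*part_chan, *hash));
                }
            }
        }
    }
    missing
}
```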

@arik-so
Contributor

arik-so commented Sep 24, 2024

A bunch of tests are still trying to call provide_payment_preimage()

@TheBlueMatt
Collaborator Author

Ugh, sorry, fixed. Also rebased and fixed an issue introduced in #3303

@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch 2 times, most recently from 5437927 to b0fa756 on September 30, 2024 21:13

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 90.65657% with 37 lines in your changes missing coverage. Please review.

Project coverage is 89.66%. Comparing base (a661c92) to head (b0fa756).

Files with missing lines                Patch %   Lines
lightning/src/ln/channelmanager.rs      89.36%    23 Missing and 9 partials ⚠️
lightning/src/chain/channelmonitor.rs   96.00%    2 Missing ⚠️
lightning/src/ln/reload_tests.rs        88.88%    0 Missing and 2 partials ⚠️
lightning/src/ln/channel.rs             91.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3322      +/-   ##
==========================================
- Coverage   89.68%   89.66%   -0.02%     
==========================================
  Files         126      126              
  Lines      103168   103370     +202     
==========================================
+ Hits        92522    92686     +164     
- Misses       7934     7971      +37     
- Partials     2712     2713       +1     


@TheBlueMatt added this to the 0.1 milestone on Oct 1, 2024
@TheBlueMatt
Collaborator Author

Tagging 0.1 because of the first commit.

Contributor

@arik-so left a comment


Nit: fix "insetad" typo in commit message d44066b

.expect("Failed to get node_id for phantom node recipient");
receiver_node_id = phantom_pubkey;
break;
let (sources, claiming_payment) = {
Contributor

To compare the before and after for equivalence, the scroll distance is 5,000 lines of code. I wonder whether some of this could be moved into a separate file?

Collaborator Author

git show --color-moved is your friend here, I think :)

Collaborator Author

And, yea, obviously we should move parts of ChannelManager out, but I don't think this specific logic should be moved given its tie to other stuff in ChannelManager.

Contributor

@valentinewallace left a comment


Sorry for the delay! Still making my way through this.

return Err(DecodeError::InvalidValue);
},
hash_map::Entry::Vacant(entry) => {
entry.insert(payment_claim.claiming_payment);
Contributor

Adding a panic here hits a bunch of tests, which is good, but for some of them I'm not sure why or if we should be reinserting here. For example, we hit this in test_mpp_keysend on drop of the test Node. In that test, it seems like the preimage should've been pruned from the monitor already. Do you know why this is happening?

Collaborator Author

We drop preimages one update cycle after we need to. So if a channel completed a claim in a full normal cycle and is "at rest" it'll still have the preimage. Only after the next round of commitment updates will that preimage get dropped. This supports some weird setups where you might have multiple redundant copies of a ChannelMonitor, but it's also kinda nice. Still, it's definitely weird here: it means if a channel is "at rest" we'll always replay a claim event on startup.

With the claim IDs I think it's okay, and kinda nice anyway because it reduces the chances of errors on the client side.
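
As a rough illustration of why the claim IDs help here (purely hypothetical client-side handling, not an LDK API): if the claim event can be replayed on restart, a stable identifier lets the consumer make handling idempotent.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified claim event: a stable per-claim id lets the event
/// consumer drop replays of the same claim after a restart.
struct ClaimedEvent {
    claim_id: [u8; 32],
    amount_msat: u64,
}

#[derive(Default)]
struct ClaimTracker {
    seen: HashSet<[u8; 32]>,
}

impl ClaimTracker {
    fn handle(&mut self, ev: ClaimedEvent) {
        // `insert` returns false if the id was already recorded, i.e. this is a
        // replayed claim from a previous run; ignore it rather than erroring.
        if self.seen.insert(ev.claim_id) {
            println!("recorded claim of {} msat", ev.amount_msat);
        }
    }
}
```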

Contributor

That's useful info. I think it would be good to put some of that on ChannelMonitorImpl::payment_preimages since IIUC that's the reason we hold a Vec<PaymentClaimDetails> there, which implies allowing multiple separate payments to be pending claim for the same payment hash. I was a little confused why we allow duplicate payment hash payments there at first.

Collaborator Author

I guess that's a part of it? I wasn't really being so deliberate about it - generally we just try not to assume too much about payment hash reuse so it seemed weird to assume it can't happen here. Good to document either way.
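
For context, the shape under discussion is roughly the following (an assumption based on this thread, not the exact field definitions): preimages are keyed by payment hash, but the claim details are a Vec, so separate payments that happen to reuse a hash can both be pending claim.

```rust
use std::collections::HashMap;

// Simplified stand-ins for LDK's PaymentHash/PaymentPreimage newtypes.
type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

/// Hypothetical, trimmed-down version of the per-claim metadata added by this PR.
struct PaymentClaimDetails {
    amount_msat: u64,
}

/// Assumed rough shape of `ChannelMonitorImpl::payment_preimages` after this
/// change: one preimage per hash, plus a Vec of claim details because
/// payment-hash reuse across distinct payments is deliberately not ruled out.
struct MonitorPreimages {
    payment_preimages: HashMap<PaymentHash, (PaymentPreimage, Vec<PaymentClaimDetails>)>,
}
```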

Comment on lines +911 to +913
assert!(get_monitor!(nodes[3], chan_id_persisted).get_stored_preimages().contains_key(&payment_hash));
assert!(get_monitor!(nodes[3], chan_id_not_persisted).get_stored_preimages().contains_key(&payment_hash));

Contributor

Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed? It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.

Collaborator Author

Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed?

Done

It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.

Hmm, I checked, looks like ~75 tests fail if we do it...that might be one for a followup 😅

@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch 3 times, most recently from dabb898 to cd4f665 on October 23, 2024 15:57

`or_default` is generally less readable than writing out the thing
we're writing, as `Default` is opaque but explicit constructors
generally are not. Thus, we ignore the clippy lint (ideally we
could invert it and ban the use of `Default` in the crate entirely
but alas).
In aa09c33 we added a new secret
in `ChannelManager` with which to derive inbound `PaymentId`s. We
added read support for the new field, but forgot to add writing
support for it. Here we fix this oversight.
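
The class of bug being fixed, in a heavily simplified, hand-rolled form (the real code uses LDK's TLV serialization macros, and the field name here is only illustrative): an optional field was read on deserialization but never written on serialization, so it could never round-trip.

```rust
// Hypothetical, hand-rolled sketch of read/write symmetry for a newly added
// optional field; the real code uses LDK's TLV-based serialization macros.
struct ManagerState {
    // Newly added secret used to derive inbound payment ids (illustrative name).
    inbound_payment_id_secret: Option<[u8; 32]>,
}

impl ManagerState {
    fn write(&self, out: &mut Vec<u8>) {
        // The fix: this block was missing, so the field below was read back but
        // never persisted and always came back as None after a reload.
        match &self.inbound_payment_id_secret {
            Some(secret) => {
                out.push(1); // presence marker
                out.extend_from_slice(secret);
            }
            None => out.push(0),
        }
    }

    fn read(data: &[u8]) -> Option<ManagerState> {
        let (&marker, rest) = data.split_first()?;
        let inbound_payment_id_secret = if marker == 1 {
            Some(<[u8; 32]>::try_from(rest.get(..32)?).ok()?)
        } else {
            None
        };
        Some(ManagerState { inbound_payment_id_secret })
    }
}
```
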
@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch from aa8a0ad to 2fd87cf on October 23, 2024 16:15
@TheBlueMatt
Collaborator Author

Squashed fixups and rebased to tell clippy to shut up.

Comment on lines 1128 to 1166
struct HTLCClaimSource {
	counterparty_node_id: Option<PublicKey>,
	funding_txo: OutPoint,
	channel_id: ChannelId,
	htlc_id: u64,
}

impl From<&MPPClaimHTLCSource> for HTLCClaimSource {
	fn from(o: &MPPClaimHTLCSource) -> HTLCClaimSource {
		HTLCClaimSource {
			counterparty_node_id: Some(o.counterparty_node_id),
			funding_txo: o.funding_txo,
			channel_id: o.channel_id,
			htlc_id: o.htlc_id,
		}
	}
}

#[derive(Clone, Debug, PartialEq, Eq)]
struct MPPClaimHTLCSource {
	counterparty_node_id: PublicKey,
	funding_txo: OutPoint,
	channel_id: ChannelId,
	htlc_id: u64,
}
Contributor

Can we add some comments for these structs and clarify there why we need both of them?

Contributor

@valentinewallace left a comment


Took me a while to get up to speed on this patch, but I think this is basically there on my end now.

if let MonitorUpdateCompletionAction::FreeOtherChannelImmediately {
downstream_counterparty_node_id: node_id,
downstream_funding_outpoint: funding_outpoint,
downstream_funding_outpoint: _,
blocking_action: blocker, downstream_channel_id: channel_id,
} = action {
Contributor

I'm not sure if this is reachable, and it's pre-existing, but FYI we don't have coverage for getting a MonitorUpdateCompletionAction::FreeOtherChannelImmediately while during_init. If it's unreachable, adding a debug_assert could be useful.

Collaborator Author

I'm not entirely convinced it's unreachable (and certainly not that it won't become reachable) - on the read end, if an HTLC ends up in pending_claims_to_replay we'll call claim_funds_internal, which for relayed payments calls claim_funds_from_hop and can return FreeOtherChannelImmediately for duplicate claims. Indeed we should get test coverage, but that's not related to this PR :/.

@valentinewallace
Contributor

LGTM after squash.

When we started tracking which channels had MPP parts claimed
durably on-disk in their `ChannelMonitor`, we did so with a tuple.
This was fine in that it was only ever accessed in two places, but
as we will start tracking it through to the `ChannelMonitor`s
themselves in the coming commit(s), it is useful to have it in a
struct instead.
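
Roughly, the change being described (the struct is the `MPPClaimHTLCSource` quoted in the review above; the exact tuple contents are an assumption): a positional tuple of the claimed part's identifiers becomes a named struct that can later be serialized into the `ChannelMonitor`.

```rust
// Placeholder aliases standing in for the real bitcoin/LDK types.
type PublicKey = [u8; 33];
type OutPoint = ([u8; 32], u16);
type ChannelId = [u8; 32];

// Before (sketch): an unnamed tuple, fine while it was only touched in two places.
type ClaimedMPPPartTuple = (PublicKey, OutPoint, ChannelId, u64);

// After (sketch): the same data with named fields, ready to be threaded through
// to the `ChannelMonitor` in the following commits.
struct MPPClaimHTLCSource {
    counterparty_node_id: PublicKey,
    funding_txo: OutPoint,
    channel_id: ChannelId,
    htlc_id: u64,
}
```
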
When we claim an MPP payment, then crash before persisting all the
relevant `ChannelMonitor`s, we rely on the payment data being
available in the `ChannelManager` on restart to re-claim any parts
that haven't yet been claimed. This is fine as long as the
`ChannelManager` was persisted before the `PaymentClaimable` event
was processed, which is generally the case in our
`lightning-background-processor`, but may not be in other cases or
in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of
a payment are in the `ChannelMonitor`, allowing us to re-claim any
missing pieces without reference to any `ChannelManager` data.

Further, in order to properly generate a `PaymentClaimed` event
against the re-started claim, we have to store various payment
metadata with the HTLC list as well.

Here we take the first step, building a list of MPP parts and
metadata in `ChannelManager` and passing it through to
`ChannelMonitor` in the `ChannelMonitorUpdate`s.
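
A minimal sketch of the data being added (names loosely follow identifiers seen elsewhere in this PR such as `PaymentClaimDetails` and `MPPClaimHTLCSource`; the exact fields are assumptions): the claim-side `ChannelMonitorUpdate` now carries, alongside the preimage, the full list of MPP parts plus the metadata needed to later rebuild a `PaymentClaimed` event.

```rust
// Placeholder aliases standing in for the real bitcoin/LDK types.
type PublicKey = [u8; 33];
type OutPoint = ([u8; 32], u16);
type ChannelId = [u8; 32];
type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

/// One HTLC that formed part of the MPP payment (sketch).
struct MPPClaimHTLCSource {
    counterparty_node_id: PublicKey,
    funding_txo: OutPoint,
    channel_id: ChannelId,
    htlc_id: u64,
}

/// Sketch of the per-claim data attached to the claiming monitor updates:
/// enough to re-claim the other parts and regenerate the `PaymentClaimed`
/// event without consulting any `ChannelManager` state.
struct PaymentClaimDetails {
    mpp_parts: Vec<MPPClaimHTLCSource>,
    payment_hash: PaymentHash,
    amount_msat: u64,
}

/// Sketch of a monitor-update step carrying the preimage plus claim details.
enum MonitorUpdateStep {
    PaymentPreimage {
        payment_preimage: PaymentPreimage,
        payment_info: Option<PaymentClaimDetails>,
    },
}
```
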
When we claim an MPP payment, then crash before persisting all the
relevant `ChannelMonitor`s, we rely on the payment data being
available in the `ChannelManager` on restart to re-claim any parts
that haven't yet been claimed. This is fine as long as the
`ChannelManager` was persisted before the `PaymentClaimable` event
was processed, which is generally the case in our
`lightning-background-processor`, but may not be in other cases or
in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of
a payment are in the `ChannelMonitor`, allowing us to re-claim any
missing pieces without reference to any `ChannelManager` data.

Further, in order to properly generate a `PaymentClaimed` event
against the re-started claim, we have to store various payment
metadata with the HTLC list as well.

Here we store the required MPP parts and metadata in
`ChannelMonitor`s and make them available to `ChannelManager` on
load.

In a coming commit we'll use the existing `ChannelManager` claim
flow to claim HTLCs which we found partially claimed on startup,
necessitating having a full `ChannelManager` when we go to do so.

Here we move the re-claim logic down in the `ChannelManager`-read
logic so that we have that.

Here we wrap the logic which moves claimable payments from
`claimable_payments` to `pending_claiming_payments` into a new
utility function on `ClaimablePayments`. This will allow us to call
this new logic during `ChannelManager` deserialization in a few
commits.
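
A simplified sketch of the refactor (the method name and field shapes here are assumptions): the move from `claimable_payments` to `pending_claiming_payments` becomes a single method on `ClaimablePayments`, callable from the normal claim path now and from `ChannelManager` deserialization in a later commit.

```rust
use std::collections::HashMap;

type PaymentHash = [u8; 32];

// Hypothetical, heavily simplified stand-ins for the real payment state.
struct ClaimablePayment { amount_msat: u64 }
struct ClaimingPayment { amount_msat: u64 }

struct ClaimablePayments {
    claimable_payments: HashMap<PaymentHash, ClaimablePayment>,
    pending_claiming_payments: HashMap<PaymentHash, ClaimingPayment>,
}

impl ClaimablePayments {
    /// Sketch of the new utility: move a payment from "claimable" to "pending
    /// claiming". Returns false if nothing was claimable under this hash.
    fn begin_claiming(&mut self, hash: PaymentHash) -> bool {
        match self.claimable_payments.remove(&hash) {
            Some(claimable) => {
                let claiming = ClaimingPayment { amount_msat: claimable.amount_msat };
                self.pending_claiming_payments.insert(hash, claiming);
                true
            }
            None => false,
        }
    }
}
```
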
In the next commit we'll start using (much of) the normal HTLC
claim pipeline to replay payment claims on startup. In order to do
so, however, we have to properly handle cases where we get a
`DuplicateClaim` back from the channel for an inbound-payment HTLC.

Here we do so, handling the `MonitorUpdateCompletionAction` and
allowing an already-completed RAA blocker.
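
A rough sketch of the case being handled (the enum and flow are simplified assumptions, not the real channel API): when a replayed claim hits a channel that already learned the preimage, the channel reports a duplicate claim, and the caller treats the corresponding RAA blocker as possibly already completed instead of asserting on it.

```rust
/// Hypothetical, simplified outcome of asking a channel to claim an inbound HTLC.
enum ClaimResult {
    /// The channel generated a new monitor update for this claim.
    NewClaim,
    /// The channel had already seen this preimage (e.g. a claim replayed on startup).
    DuplicateClaim,
}

/// Sketch: track an RAA blocker for new claims, but accept that on a duplicate
/// claim the blocker may have been completed in a previous run.
fn handle_inbound_claim(result: ClaimResult, raa_blockers: &mut Vec<u64>, blocker_id: u64) {
    match result {
        ClaimResult::NewClaim => {
            // Normal path: block the RAA until the monitor update completes.
            raa_blockers.push(blocker_id);
        }
        ClaimResult::DuplicateClaim => {
            // Replay path: the claim already completed before the restart, so a
            // missing (already-completed) blocker is expected, not a bug.
            if let Some(pos) = raa_blockers.iter().position(|b| *b == blocker_id) {
                raa_blockers.remove(pos);
            }
        }
    }
}
```
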
When we claim an MPP payment, then crash before persisting all the
relevant `ChannelMonitor`s, we rely on the payment data being
available in the `ChannelManager` on restart to re-claim any parts
that haven't yet been claimed. This is fine as long as the
`ChannelManager` was persisted before the `PaymentClaimable` event
was processed, which is generally the case in our
`lightning-background-processor`, but may not be in other cases or
in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of
a payment are in the `ChannelMonitor`, allowing us to re-claim any
missing pieces without reference to any `ChannelManager` data.

Further, in order to properly generate a `PaymentClaimed` event
against the re-started claim, we have to store various payment
metadata with the HTLC list as well.

Here we finally implement claiming using the new MPP part list and
metadata stored in `ChannelMonitor`s. In doing so, we use much more
of the existing HTLC-claiming pipeline in `ChannelManager`,
utilizing the on-startup background events flow as well as properly
re-applying the RAA-blockers to ensure preimages cannot be lost.

When we discover we've only partially claimed an MPP HTLC during
`ChannelManager` reading, we need to add the payment preimage to
all other `ChannelMonitor`s that were a part of the payment.

We previously did this with a direct call on the `ChannelMonitor`,
requiring users write the full `ChannelMonitor` to disk to ensure
that updated information made it.

This adds quite a bit of delay during initial startup - fully
resilvering each `ChannelMonitor` just to handle this one case is
incredibly excessive.

Over the past few commits we dropped the need to pass HTLCs
directly to the `ChannelMonitor`s using the background events to
provide `ChannelMonitorUpdate`s instead.

Thus, here we finally drop the requirement to resilver
`ChannelMonitor`s on startup.

Because the new startup `ChannelMonitor` persistence semantics rely
on new information stored in `ChannelMonitor` only for claims made
in the upgraded code, users upgrading from previous versions of LDK
must apply the old `ChannelMonitor` persistence semantics at least
once (as the old code will be used to handle partial claims).

@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch from 3e355b3 to ba26432 on October 24, 2024 17:44
@TheBlueMatt
Collaborator Author

Squashed without further changes.

@TheBlueMatt merged commit 5c975f7 into lightningdevkit:main on Oct 28, 2024
16 of 18 checks passed