Stop relying on ChannelMonitor persistence after manager read #3322

Merged

Conversation

TheBlueMatt
Collaborator

When we discover we've only partially claimed an MPP HTLC during
ChannelManager reading, we need to add the payment preimage to
all other ChannelMonitors that were a part of the payment.

We previously did this with a direct call on the ChannelMonitor,
requiring users write the full ChannelMonitor to disk to ensure
that updated information made it.

This adds quite a bit of delay during initial startup - fully
resilvering each ChannelMonitor just to handle this one case is
incredibly excessive.

Instead, we rewrite the MPP claim replay logic to use only (new) data included in ChannelMonitors, which has the nice side-effect of teeing up future ChannelManager-non-persistence features, as well as making our PaymentClaimed event generation much more robust.
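
As a rough illustration of the new flow — the types and helper below are simplified stand-ins, not the actual LDK API — replaying partially-claimed MPP payments from monitor data at startup looks roughly like this:

```rust
// Simplified stand-ins for LDK types; a sketch of the idea only, not the real
// `ChannelManager`/`ChannelMonitor` interfaces.
use std::collections::{HashMap, HashSet};

type PaymentHash = [u8; 32];
type ChannelId = [u8; 32];

/// Hypothetical per-payment claim data stored durably in each claiming monitor.
struct StoredClaim {
    /// Every channel that carried a part of this MPP payment.
    mpp_part_channels: Vec<ChannelId>,
}

/// On startup, find MPP parts whose monitors never learned the preimage: those
/// are the (channel, payment) pairs that need a replayed `ChannelMonitorUpdate`
/// rather than a full re-persist of every monitor.
fn claims_to_replay(
    stored: &HashMap<ChannelId, HashMap<PaymentHash, StoredClaim>>,
) -> HashSet<(ChannelId, PaymentHash)> {
    let mut missing = HashSet::new();
    for claims in stored.values() {
        for (hash, claim) in claims {
            for part_chan in &claim.mpp_part_channels {
                let part_has_preimage = stored
                    .get(part_chan)
                    .map_or(false, |c| c.contains_key(hash));
                if !part_has_preimage {
                    missing.insert((*part_chan, *hash));
                }
            }
        }
    }
    missing
}
```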

@arik-so
Contributor

arik-so commented Sep 24, 2024

A bunch of tests are still trying to call provide_payment_preimage()

@TheBlueMatt
Collaborator Author

Ugh, sorry, fixed. Also rebased and fixed an issue introduced in #3303

@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch 2 times, most recently from 5437927 to b0fa756 on September 30, 2024 21:13

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 90.65657% with 37 lines in your changes missing coverage. Please review.

Project coverage is 89.66%. Comparing base (a661c92) to head (b0fa756).

Files with missing lines                Patch %   Lines
lightning/src/ln/channelmanager.rs      89.36%    23 Missing and 9 partials ⚠️
lightning/src/chain/channelmonitor.rs   96.00%    2 Missing ⚠️
lightning/src/ln/reload_tests.rs        88.88%    0 Missing and 2 partials ⚠️
lightning/src/ln/channel.rs             91.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3322      +/-   ##
==========================================
- Coverage   89.68%   89.66%   -0.02%     
==========================================
  Files         126      126              
  Lines      103168   103370     +202     
==========================================
+ Hits        92522    92686     +164     
- Misses       7934     7971      +37     
- Partials     2712     2713       +1     


@TheBlueMatt added this to the 0.1 milestone on Oct 1, 2024
@TheBlueMatt
Collaborator Author

Tagging 0.1 because of the first commit.

Contributor

@arik-so left a comment


Nit: fix "insetad" typo in commit message d44066b

.expect("Failed to get node_id for phantom node recipient");
receiver_node_id = phantom_pubkey;
break;
let (sources, claiming_payment) = {
Contributor

To compare the before and after for equivalence, the scroll distance is 5,000 lines of code. I wonder whether some of this could be moved into a separate file?

Collaborator Author

git show --color-moved is your friend here, I think :)

Collaborator Author

And, yea, obviously we should move parts of ChannelManager out, but I don't think this specific logic should be moved given its tie to other stuff in ChannelManager.

Contributor

@valentinewallace left a comment


Sorry for the delay! Still making my way through this.

return Err(DecodeError::InvalidValue);
},
hash_map::Entry::Vacant(entry) => {
entry.insert(payment_claim.claiming_payment);
Contributor

Adding a panic here hits a bunch of tests, which is good, but for some of them I'm not sure why or if we should be reinserting here. For example, we hit this in test_mpp_keysend on drop of the test Node. In that test, it seems like the preimage should've been pruned from the monitor already. Do you know why this is happening?

Collaborator Author

We drop preimages one update cycle after we need to. So if a channel completed a claim in a full normal cycle and is "at rest" it'll still have the preimage. Only after the next round of commitment updates will that preimage get dropped. This supports some weird setups where you might have multiple redundant copies of a ChannelMonitor, but it's also kinda nice. Still, it's definitely weird here: it means if a channel is "at rest" we'll always replay a claim event on startup.

With the claim IDs I think it's okay, and kinda nice anyway because it reduces the chances of errors on the client side.
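
As a rough illustration of why the claim IDs help here (purely hypothetical client-side handling, not an LDK API): if the claim event can be replayed on restart, a stable identifier lets the consumer make handling idempotent.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified claim event: a stable per-claim id lets the event
/// consumer drop replays of the same claim after a restart.
struct ClaimedEvent {
    claim_id: [u8; 32],
    amount_msat: u64,
}

#[derive(Default)]
struct ClaimTracker {
    seen: HashSet<[u8; 32]>,
}

impl ClaimTracker {
    fn handle(&mut self, ev: ClaimedEvent) {
        // `insert` returns false if the id was already recorded, i.e. this is a
        // replayed claim from a previous run; ignore it rather than erroring.
        if self.seen.insert(ev.claim_id) {
            println!("recorded claim of {} msat", ev.amount_msat);
        }
    }
}
```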

Contributor

That's useful info. I think it would be good to put some of that on ChannelMonitorImpl::payment_preimages since IIUC that's the reason we hold a Vec<PaymentClaimDetails> there, which implies allowing multiple separate payments to be pending claim for the same payment hash. I was a little confused why we allow duplicate payment hash payments there at first.

Collaborator Author

I guess that's a part of it? I wasn't really being so deliberate about it - generally we just try not to assume too much about payment hash reuse so it seemed weird to assume it can't happen here. Good to document either way.
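
For context, the shape under discussion is roughly the following (an assumption based on this thread, not the exact field definitions): preimages are keyed by payment hash, but the claim details are a Vec, so separate payments that happen to reuse a hash can both be pending claim.

```rust
use std::collections::HashMap;

// Simplified stand-ins for LDK's PaymentHash/PaymentPreimage newtypes.
type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

/// Hypothetical, trimmed-down version of the per-claim metadata added by this PR.
struct PaymentClaimDetails {
    amount_msat: u64,
}

/// Assumed rough shape of `ChannelMonitorImpl::payment_preimages` after this
/// change: one preimage per hash, plus a Vec of claim details because
/// payment-hash reuse across distinct payments is deliberately not ruled out.
struct MonitorPreimages {
    payment_preimages: HashMap<PaymentHash, (PaymentPreimage, Vec<PaymentClaimDetails>)>,
}
```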

Comment on lines +911 to +913
assert!(get_monitor!(nodes[3], chan_id_persisted).get_stored_preimages().contains_key(&payment_hash));
assert!(get_monitor!(nodes[3], chan_id_not_persisted).get_stored_preimages().contains_key(&payment_hash));

Contributor

Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed? It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.

Collaborator Author

Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed?

Done

It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.

Hmm, I checked, looks like ~75 tests fail if we do it...that might be one for a followup 😅

@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch 3 times, most recently from dabb898 to cd4f665 on October 23, 2024 15:57

`or_default` is generally less readable than writing out the thing
we're writing, as `Default` is opaque but explicit constructors
generally are not. Thus, we ignore the clippy lint (ideally we
could invert it and ban the use of `Default` in the crate entirely
but alas).
In aa09c33 we added a new secret
in `ChannelManager` with which to derive inbound `PaymentId`s. We
added read support for the new field, but forgot to add writing
support for it. Here we fix this oversight.
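
The class of bug being fixed, in a heavily simplified, hand-rolled form (the real code uses LDK's TLV serialization macros, and the field name here is only illustrative): an optional field was read on deserialization but never written on serialization, so it could never round-trip.

```rust
// Hypothetical, hand-rolled sketch of read/write symmetry for a newly added
// optional field; the real code uses LDK's TLV-based serialization macros.
struct ManagerState {
    // Newly added secret used to derive inbound payment ids (illustrative name).
    inbound_payment_id_secret: Option<[u8; 32]>,
}

impl ManagerState {
    fn write(&self, out: &mut Vec<u8>) {
        // The fix: this block was missing, so the field below was read back but
        // never persisted and always came back as None after a reload.
        match &self.inbound_payment_id_secret {
            Some(secret) => {
                out.push(1); // presence marker
                out.extend_from_slice(secret);
            }
            None => out.push(0),
        }
    }

    fn read(data: &[u8]) -> Option<ManagerState> {
        let (&marker, rest) = data.split_first()?;
        let inbound_payment_id_secret = if marker == 1 {
            Some(<[u8; 32]>::try_from(rest.get(..32)?).ok()?)
        } else {
            None
        };
        Some(ManagerState { inbound_payment_id_secret })
    }
}
```
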
@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch from aa8a0ad to 2fd87cf on October 23, 2024 16:15
@TheBlueMatt
Collaborator Author

Squashed fixups and rebased to tell clippy to shut up.

Comment on lines 1128 to 1166
struct HTLCClaimSource {
	counterparty_node_id: Option<PublicKey>,
	funding_txo: OutPoint,
	channel_id: ChannelId,
	htlc_id: u64,
}

impl From<&MPPClaimHTLCSource> for HTLCClaimSource {
	fn from(o: &MPPClaimHTLCSource) -> HTLCClaimSource {
		HTLCClaimSource {
			counterparty_node_id: Some(o.counterparty_node_id),
			funding_txo: o.funding_txo,
			channel_id: o.channel_id,
			htlc_id: o.htlc_id,
		}
	}
}

#[derive(Clone, Debug, PartialEq, Eq)]
struct MPPClaimHTLCSource {
	counterparty_node_id: PublicKey,
	funding_txo: OutPoint,
	channel_id: ChannelId,
	htlc_id: u64,
}
Contributor

Can we add some comments for these structs and clarify there why we need both of them?

Contributor

@valentinewallace left a comment


Took me a while to get up to speed on this patch, but I think this is basically there on my end now.

if let MonitorUpdateCompletionAction::FreeOtherChannelImmediately {
downstream_counterparty_node_id: node_id,
downstream_funding_outpoint: funding_outpoint,
downstream_funding_outpoint: _,
blocking_action: blocker, downstream_channel_id: channel_id,
} = action {
Contributor

I'm not sure if this is reachable, and it's pre-existing, but FYI we don't have coverage for getting a MonitorUpdateCompletionAction::FreeOtherChannelImmediately while during_init. If it's unreachable, adding a debug_assert could be useful.

Collaborator Author

I'm not entirely convinced it's unreachable (and certainly not that it won't become reachable) - on the read end, if an HTLC ends up in pending_claims_to_replay we'll call claim_funds_internal, which for relayed payments calls claim_funds_from_hop and can return FreeOtherChannelImmediately for duplicate claims. Indeed we should get test coverage, but that's not related to this PR :/.

@valentinewallace
Contributor

LGTM after squash.

When we started tracking which channels had MPP parts claimed
durably on-disk in their `ChannelMonitor`, we did so with a tuple.
This was fine in that it was only ever accessed in two places, but
as we will start tracking it through to the `ChannelMonitor`s
themselves in the coming commit(s), it is useful to have it in a
struct instead.
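
Roughly, the change being described (the struct is the `MPPClaimHTLCSource` quoted in the review above; the exact tuple contents are an assumption): a positional tuple of the claimed part's identifiers becomes a named struct that can later be serialized into the `ChannelMonitor`.

```rust
// Placeholder aliases standing in for the real bitcoin/LDK types.
type PublicKey = [u8; 33];
type OutPoint = ([u8; 32], u16);
type ChannelId = [u8; 32];

// Before (sketch): an unnamed tuple, fine while it was only touched in two places.
type ClaimedMPPPartTuple = (PublicKey, OutPoint, ChannelId, u64);

// After (sketch): the same data with named fields, ready to be threaded through
// to the `ChannelMonitor` in the following commits.
struct MPPClaimHTLCSource {
    counterparty_node_id: PublicKey,
    funding_txo: OutPoint,
    channel_id: ChannelId,
    htlc_id: u64,
}
```
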
When we claim an MPP payment, then crash before persisting all the
relevant `ChannelMonitor`s, we rely on the payment data being
available in the `ChannelManager` on restart to re-claim any parts
that haven't yet been claimed. This is fine as long as the
`ChannelManager` was persisted before the `PaymentClaimable` event
was processed, which is generally the case in our
`lightning-background-processor`, but may not be in other cases or
in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of
a payment are in the `ChannelMonitor`, allowing us to re-claim any
missing pieces without reference to any `ChannelManager` data.

Further, in order to properly generate a `PaymentClaimed` event
against the re-started claim, we have to store various payment
metadata with the HTLC list as well.

Here we take the first step, building a list of MPP parts and
metadata in `ChannelManager` and passing it through to
`ChannelMonitor` in the `ChannelMonitorUpdate`s.
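
A minimal sketch of the data being added (names loosely follow identifiers seen elsewhere in this PR such as `PaymentClaimDetails` and `MPPClaimHTLCSource`; the exact fields are assumptions): the claim-side `ChannelMonitorUpdate` now carries, alongside the preimage, the full list of MPP parts plus the metadata needed to later rebuild a `PaymentClaimed` event.

```rust
// Placeholder aliases standing in for the real bitcoin/LDK types.
type PublicKey = [u8; 33];
type OutPoint = ([u8; 32], u16);
type ChannelId = [u8; 32];
type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

/// One HTLC that formed part of the MPP payment (sketch).
struct MPPClaimHTLCSource {
    counterparty_node_id: PublicKey,
    funding_txo: OutPoint,
    channel_id: ChannelId,
    htlc_id: u64,
}

/// Sketch of the per-claim data attached to the claiming monitor updates:
/// enough to re-claim the other parts and regenerate the `PaymentClaimed`
/// event without consulting any `ChannelManager` state.
struct PaymentClaimDetails {
    mpp_parts: Vec<MPPClaimHTLCSource>,
    payment_hash: PaymentHash,
    amount_msat: u64,
}

/// Sketch of a monitor-update step carrying the preimage plus claim details.
enum MonitorUpdateStep {
    PaymentPreimage {
        payment_preimage: PaymentPreimage,
        payment_info: Option<PaymentClaimDetails>,
    },
}
```
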
When we claim an MPP payment, then crash before persisting all the
relevant `ChannelMonitor`s, we rely on the payment data being
available in the `ChannelManager` on restart to re-claim any parts
that haven't yet been claimed. This is fine as long as the
`ChannelManager` was persisted before the `PaymentClaimable` event
was processed, which is generally the case in our
`lightning-background-processor`, but may not be in other cases or
in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of
a payment are in the `ChannelMonitor`, allowing us to re-claim any
missing pieces without reference to any `ChannelManager` data.

Further, in order to properly generate a `PaymentClaimed` event
against the re-started claim, we have to store various payment
metadata with the HTLC list as well.

Here we store the required MPP parts and metadata in
`ChannelMonitor`s and make them available to `ChannelManager` on
load.

In a coming commit we'll use the existing `ChannelManager` claim
flow to claim HTLCs which we found partially claimed on startup,
necessitating having a full `ChannelManager` when we go to do so.

Here we move the re-claim logic down in the `ChannelManager`-read
logic so that we have that.

Here we wrap the logic which moves claimable payments from
`claimable_payments` to `pending_claiming_payments` into a new
utility function on `ClaimablePayments`. This will allow us to call
this new logic during `ChannelManager` deserialization in a few
commits.
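
A simplified sketch of the refactor (the method name and field shapes here are assumptions): the move from `claimable_payments` to `pending_claiming_payments` becomes a single method on `ClaimablePayments`, callable from the normal claim path now and from `ChannelManager` deserialization in a later commit.

```rust
use std::collections::HashMap;

type PaymentHash = [u8; 32];

// Hypothetical, heavily simplified stand-ins for the real payment state.
struct ClaimablePayment { amount_msat: u64 }
struct ClaimingPayment { amount_msat: u64 }

struct ClaimablePayments {
    claimable_payments: HashMap<PaymentHash, ClaimablePayment>,
    pending_claiming_payments: HashMap<PaymentHash, ClaimingPayment>,
}

impl ClaimablePayments {
    /// Sketch of the new utility: move a payment from "claimable" to "pending
    /// claiming". Returns false if nothing was claimable under this hash.
    fn begin_claiming(&mut self, hash: PaymentHash) -> bool {
        match self.claimable_payments.remove(&hash) {
            Some(claimable) => {
                let claiming = ClaimingPayment { amount_msat: claimable.amount_msat };
                self.pending_claiming_payments.insert(hash, claiming);
                true
            }
            None => false,
        }
    }
}
```
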
In the next commit we'll start using (much of) the normal HTLC
claim pipeline to replay payment claims on startup. In order to do
so, however, we have to properly handle cases where we get a
`DuplicateClaim` back from the channel for an inbound-payment HTLC.

Here we do so, handling the `MonitorUpdateCompletionAction` and
allowing an already-completed RAA blocker.
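
A rough sketch of the case being handled (the enum and flow are simplified assumptions, not the real channel API): when a replayed claim hits a channel that already learned the preimage, the channel reports a duplicate claim, and the caller treats the corresponding RAA blocker as possibly already completed instead of asserting on it.

```rust
/// Hypothetical, simplified outcome of asking a channel to claim an inbound HTLC.
enum ClaimResult {
    /// The channel generated a new monitor update for this claim.
    NewClaim,
    /// The channel had already seen this preimage (e.g. a claim replayed on startup).
    DuplicateClaim,
}

/// Sketch: track an RAA blocker for new claims, but accept that on a duplicate
/// claim the blocker may have been completed in a previous run.
fn handle_inbound_claim(result: ClaimResult, raa_blockers: &mut Vec<u64>, blocker_id: u64) {
    match result {
        ClaimResult::NewClaim => {
            // Normal path: block the RAA until the monitor update completes.
            raa_blockers.push(blocker_id);
        }
        ClaimResult::DuplicateClaim => {
            // Replay path: the claim already completed before the restart, so a
            // missing (already-completed) blocker is expected, not a bug.
            if let Some(pos) = raa_blockers.iter().position(|b| *b == blocker_id) {
                raa_blockers.remove(pos);
            }
        }
    }
}
```
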
When we claim an MPP payment, then crash before persisting all the
relevant `ChannelMonitor`s, we rely on the payment data being
available in the `ChannelManager` on restart to re-claim any parts
that haven't yet been claimed. This is fine as long as the
`ChannelManager` was persisted before the `PaymentClaimable` event
was processed, which is generally the case in our
`lightning-background-processor`, but may not be in other cases or
in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of
a payment are in the `ChannelMonitor`, allowing us to re-claim any
missing pieces without reference to any `ChannelManager` data.

Further, in order to properly generate a `PaymentClaimed` event
against the re-started claim, we have to store various payment
metadata with the HTLC list as well.

Here we finally implement claiming using the new MPP part list and
metadata stored in `ChannelMonitor`s. In doing so, we use much more
of the existing HTLC-claiming pipeline in `ChannelManager`,
utilizing the on-startup background events flow as well as properly
re-applying the RAA-blockers to ensure preimages cannot be lost.

When we discover we've only partially claimed an MPP HTLC during
`ChannelManager` reading, we need to add the payment preimage to
all other `ChannelMonitor`s that were a part of the payment.

We previously did this with a direct call on the `ChannelMonitor`,
requiring users write the full `ChannelMonitor` to disk to ensure
that updated information made it.

This adds quite a bit of delay during initial startup - fully
resilvering each `ChannelMonitor` just to handle this one case is
incredibly excessive.

Over the past few commits we dropped the need to pass HTLCs
directly to the `ChannelMonitor`s using the background events to
provide `ChannelMonitorUpdate`s instead.

Thus, here we finally drop the requirement to resilver
`ChannelMonitor`s on startup.

Because the new startup `ChannelMonitor` persistence semantics rely
on new information stored in `ChannelMonitor` only for claims made
in the upgraded code, users upgrading from previous versions of LDK
must apply the old `ChannelMonitor` persistence semantics at least
once (as the old code will be used to handle partial claims).

@TheBlueMatt force-pushed the 2024-06-mpp-claim-without-man branch from 3e355b3 to ba26432 on October 24, 2024 17:44
@TheBlueMatt
Collaborator Author

Squashed without further changes.

@TheBlueMatt merged commit 5c975f7 into lightningdevkit:main on Oct 28, 2024
16 of 18 checks passed