Stop relying on ChannelMonitor persistence after manager read #3322
Conversation
A bunch of tests are still trying to call
Ugh, sorry, fixed. Also rebased and fixed an issue introduced in #3303
Force-pushed from 5437927 to b0fa756 (compare)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3322      +/-   ##
==========================================
- Coverage   89.68%   89.66%    -0.02%
==========================================
  Files         126      126
  Lines      103168   103370      +202
  Branches   103168   103370      +202
==========================================
+ Hits        92522    92686      +164
- Misses       7934     7971       +37
- Partials     2712     2713        +1

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Tagging 0.1 because of the first commit.
Nit: fix "insetad" typo in commit message d44066b
.expect("Failed to get node_id for phantom node recipient");
receiver_node_id = phantom_pubkey;
break;
let (sources, claiming_payment) = {
To compare the before and after for equivalence, the scroll distance is 5,000 lines of code. I wonder whether perhaps some of this could be moved into a separate file?
`git show --color-moved` is your friend here, I think :)
And, yeah, obviously we should move parts of `ChannelManager` out, but I don't think this specific logic should be moved given its ties to other stuff in `ChannelManager`.
Sorry for the delay! Still making my way through this.
return Err(DecodeError::InvalidValue);
},
hash_map::Entry::Vacant(entry) => {
entry.insert(payment_claim.claiming_payment);
Adding a panic here hits a bunch of tests, which is good, but for some of them I'm not sure why or if we should be reinserting here. For example, we hit this in `test_mpp_keysend` on `drop` of the test `Node`. In that test, it seems like the preimage should've been pruned from the monitor already. Do you know why this is happening?
We drop preimages one update cycle after we need to. So if a channel completed a claim in a full normal cycle and is "at rest", it'll still have the preimage. Only after the next round of commitment updates will that preimage get dropped. This supports some weird setups where you might have multiple redundant copies of a `ChannelMonitor`, but it's also kinda nice. Still, it's definitely weird here; it means that if a channel is "at rest" we'll always replay a claim event on startup.
With the claim IDs I think it's okay, and kinda nice anyway because it reduces the chances of errors on the client side.
That's useful info. I think it would be good to put some of that on `ChannelMonitorImpl::payment_preimages`, since IIUC that's the reason we hold a `Vec<PaymentClaimDetails>` there, which implies allowing multiple separate payments to be pending claim for the same payment hash. I was a little confused why we allow duplicate payment hash payments there at first.
I guess that's a part of it? I wasn't really being so deliberate about it - generally we just try not to assume too much about payment hash reuse so it seemed weird to assume it can't happen here. Good to document either way.
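To make the duplicate-payment-hash point above concrete, here's a minimal sketch (stand-in types and field names, not the real LDK definitions) of a preimage map that can hold more than one pending claim per payment hash:

```rust
use std::collections::HashMap;

// Stand-in types for illustration only.
type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

#[derive(Clone, Debug)]
struct ClaimDetails {
    // Metadata needed to regenerate a payment-claimed event on restart.
    amount_msat: u64,
}

// Keeping a Vec rather than a single entry means two independent payments that
// (unusually) reuse the same payment hash can both be pending claim at once,
// matching the "don't assume hash reuse can't happen" hedge above.
fn insert_claim(
    preimages: &mut HashMap<PaymentHash, (PaymentPreimage, Vec<ClaimDetails>)>,
    hash: PaymentHash, preimage: PaymentPreimage, details: ClaimDetails,
) {
    preimages.entry(hash).or_insert_with(|| (preimage, Vec::new())).1.push(details);
}
```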
assert!(get_monitor!(nodes[3], chan_id_persisted).get_stored_preimages().contains_key(&payment_hash));
assert!(get_monitor!(nodes[3], chan_id_not_persisted).get_stored_preimages().contains_key(&payment_hash));
Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed? It might be nice to also adapt `Node::drop` to check that there are no dangling RAA blockers.
> Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed?

Done

> It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.

Hmm, I checked, and it looks like ~75 tests fail if we do it... that might be one for a followup 😅
Force-pushed from dabb898 to cd4f665 (compare)
`or_default` is generally less readable than writing out the thing we're writing, as `Default` is opaque but explicit constructors generally are not. Thus, we ignore the clippy lint (ideally we could invert it and ban the use of `Default` in the crate entirely but alas).
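As a rough illustration of the readability argument (a hypothetical example, not code from this PR):

```rust
use std::collections::HashMap;

// With the explicit constructor the reader sees the concrete type being built;
// `.or_default()` hides it behind the `Default` impl, which is what clippy
// would otherwise push us toward.
fn record_part(parts: &mut HashMap<u64, Vec<u32>>, payment_id: u64, htlc_id: u32) {
    parts.entry(payment_id).or_insert_with(Vec::new).push(htlc_id);
}

fn main() {
    let mut parts = HashMap::new();
    record_part(&mut parts, 42, 7);
    assert_eq!(parts[&42], vec![7]);
}
```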
In aa09c33 we added a new secret in `ChannelManager` with which to derive inbound `PaymentId`s. We added read support for the new field, but forgot to add write support for it. Here we fix this oversight.
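A simplified sketch of the failure mode (hypothetical struct and plain byte serialization, not the actual LDK TLV machinery): if a newly added field is read but never written, a round-trip immediately fails.

```rust
// Hypothetical struct with a newly added field.
struct Manager {
    legacy_field: u64,
    inbound_payment_id_secret: [u8; 32], // the new field
}

impl Manager {
    fn write(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.legacy_field.to_be_bytes());
        // Forgetting this line is the spirit of the bug fixed above: the
        // reader below can never find the field it expects.
        out.extend_from_slice(&self.inbound_payment_id_secret);
    }

    fn read(data: &[u8]) -> Option<Manager> {
        let legacy_field = u64::from_be_bytes(data.get(..8)?.try_into().ok()?);
        let secret: [u8; 32] = data.get(8..40)?.try_into().ok()?;
        Some(Manager { legacy_field, inbound_payment_id_secret: secret })
    }
}

fn main() {
    let m = Manager { legacy_field: 1, inbound_payment_id_secret: [7; 32] };
    let mut buf = Vec::new();
    m.write(&mut buf);
    assert!(Manager::read(&buf).is_some()); // fails if write() omits the new field
}
```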
Force-pushed from aa8a0ad to 2fd87cf (compare)
Squashed fixups and rebased to tell clippy to shut up.
lightning/src/ln/channelmanager.rs (outdated)
// Identifies the channel and HTLC from which an inbound-payment HTLC is being claimed.
// The same shape as `MPPClaimHTLCSource` below, except the counterparty node id is
// optional (presumably so the claim path can also handle older data where it isn't set).
struct HTLCClaimSource {
	counterparty_node_id: Option<PublicKey>,
	funding_txo: OutPoint,
	channel_id: ChannelId,
	htlc_id: u64,
}

impl From<&MPPClaimHTLCSource> for HTLCClaimSource {
	fn from(o: &MPPClaimHTLCSource) -> HTLCClaimSource {
		HTLCClaimSource {
			counterparty_node_id: Some(o.counterparty_node_id),
			funding_txo: o.funding_txo,
			channel_id: o.channel_id,
			htlc_id: o.htlc_id,
		}
	}
}

// One MPP part of an inbound payment being claimed, with the counterparty node id always known.
#[derive(Clone, Debug, PartialEq, Eq)]
struct MPPClaimHTLCSource {
	counterparty_node_id: PublicKey,
	funding_txo: OutPoint,
	channel_id: ChannelId,
	htlc_id: u64,
}
Can we add some comments for these structs and clarify why we need both of them?
Took me a while to get up to speed on this patch, but I think this is basically there on my end now.
if let MonitorUpdateCompletionAction::FreeOtherChannelImmediately {
	downstream_counterparty_node_id: node_id,
	downstream_funding_outpoint: funding_outpoint,
	downstream_funding_outpoint: _,
	blocking_action: blocker, downstream_channel_id: channel_id,
} = action {
I'm not sure if this is reachable, and it's pre-existing, but FYI we don't have coverage for getting a `MonitorUpdateCompletionAction::FreeOtherChannelImmediately` while `during_init`. If it's unreachable, adding a `debug_assert` could be useful.
I'm not entirely convinced it's unreachable (and certainly not that it won't become reachable) - on the read end, if an HTLC ends up in `pending_claims_to_replay` we'll call `claim_funds_internal`, which for relayed payments calls `claim_funds_from_hop` and can return `FreeOtherChannelImmediately` for duplicate claims. Indeed we should get test coverage, but that's not related to this PR :/.
Force-pushed from 2fd87cf to 3e355b3 (compare)
LGTM after squash.
When we started tracking which channels had MPP parts claimed durably on-disk in their `ChannelMonitor`, we did so with a tuple. This was fine in that it was only ever accessed in two places, but as we will start tracking it through to the `ChannelMonitor`s themselves in the coming commit(s), it is useful to have it in a struct instead.
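Roughly, the shape of the refactor (a hedged sketch with stand-in types and names; the real struct is the `MPPClaimHTLCSource` shown in the diff earlier):

```rust
// Stand-in aliases so the sketch compiles on its own.
type NodeId = [u8; 33];
type TxOutpoint = ([u8; 32], u16);
type ChanId = [u8; 32];

// Before: an anonymous tuple, workable while it was only read in two places.
type ClaimedMppPartTuple = (NodeId, TxOutpoint, ChanId, u64);

// After: a named struct, so the fields stay self-describing as the data gets
// threaded through to the `ChannelMonitor`s in the following commits.
struct ClaimedMppPart {
    counterparty_node_id: NodeId,
    funding_txo: TxOutpoint,
    channel_id: ChanId,
    htlc_id: u64,
}
```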
When we claim an MPP payment, then crash before persisting all the relevant `ChannelMonitor`s, we rely on the payment data being available in the `ChannelManager` on restart to re-claim any parts that haven't yet been claimed. This is fine as long as the `ChannelManager` was persisted before the `PaymentClaimable` event was processed, which is generally the case in our `lightning-background-processor`, but may not be in other cases or in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of a payment are in the `ChannelMonitor`, allowing us to re-claim any missing pieces without reference to any `ChannelManager` data. Further, in order to properly generate a `PaymentClaimed` event against the re-started claim, we have to store various payment metadata with the HTLC list as well.

Here we take the first step, building a list of MPP parts and metadata in `ChannelManager` and passing it through to `ChannelMonitor` in the `ChannelMonitorUpdate`s.
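A loose sketch (hypothetical names and stand-in types, not the real `ChannelMonitorUpdate` definition) of the kind of payload this commit starts attaching to the per-channel claim updates:

```rust
// Stand-in types so the sketch compiles on its own.
type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];
type ChanId = [u8; 32];

// One entry per MPP part: enough to locate the sibling HTLC in its channel.
struct MppPartRef {
    counterparty_node_id: [u8; 33],
    channel_id: ChanId,
    htlc_id: u64,
}

// Carried alongside the preimage in each channel's claim update so that, after a
// crash, any single monitor can tell us (a) which sibling channels still need
// their part claimed and (b) the metadata needed to regenerate `PaymentClaimed`.
struct ClaimUpdatePayload {
    payment_hash: PaymentHash,
    payment_preimage: PaymentPreimage,
    mpp_parts: Vec<MppPartRef>,
    claimed_amount_msat: u64,
}
```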
When we claim an MPP payment, then crash before persisting all the relevant `ChannelMonitor`s, we rely on the payment data being available in the `ChannelManager` on restart to re-claim any parts that haven't yet been claimed. This is fine as long as the `ChannelManager` was persisted before the `PaymentClaimable` event was processed, which is generally the case in our `lightning-background-processor`, but may not be in other cases or in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of a payment are in the `ChannelMonitor`, allowing us to re-claim any missing pieces without reference to any `ChannelManager` data. Further, in order to properly generate a `PaymentClaimed` event against the re-started claim, we have to store various payment metadata with the HTLC list as well.

Here we store the required MPP parts and metadata in `ChannelMonitor`s and make them available to `ChannelManager` on load.
In a coming commit we'll use the existing `ChannelManager` claim flow to claim HTLCs which we found partially claimed on startup, necessitating having a full `ChannelManager` when we go to do so. Here we move the re-claim logic down in the `ChannelManager`-read logic so that we have that.
Here we wrap the logic which moves claimable payments from `claimable_payments` to `pending_claiming_payments` in a new utility function on `ClaimablePayments`. This will allow us to call this new logic during `ChannelManager` deserialization in a few commits.
In the next commit we'll start using (much of) the normal HTLC claim pipeline to replay payment claims on startup. In order to do so, however, we have to properly handle cases where we get a `DuplicateClaim` back from the channel for an inbound-payment HTLC. Here we do so, handling the `MonitorUpdateCompletionAction` and allowing an already-completed RAA blocker.
When we claim an MPP payment, then crash before persisting all the relevant `ChannelMonitor`s, we rely on the payment data being available in the `ChannelManager` on restart to re-claim any parts that haven't yet been claimed. This is fine as long as the `ChannelManager` was persisted before the `PaymentClaimable` event was processed, which is generally the case in our `lightning-background-processor`, but may not be in other cases or in a somewhat rare race.

In order to fix this, we need to track where all the MPP parts of a payment are in the `ChannelMonitor`, allowing us to re-claim any missing pieces without reference to any `ChannelManager` data. Further, in order to properly generate a `PaymentClaimed` event against the re-started claim, we have to store various payment metadata with the HTLC list as well.

Here we finally implement claiming using the new MPP part list and metadata stored in `ChannelMonitor`s. In doing so, we use much more of the existing HTLC-claiming pipeline in `ChannelManager`, utilizing the on-startup background events flow as well as properly re-applying the RAA blockers to ensure preimages cannot be lost.
When we discover we've only partially claimed an MPP HTLC during `ChannelManager` reading, we need to add the payment preimage to all other `ChannelMonitor`s that were a part of the payment. We previously did this with a direct call on the `ChannelMonitor`, requiring users write the full `ChannelMonitor` to disk to ensure that updated information made it.

This adds quite a bit of delay during initial startup - fully resilvering each `ChannelMonitor` just to handle this one case is incredibly excessive.

Over the past few commits we dropped the need to pass HTLCs directly to the `ChannelMonitor`s, using the background events to provide `ChannelMonitorUpdate`s instead. Thus, here we finally drop the requirement to resilver `ChannelMonitor`s on startup.
Because the new startup `ChannelMonitor` persistence semantics rely on new information stored in `ChannelMonitor` only for claims made in the upgraded code, users upgrading from a previous version of LDK must apply the old `ChannelMonitor` persistence semantics at least once (as the old code will be used to handle partial claims).
Force-pushed from 3e355b3 to ba26432 (compare)
Squashed without further changes.
When we discover we've only partially claimed an MPP HTLC during `ChannelManager` reading, we need to add the payment preimage to all other `ChannelMonitor`s that were a part of the payment.

We previously did this with a direct call on the `ChannelMonitor`, requiring users write the full `ChannelMonitor` to disk to ensure that updated information made it.

This adds quite a bit of delay during initial startup - fully resilvering each `ChannelMonitor` just to handle this one case is incredibly excessive.

Instead, we rewrite the MPP claim replay logic to use only (new) data included in `ChannelMonitor`s, which has a nice side-effect of teeing up future `ChannelManager`-non-persistence features as well as making our `PaymentClaimed` event generation much more robust.