Optimize ChannelMonitor persistence on block connections. #2966
Conversation
Is this something we might want to consider not doing on mobile? Thinking that we won't be able to RBF onchain claims properly if the fee estimator is broken and we're not persisting the most recent feerate we tried within the monitor.
I guess we should/could consider always persisting if there are pending claims (e.g. the channel has been closed but still has balances to claim)? Alternatively, we could always persist if we only have < 5 channels.
What's the status here @G8XSU?
Yes, makes sense; we can always persist if there are pending claims. I am looking for a concept/approach ACK here before I proceed with the rest of the changes. Are we headed in the right direction on how to distribute the persistence?
I think wpaulino raised a good point and we should do something to ensure we regularly persist monitors on mobile (like what I suggested above), but otherwise concept ACK.
Force-pushed from 13f3e59 to d8b1203
lightning/src/chain/chainmonitor.rs
Outdated
let funding_txid_u32 = u32::from_be_bytes([funding_txid_hash_bytes[0], funding_txid_hash_bytes[1], funding_txid_hash_bytes[2], funding_txid_hash_bytes[3]]);
funding_txid_u32.wrapping_add(best_height.unwrap_or_default())
};
const CHAINSYNC_MONITOR_PARTITION_FACTOR: u32 = 50; // ~ 8hours
50 seems like a good number: every 50 blocks is ~8 hours (at ~10 minutes per block), so this reduces IO by ~50 times while still not being a lot for mobile nodes to sync up if they use `listen`, given they are routinely expected to sync this much after every night. For `confirm` users, they are expected to sync all watched transactions on restart/reload in any case.
50 seems fine to me, but can we do min(50, channel_count)? That way nodes with relatively few channels won't pay the cost of a lot of replay at startup. Specifically, I'm thinking mobile nodes or other nodes with few channels can probably afford the per-block sync cost, might restart more often, and don't want to pay the irregular-sync startup cost.
Sounds good 👍
Now accounting for small/mobile nodes separately in the partition factor.
Changed to a piecewise function for a bit more predictability for users compared to min.
It is helpful to assert that chain-sync did trigger a monitor persist.
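To make the piecewise policy concrete, here is a minimal sketch of the idea under the assumptions discussed above; the helper name `should_persist_on_chain_sync`, its signature, and the exact small-node cutoff are illustrative, not necessarily the PR's actual code:

```rust
// Illustrative sketch only, not LDK's implementation.
const CHAINSYNC_MONITOR_PARTITION_FACTOR: u32 = 50; // each monitor persisted ~every 50 blocks (~8 hours)

fn should_persist_on_chain_sync(
	funding_txid_u32: u32, best_height: u32, num_monitors: usize,
) -> bool {
	// Piecewise: nodes with few channels (e.g. mobile) always persist, so their
	// monitors never lag the chain tip and startup replay stays negligible.
	if num_monitors < CHAINSYNC_MONITOR_PARTITION_FACTOR as usize {
		return true;
	}
	// Larger nodes spread writes out: the funding-txid-derived value picks a fixed
	// slot within each 50-block window, so only ~1/50th of monitors are written
	// per connected block.
	funding_txid_u32.wrapping_add(best_height) % CHAINSYNC_MONITOR_PARTITION_FACTOR == 0
}
```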
Force-pushed from abae637 to 2f29569
Marking this PR ready for review.
LGTM, feel free to squash the fixup commit and let's find another reviewer.
Currently, every block connection triggers the persistence of all ChannelMonitors with an updated best_block. This approach poses challenges for large node operators managing thousands of channels. Furthermore, it leads to a thundering herd problem (https://en.wikipedia.org/wiki/Thundering_herd_problem), overwhelming the storage with simultaneous requests.

To address this, we now persist ChannelMonitors at a regular cadence, spreading their persistence across blocks to mitigate spikes in write operations.

Outcome: After this change, LDK's IO footprint should be reduced by ~50 times. The processing time required to sync each block will be significantly reduced, particularly for nodes with thousands of channels, since write latency plays a significant role in that process. As a result, the Node/ChainMonitor will be blocked for a shorter duration, leading to further efficiency gains.
Squashed the fixup commit.
looks good to me!
@@ -297,14 +299,29 @@ where C::Target: chain::Filter,
}

fn update_monitor_with_chain_data<FN>(
&self, header: &Header, txdata: &TransactionData, process: FN, funding_outpoint: &OutPoint,
monitor_state: &MonitorHolder<ChannelSigner>
&self, header: &Header, best_height: Option<u32>, txdata: &TransactionData, process: FN, funding_outpoint: &OutPoint,
what's the reason best_height is used to offset the modulus?
We use it to distribute monitor persistence across time: the funding-txid-derived value picks each monitor's slot within a 50-block window, and adding the current best height makes the check change from block to block, so each monitor hits its slot once per window rather than the condition being fixed for a given monitor.
Currently, every block connection triggers the persistence of all
ChannelMonitors with an updated best_block. This approach poses
challenges for large node operators managing thousands of channels.
Furthermore, it leads to a thundering herd problem
(https://en.wikipedia.org/wiki/Thundering_herd_problem), overwhelming
the storage with simultaneous requests.
To address this issue, we now persist ChannelMonitors at a
regular cadence, spreading their persistence across blocks to
mitigate spikes in write operations.
Outcome: After doing this, LDK's IO footprint should be reduced
by ~50 times. The processing time required to sync each block
will be significantly reduced, particularly for nodes with 1000s
of channels, as write latency plays a significant role in this process.
As a result, the Node/ChainMonitor will be blocked for a shorter
duration, leading to further efficiency gains.
Note that this will also increase the time taken to sync during startup: a node
will now have to sync 25 blocks per channel on average, since monitors
can be at most 50 blocks out of date.
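For intuition on why the height is folded into the modulus (the question raised above), here is a small, self-contained demo; the function and the example txid-derived values are hypothetical and only mirror the check shown in the review snippet:

```rust
const PARTITION_FACTOR: u32 = 50;

// Mirrors the reviewed check: persist when (txid-derived value + height) % 50 == 0.
fn persists_at(funding_txid_u32: u32, height: u32) -> bool {
	funding_txid_u32.wrapping_add(height) % PARTITION_FACTOR == 0
}

fn main() {
	// Two hypothetical monitors with different funding-txid-derived values.
	let monitors = [("A", 7u32), ("B", 31u32)];
	for height in 840_000u32..840_050 {
		for (name, txid_u32) in monitors {
			if persists_at(txid_u32, height) {
				println!("monitor {name} persists at height {height}");
			}
		}
	}
	// Each monitor persists exactly once in the 50-block window, at a different
	// height, so writes are spread across blocks instead of piling up on one.
}
```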
Based on #2957
Tasks:
Closes #2647