
Optimize ChannelMonitor persistence on block connections. #2966

Merged: 2 commits, Jun 20, 2024

Conversation

@G8XSU (Contributor) commented Mar 25, 2024

Currently, every block connection triggers the persistence of all
ChannelMonitors with an updated best_block. This approach poses
challenges for large node operators managing thousands of channels.
Furthermore, it leads to a thundering herd problem
(https://en.wikipedia.org/wiki/Thundering_herd_problem), overwhelming
the storage with simultaneous requests.

To address this issue, we now persist ChannelMonitors at a
regular cadence, spreading their persistence across blocks to
mitigate spikes in write operations.

Outcome: After this change, LDK's IO footprint should be reduced by
~50x. The processing time required to sync each block will be
significantly reduced, particularly for nodes with thousands of
channels, since write latency plays a significant role in that process.
As a result, the Node/ChainMonitor will be blocked for a shorter
duration, leading to further efficiency gains.

Note that this will also increase the time taken to sync during startup:
a node will now have to replay ~25 blocks per channel on average, since
monitors can be at most 50 blocks out of date.
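
For illustration, a minimal Rust sketch of the partitioning idea (the function name and signature are illustrative, not LDK's actual API; the constant and funding-txid derivation mirror the diff discussed later in this thread):

```rust
/// ~8 hours at ~10 minutes per block; each monitor is persisted roughly once per cycle.
const CHAINSYNC_MONITOR_PARTITION_FACTOR: u32 = 50;

/// Returns true if this monitor's turn in the persistence cycle comes up at `best_height`.
fn should_persist_on_block(funding_txid_hash_bytes: &[u8; 32], best_height: u32) -> bool {
	// Derive a stable per-channel offset from the first four bytes of the funding txid hash,
	// so different channels land on different blocks within the 50-block cycle.
	let funding_txid_u32 = u32::from_be_bytes([
		funding_txid_hash_bytes[0],
		funding_txid_hash_bytes[1],
		funding_txid_hash_bytes[2],
		funding_txid_hash_bytes[3],
	]);
	funding_txid_u32.wrapping_add(best_height) % CHAINSYNC_MONITOR_PARTITION_FACTOR == 0
}
```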

Based on #2957

Tasks:

  • Don't pause events for chainsync persistence (#2957) and base this PR on it.
  • Concept/Approach Ack
  • Decide a good default for partition_factor. [50 seems like a good number: every 50 blocks is ~8 hours, which reduces IO by a factor of 50 and at the same time shouldn't be much for mobile nodes to sync, given they are routinely expected to sync that much overnight anyway.]
  • Write more tests for persistence with partition_factor.
  • (Not a priority) Don't trigger chain-sync writes for closed channels/monitors. (This is the next level of optimization and offers only a small additional improvement: this PR already cuts IO by ~50x without it, and doing it would only reduce IO by a further 1-5%, so it can be done as a non-urgent follow-up.)
  • (Not a priority) Maybe we can make partition_factor user-configurable. (We can do this separately, if needed, as our default should be sane enough for now.)

Closes #2647

@wpaulino (Contributor) commented Apr 8, 2024

Is this something we might want to consider not doing on mobile? Thinking that we won't be able to RBF onchain claims properly if the fee estimator is broken and we're not persisting the most recent feerate we tried within the OnchainTxHandler.

@TheBlueMatt (Collaborator) commented:

I guess we should/could consider always persisting if there are pending claims (e.g. the channel has been closed but has balances to claim)? Alternatively, we could always persist if we only have < 5 channels.

@TheBlueMatt (Collaborator) commented:

What's the status here @G8XSU?

@G8XSU (Contributor, Author) commented Jun 4, 2024

Yes makes sense, we can always persist if there are pending claims.

I am looking for a concept/approach ack here before I proceed with the rest of the changes. Are we headed in the right direction on how to distribute the writes?

@TheBlueMatt (Collaborator) commented:

I think wpaulino raised a good point and we should do something to ensure we regularly persist monitors on mobile (like what I suggested above), but otherwise concept ACK.
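
A hedged sketch of the persist decision being discussed (illustrative names and signature, not LDK's actual code): skip the regular cadence and persist on every block whenever a monitor still has pending on-chain claims, so the feerate/RBF state tracked by the OnchainTxHandler never goes stale on disk.

```rust
/// Illustrative only: decide whether to persist a monitor on this block connection.
fn should_persist_monitor(
	has_pending_claims: bool, partition_offset: u32, best_height: u32, partition_factor: u32,
) -> bool {
	if has_pending_claims {
		// e.g. a closed channel with balances still to claim: always persist, so the
		// latest feerate attempts are on disk and claims can be RBF'd after a restart.
		return true;
	}
	// Otherwise only persist when this monitor's slot in the cycle comes up.
	partition_offset.wrapping_add(best_height) % partition_factor == 0
}
```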

@G8XSU force-pushed the 2647-distribute branch 2 times, most recently from 13f3e59 to d8b1203, June 17, 2024 19:46
// Derive a stable per-channel offset from the first four bytes of the funding txid hash.
let funding_txid_u32 = u32::from_be_bytes([funding_txid_hash_bytes[0], funding_txid_hash_bytes[1], funding_txid_hash_bytes[2], funding_txid_hash_bytes[3]]);
// Offset by the current best height so each monitor's persistence slot comes up once per cycle.
funding_txid_u32.wrapping_add(best_height.unwrap_or_default())
};
const CHAINSYNC_MONITOR_PARTITION_FACTOR: u32 = 50; // ~8 hours at ~10 minutes per block
@G8XSU (Contributor, Author) commented on this diff, Jun 17, 2024:

50 seems like a good number: every 50 blocks is ~8 hours, this will reduce IO by ~50x, and at the same time it shouldn't be a lot for mobile nodes to sync up (if they use listen), given they are routinely expected to sync this much after every night.
For Confirm users, they are expected to re-sync all watched transactions on restart/reload in any case.

Reply (Collaborator):

50 seems fine to me, but can we do min(50, channel_count)? That way nodes with relatively few channels won't pay the cost of a lot of replay at startup. Specifically, I'm thinking mobile nodes and other nodes with few channels can probably afford the per-block sync cost, tend to restart more often, and don't want to pay the irregular-sync startup cost.

Reply from @G8XSU (Contributor, Author):

Sounds good 👍
Now accounting for small/mobile nodes separately when computing partition_factor.
Changed to a piecewise function, for a bit more predictability for users compared to min.
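
A sketch of what such a piecewise partition factor could look like (illustrative, not the exact code that landed): nodes with only a few monitors persist on every block, larger nodes spread writes over the full 50-block cycle.

```rust
const CHAINSYNC_MONITOR_PARTITION_FACTOR: u32 = 50;

/// Illustrative piecewise choice: nodes with fewer monitors than the partition factor
/// persist on every block (cheap, and keeps startup replay short for mobile nodes);
/// larger nodes spread persistence across the 50-block cycle.
fn effective_partition_factor(num_monitors: usize) -> u32 {
	if (num_monitors as u32) < CHAINSYNC_MONITOR_PARTITION_FACTOR {
		1
	} else {
		CHAINSYNC_MONITOR_PARTITION_FACTOR
	}
}
```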

It is helpful to assert that chain-sync did trigger a monitor
persist.
@G8XSU force-pushed the 2647-distribute branch 2 times, most recently from abae637 to 2f29569, June 17, 2024 19:55
@G8XSU marked this pull request as ready for review, June 17, 2024 20:10
@G8XSU requested a review from TheBlueMatt, June 17, 2024 20:10
@G8XSU (Contributor, Author) commented Jun 17, 2024

Marking this PR ready for review.

@G8XSU requested a review from wpaulino, June 18, 2024 04:09
@TheBlueMatt (Collaborator) left a review comment:

LGTM, feel free to squash the fixup commit and let's find another reviewer.

@G8XSU force-pushed the 2647-distribute branch from b8e5e3f to bf28957, June 19, 2024 07:04
@G8XSU (Contributor, Author) commented Jun 19, 2024

Squashed the fixup commit.

@tnull self-requested a review, June 20, 2024 07:17
@arik-so (Contributor) left a review comment:

looks good to me!

@@ -297,14 +299,29 @@ where C::Target: chain::Filter,
 	}

 	fn update_monitor_with_chain_data<FN>(
-		&self, header: &Header, txdata: &TransactionData, process: FN, funding_outpoint: &OutPoint,
-		monitor_state: &MonitorHolder<ChannelSigner>
+		&self, header: &Header, best_height: Option<u32>, txdata: &TransactionData, process: FN, funding_outpoint: &OutPoint,
Review comment (Contributor):

what's the reason best_height is used to offset the modulus?

@G8XSU (Contributor, Author) replied Jun 20, 2024:

We use it to distribute monitor persistence across time.
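
A toy example of the effect (illustrative numbers and helper, not LDK code): with a partition factor of 50, a monitor whose funding-txid-derived offset is 7 is persisted at heights 43, 93, 143, ..., while one with offset 20 hits heights 30, 80, 130, .... Each monitor is still written about once every 50 blocks, but different monitors are written on different blocks instead of all at once.

```rust
/// Illustrative helper: the block heights (starting from `from_height`) at which a
/// monitor with the given per-channel offset would be persisted.
fn persist_heights(offset: u32, from_height: u32, count: usize) -> Vec<u32> {
	(from_height..)
		.filter(|h| offset.wrapping_add(*h) % 50 == 0)
		.take(count)
		.collect()
}

// persist_heights(7, 0, 3)  -> [43, 93, 143]
// persist_heights(20, 0, 3) -> [30, 80, 130]
```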

@tnull removed their request for review, June 20, 2024 08:42
@TheBlueMatt merged commit 07d991c into lightningdevkit:main, Jun 20, 2024
16 checks passed

Successfully merging this pull request may close these issues.

[Persistence] Don't persist ALL channel_monitors on every bitcoin block connection.
4 participants