Stop cleaning monitor updates on new block connect #2779
Conversation
Codecov Report

❗ Attention: Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main    #2779      +/-   ##
==========================================
+ Coverage   88.53%   88.55%   +0.02%
==========================================
  Files         114      114
  Lines       89497    91569    +2072
  Branches    89497    91569    +2072
==========================================
+ Hits        79237    81092    +1855
- Misses       7882     7988     +106
- Partials     2378     2489     +111

☔ View full report in Codecov by Sentry.
let maybe_old_monitor = self.read_monitor(&monitor_name);
match maybe_old_monitor {
    Ok((_, ref old_monitor)) => {
        // Check that this key isn't already storing a monitor with a higher update_id
Removed this check since we no longer read old_monitor in the full-monitor-persist case.
Yeah, I think it's fine. I'd previously suggested doing it here, but really ChainMonitor should handle it for us.
    }
    // This means the channel monitor is new.
    Err(ref e) if e.kind() == io::ErrorKind::NotFound => {}
    _ => return chain::ChannelMonitorUpdateStatus::UnrecoverableError,
We no longer fail in this case.
@@ -641,59 +624,6 @@ where
    &monitor_bytes,
) {
    Ok(_) => {
        // Assess cleanup. Typically, we'll clean up only between the last two known full
Moved the cleanup logic to update_persisted_channel, so we only clean up when updates are consolidated due to maximum_pending_updates.
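To make the consolidation criterion concrete, here is a minimal sketch (the function name and shape are illustrative, not the actual code in update_persisted_channel): cleanup only happens when the update counter hits a multiple of maximum_pending_updates, and the range to delete then has a deterministic length, so no old-monitor read is needed.

```rust
// Illustrative sketch only; the real logic lives in
// MonitorUpdatingPersister::update_persisted_channel in lightning/src/util/persist.rs.
fn consolidation_cleanup_range(update_id: u64, maximum_pending_updates: u64) -> Option<(u64, u64)> {
    // Consolidate only at the threshold; block connects no longer trigger cleanup.
    if maximum_pending_updates > 0 && update_id % maximum_pending_updates == 0 {
        // The previous incremental updates are now covered by the full monitor
        // write, so the range to delete has a deterministic length.
        Some((update_id.saturating_sub(maximum_pending_updates), update_id.saturating_sub(1)))
    } else {
        None
    }
}

fn main() {
    // With maximum_pending_updates = 100, update 300 triggers cleanup of 200..=299.
    assert_eq!(consolidation_cleanup_range(300, 100), Some((200, 299)));
    // A non-threshold update (e.g. one driven by a block connect) triggers no cleanup.
    assert_eq!(consolidation_cleanup_range(301, 100), None);
}
```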
} else if update_id != start && monitor.get_latest_update_id() != CLOSED_CHANNEL_UPDATE_ID
{
    // We're deleting something we should know doesn't exist.
    panic!(
There is a possibility of issuing some extra deletes, since no update is persisted on block connect but we will still try to clean up an update with that update_id.
Squashed commits into one.
Tagging @domZippilli as well.
LGTM. I didn't clone it and take an in-depth look yet, but this seems to make sense. Incidentally, we had a problem at …

I guess the main question on my mind is: we were designing this very defensively on the first pass. How confident are we that we really don't need to do that monitor read anymore?
For users of the … The more general reason we had it is for those trying to use this without a …
For now, I'm not too worried about overwriting the monitor without reading it first, since we already do that in the non-MUP implementation.

I think earlier we were focusing more on not creating redundant deletes. In the worst case there will be some stale, un-deleted monitor updates, but that should be fine.
Did a first pass, generally looks good. Some questions, some nits.
lightning/src/util/persist.rs (outdated)
    );
    Some((start, end))
}
_ => None
Is there a reason why we shouldn't do some cleanup (i.e., default to the behavior of the else clause below) if we fail to read the old monitor?
We could also probably just panic instead.
As mentioned here, earlier we were returning an UnrecoverableError in this case, but I think not failing/panicking would be better here, since the worst case is wasted storage and not something critical to the functioning of the node.

We can't do some default cleanup since we don't know which update range to clean up. Let me know if you have any suggestions.
Defaulting to the else clause below won't help, since it is pointless to clean up u64::MAX .. u64::MAX - max_pending_updates: it will not result in any deletes, just extra IO calls.
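A quick illustration of why that default range is a no-op for closed channels, taking CLOSED_CHANNEL_UPDATE_ID to be u64::MAX as noted further down the thread (the numbers here are purely for demonstration):

```rust
// For a closed channel, get_latest_update_id() returns CLOSED_CHANNEL_UPDATE_ID
// (u64::MAX in LDK). Deriving the "else clause" range from it targets update ids
// that were never persisted, so every delete would be a no-op.
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

fn main() {
    let maximum_pending_updates = 100u64;
    let latest = CLOSED_CHANNEL_UPDATE_ID;
    let start = latest.saturating_sub(maximum_pending_updates);
    // Real update ids count up from small integers, nowhere near u64::MAX, so
    // this range contains no existing keys: just wasted IO calls, no deletes.
    println!("would issue deletes for nonexistent keys {}..={}", start, latest);
}
```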
> as mentioned here, earlier we were returning an UnrecoverableError in this case, but i think not failing/panic would be better here. Since the worst case is wasted storage and not something critical to functioning of node.

Hmm, yes, we could ignore the error in this specific case I guess, but generally unrecoverable IO errors are fatal for us? So "as a policy" it might be generally preferable to stop as soon as we see one? But I think we also had this discussion in the original PR and the conclusion was not to error out under certain circumstances, so I don't want to block this PR by restarting that conversation.

> we can't do some default cleanup since we don't know which update_range to cleanup. Lmk if you have any suggestions.
>
> defaulting to else clause below won't help since it is pointless to cleanup u64::MAX .. u64::MAX-max_pending_updates. It will not result in any deletes, just extra IO calls.

Ah, right, the update ids will be u64::MAX for closed channels. Fair point. It's still a bit weird to get an unrecoverable IO error here and just continue on as if nothing happened. 🤷♂️

If you're positive that we shouldn't error out or panic here, let's at least leave a comment that the error case is unhandled and explain the motivation why this is the case.
Yeah, in the earlier version of the code in this PR it was a bit more evident; now it is slightly hidden (with .ok()). Will add a comment.

> It's still a bit weird to get an unrecoverable IO error here and just continue on as if nothing happened.

Note that we were not getting an unrecoverable error, but converting a monitor read failure into an unrecoverable one. There are other instances, such as an error in a delete call, where we don't fail. Best to treat it as optimistic cleanup.
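As a sketch of the "optimistic cleanup" idea (the helper and its signature are hypothetical, not the actual cleanup code): individual delete failures are swallowed, because the worst outcome is a few stale, un-deleted update entries rather than anything critical to node operation.

```rust
use std::io;

// Hypothetical helper illustrating best-effort cleanup: errors from individual
// delete calls are ignored with `.ok()`, mirroring how a failed read of the old
// monitor is also tolerated here.
fn optimistic_cleanup<F>(mut delete_update: F, start: u64, end: u64)
where
    F: FnMut(u64) -> Result<(), io::Error>,
{
    for update_id in start..=end {
        // Best-effort: a failure only leaves a stale update behind.
        delete_update(update_id).ok();
    }
}

fn main() {
    // Example: every other delete "fails", but cleanup still completes.
    optimistic_cleanup(
        |id| {
            if id % 2 == 0 {
                Err(io::Error::new(io::ErrorKind::Other, "throttled"))
            } else {
                Ok(())
            }
        },
        1,
        5,
    );
}
```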
> we could ignore the error in this specific case I guess, but generally unrecoverable IO errors are fatal for us? So "as a policy" it might be generally preferable to stop as soon as we see one?

I don't think that is an ideal policy for all cases. I would hate it if my node panicked in production because a read call got throttled, when it was totally optional for the functioning of my node.
> Note that we were not getting an unrecoverable error but converting monitor read failure into unrecoverable.

Yeah, that's why I spoke of lower-case 'unrecoverable error' rather than UnrecoverableError. In any case, it's still unrecoverable for us at this point.

> I don't think that is an ideal policy for all cases. I would hate it if my node panics in production because a read call got throttled. And it was totally optional for functioning of my node.

If we 'properly' fail to read the monitor due to an IO error at this point, it would indicate something is seriously wrong and the user needs to take action ASAP, as much more serious issues could arise from broken IO down the line. That is really the purpose of UnrecoverableError, no? And, as Dom mentioned below, any recoverable errors should be handled in-line by the KVStore implementation.

I mean, sure, it's not strictly a bug here to continue on if we fail to read the monitor, but I'm not sure we want to trust the user to have proper log monitoring set up to catch that IO errors are happening before it's too late.
I agree that panicking here, though potentially inconvenient for the user, can definitely save them from a big mishap later down the line.

However, I'll also note one thing: the old_monitor is only read once throughout the life of a channel, namely when the channel is closed. So if IO errors are happening for a user, they should have been detected by this point in the life of the channel. A panic here, though potentially useful, will probably also be inconsequential in the grand scheme of things.
I might be nitpicking, but I thought an expectation of the KVStore contract is that the backend would do things like exponential back-off retries to deal with throttling. So a failure from any `impl KVStore` would likely indicate a serious problem.
In lightning/src/util/persist.rs:
+ // We could write this update, but it meets criteria of our design that calls for a full monitor write.
+ let monitor_update_status = self.persist_new_channel(funding_txo, monitor, monitor_update_call_id);
+
+ if let ChannelMonitorUpdateStatus::Completed = monitor_update_status {
+     let cleanup_range = if monitor.get_latest_update_id() == CLOSED_CHANNEL_UPDATE_ID {
+         match maybe_old_monitor {
+             Some(Ok((_, ref old_monitor))) => {
+                 let start = old_monitor.get_latest_update_id();
+                 // We never persist an update with update_id = CLOSED_CHANNEL_UPDATE_ID
+                 let end = cmp::min(
+                     start.saturating_add(self.maximum_pending_updates),
+                     CLOSED_CHANNEL_UPDATE_ID - 1,
+                 );
+                 Some((start, end))
+             }
+             _ => None
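To make the back-off expectation mentioned above concrete, here is a minimal, hypothetical sketch of the kind of retry wrapper a KVStore backend could apply internally before surfacing an error; nothing like this is part of the KVStore trait itself, and the helper name and parameters are purely illustrative.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical retry wrapper a backend might use so that transient throttling is
// absorbed internally and only genuinely serious failures reach the caller.
fn read_with_backoff<T, E>(
    mut read: impl FnMut() -> Result<T, E>,
    is_transient: impl Fn(&E) -> bool,
    max_attempts: u32,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(50);
    let mut attempt = 1;
    loop {
        match read() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_attempts && is_transient(&e) => {
                // Transient failure (e.g. throttling): wait and retry with an
                // exponentially increasing delay.
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let mut calls = 0;
    let result: Result<&str, &str> = read_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("throttled") } else { Ok("monitor bytes") }
        },
        |e| *e == "throttled",
        5,
    );
    assert_eq!(result, Ok("monitor bytes"));
}
```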
I don't have a super strong opinion about panicking early on the read failure, but it may be worth noting that it's possible the read failure hit a bad sector/flash segment and the write that we just did moved the data to a new segment/remapped sector and managed to fix the issue, so it's not like it's impossible that continuing could Just Work, even outside of some enterprise environment.

In any case, unless someone feels super strongly, I say we go ahead and merge this.
In summary, I don't think we should panic. In cleanup, we don't even fail if all the delete calls start failing. A read failure in "optimistic cleanup" is no different or worse than a delete failure.

A read failure in cleanup does not result in any mishap later down the line. If, later on, there is a read failure and it is critical for functioning, we should fail there instead of preempting it here (for example, on node restart).
> In summary, I don't think we should panic.

Yes, as mentioned above and elsewhere it feels a bit odd to me to just proceed on read failure, but it's also not really wrong or critical here.

LGTM
    SP::Target: SignerProvider + Sized
{
    // Cleans up monitor updates for given monitor in range `start..=end`.
    fn cleanup_in_range(&self, monitor_name: MonitorName, start: u64, end: u64) {
non-blocking nit: Fine for now, but if we touch this again we may want to change the semantics of cleanup_in_range. I already brought this up in the original PR, but usually start..end in Rust has the semantics start <= x < end (cf. https://doc.rust-lang.org/std/ops/struct.Range.html). So for a reader it may be a bit unexpected to have this work as start..=end. If we want to keep this version we could at least rename the variables to first/last to be a bit clearer and to show that this is unconventional behavior.
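For readers less familiar with the distinction being drawn here, a tiny illustration of Rust's range semantics (nothing LDK-specific):

```rust
fn main() {
    // `start..end` excludes `end`; `start..=end` includes it. A function taking
    // `start`/`end` but behaving like `start..=end` is the surprise being pointed
    // out, hence the suggestion to rename the parameters to first/last.
    let exclusive: Vec<u64> = (3..6).collect();
    let inclusive: Vec<u64> = (3..=6).collect();
    assert_eq!(exclusive, vec![3, 4, 5]);
    assert_eq!(inclusive, vec![3, 4, 5, 6]);
}
```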
I didn't want to complicate the semantics for calculating cleanup_range by changing the end calculation. It probably makes sense to rename it to last. (But not super critical for now, since it is a private function.)
Post-merge ACK

I also agree with not panicking here, as its potential usefulness is outweighed by the unnecessary preemptive failure it might create.

Other than this, the PR looks perfect! Thanks a lot for this great cleanup improvement!

Summary and notes:
- Not needing to clean up the monitor in the case of persist_new_channel makes sense, as we are doing a full persist.
- Not needing to read the old_monitor when update_id % maximum_pending_updates == 0 to determine the cleanup range is a great use of the deterministic length of the cleanup.
- Separating cleanup_in_range into its own function makes the code modular and the functionality usable separately, a good preemptive measure for future updates.
As a small follow-up, we can add a comment in the persist_new_channel fn to explain why we don't need to do an old_monitor read and clean up there.
Previously, we used to clean up monitor updates both at the consolidation threshold and on new block connects. With this change we will only clean up when our consolidation criterion is met. Also, we remove the monitor read from the cleanup logic in the case of update consolidation.

Note: in the case of a channel-closing monitor update, we still need to read the old monitor before persisting the new one in order to determine the cleanup range.

Closes #2706
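Pulling the description together, here is a hedged sketch of the overall decision flow after this change; the names are illustrative and the real logic is in MonitorUpdatingPersister::update_persisted_channel in lightning/src/util/persist.rs.

```rust
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

// What a persist call decides to do for a given update, per the description above.
#[derive(Debug, PartialEq)]
enum PersistAction {
    // Channel close: write the full monitor; the cleanup range requires reading
    // the old monitor first, so it is resolved separately.
    FullPersistAndCleanupFromOldMonitor,
    // Consolidation threshold hit: write the full monitor and clean a range of
    // deterministic length, no old-monitor read needed.
    FullPersistAndCleanup { start: u64, end: u64 },
    // Anything else (including updates driven by block connects): just write the
    // incremental update, no cleanup.
    IncrementalPersistOnly,
}

fn decide(update_id: u64, latest_update_id: u64, maximum_pending_updates: u64) -> PersistAction {
    if latest_update_id == CLOSED_CHANNEL_UPDATE_ID {
        PersistAction::FullPersistAndCleanupFromOldMonitor
    } else if maximum_pending_updates > 0 && update_id % maximum_pending_updates == 0 {
        PersistAction::FullPersistAndCleanup {
            start: update_id.saturating_sub(maximum_pending_updates),
            end: update_id.saturating_sub(1),
        }
    } else {
        PersistAction::IncrementalPersistOnly
    }
}

fn main() {
    assert_eq!(decide(300, 300, 100), PersistAction::FullPersistAndCleanup { start: 200, end: 299 });
    assert_eq!(decide(301, 301, 100), PersistAction::IncrementalPersistOnly);
    assert_eq!(decide(42, CLOSED_CHANNEL_UPDATE_ID, 100), PersistAction::FullPersistAndCleanupFromOldMonitor);
}
```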