Stop cleaning monitor updates on new block connect #2779
Conversation
Codecov Report

❗ Attention: Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main    #2779      +/-   ##
==========================================
+ Coverage   88.53%   88.55%   +0.02%
==========================================
  Files         114      114
  Lines       89497    91569    +2072
  Branches    89497    91569    +2072
==========================================
+ Hits        79237    81092    +1855
- Misses       7882     7988     +106
- Partials     2378     2489     +111

☔ View full report in Codecov by Sentry.
let maybe_old_monitor = self.read_monitor(&monitor_name);
match maybe_old_monitor {
    Ok((_, ref old_monitor)) => {
        // Check that this key isn't already storing a monitor with a higher update_id
Removed this check since we no longer read old_monitor in the full-monitor-persist case.
Yeah, I think it's fine. I'd previously suggested doing it here, but really ChainMonitor should handle it for us.
    }
    // This means the channel monitor is new.
    Err(ref e) if e.kind() == io::ErrorKind::NotFound => {}
    _ => return chain::ChannelMonitorUpdateStatus::UnrecoverableError,
We no longer fail in this case.
@@ -641,59 +624,6 @@ where
    &monitor_bytes,
) {
    Ok(_) => {
        // Assess cleanup. Typically, we'll clean up only between the last two known full
Moved the cleanup logic to update_persisted_channel, so we only clean up when updates are consolidated due to maximum_pending_updates.
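To make the consolidation criterion concrete, here is a minimal sketch (the function name and shape are illustrative, not the actual code in update_persisted_channel): cleanup only happens when the update counter hits a multiple of maximum_pending_updates, and the range to delete then has a deterministic length, so no old-monitor read is needed.

```rust
// Illustrative sketch only; the real logic lives in
// MonitorUpdatingPersister::update_persisted_channel in lightning/src/util/persist.rs.
fn consolidation_cleanup_range(update_id: u64, maximum_pending_updates: u64) -> Option<(u64, u64)> {
    // Consolidate only at the threshold; block connects no longer trigger cleanup.
    if maximum_pending_updates > 0 && update_id % maximum_pending_updates == 0 {
        // The previous incremental updates are now covered by the full monitor
        // write, so the range to delete has a deterministic length.
        Some((update_id.saturating_sub(maximum_pending_updates), update_id.saturating_sub(1)))
    } else {
        None
    }
}

fn main() {
    // With maximum_pending_updates = 100, update 300 triggers cleanup of 200..=299.
    assert_eq!(consolidation_cleanup_range(300, 100), Some((200, 299)));
    // A non-threshold update (e.g. one driven by a block connect) triggers no cleanup.
    assert_eq!(consolidation_cleanup_range(301, 100), None);
}
```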
} else if update_id != start && monitor.get_latest_update_id() != CLOSED_CHANNEL_UPDATE_ID
{
    // We're deleting something we should know doesn't exist.
    panic!(
There is a possibility of issuing some extra deletes, since no update is persisted on block connect but we will still try to clean up an update with that update_id.
Squashed commits into one.
Tagging @domZippilli as well.
LGTM. I didn't clone it and take an in-depth look yet, but this seems to make sense. Incidentally, we had a problem at …

I guess the main question on my mind is: we were designing this very defensively on the first pass. How confident are we that we really don't need to do that monitor read anymore?
For users of the … The more general reason we had it is for those trying to use this without a …
For now, I'm not too worried about overwriting the monitor without reading it first, since we already do that in the non-MUP implementation.

I think earlier we were focusing more on not creating redundant deletes. In the worst case there will be some stale, un-deleted monitor updates, but that should be fine.
Did a first pass, generally looks good. Some questions, some nits.
lightning/src/util/persist.rs (outdated)
    );
    Some((start, end))
}
_ => None
Is there a reason why we shouldn't do some cleanup (i.e., default to the behavior of the else clause below) if we fail to read the old monitor?
We could also probably just panic instead.
As mentioned here, earlier we were returning an UnrecoverableError in this case, but I think not failing/panicking would be better here, since the worst case is wasted storage and not something critical to the functioning of the node.

We can't do some default cleanup since we don't know which update range to clean up. Let me know if you have any suggestions.
Defaulting to the else clause below won't help, since it is pointless to clean up u64::MAX .. u64::MAX - max_pending_updates: it will not result in any deletes, just extra IO calls.
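A quick illustration of why that default range is a no-op for closed channels, taking CLOSED_CHANNEL_UPDATE_ID to be u64::MAX as noted further down the thread (the numbers here are purely for demonstration):

```rust
// For a closed channel, get_latest_update_id() returns CLOSED_CHANNEL_UPDATE_ID
// (u64::MAX in LDK). Deriving the "else clause" range from it targets update ids
// that were never persisted, so every delete would be a no-op.
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

fn main() {
    let maximum_pending_updates = 100u64;
    let latest = CLOSED_CHANNEL_UPDATE_ID;
    let start = latest.saturating_sub(maximum_pending_updates);
    // Real update ids count up from small integers, nowhere near u64::MAX, so
    // this range contains no existing keys: just wasted IO calls, no deletes.
    println!("would issue deletes for nonexistent keys {}..={}", start, latest);
}
```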
> as mentioned here, earlier we were returning an UnrecoverableError in this case, but i think not failing/panic would be better here. Since the worst case is wasted storage and not something critical to functioning of node.

Hmm, yes, we could ignore the error in this specific case I guess, but generally unrecoverable IO errors are fatal for us? So "as a policy" it might be generally preferable to stop as soon as we see one? But I think we also had this discussion in the original PR and the conclusion was not to error out under certain circumstances, so I don't want to block this PR by restarting that conversation.

> we can't do some default cleanup since we don't know which update_range to cleanup. Lmk if you have any suggestions.
>
> defaulting to else clause below won't help since it is pointless to cleanup u64::MAX .. u64::MAX-max_pending_updates. It will not result in any deletes, just extra IO calls.

Ah, right, the update ids will be u64::MAX for closed channels. Fair point. It's still a bit weird to get an unrecoverable IO error here and just continue on as if nothing happened. 🤷♂️

If you're positive that we shouldn't error out or panic here, let's at least leave a comment that the error case is unhandled and explain the motivation why this is the case.
Yeah, in the earlier version of the code in this PR it was a bit more evident; now it is slightly hidden (with .ok()). Will add a comment.

> It's still a bit weird to get an unrecoverable IO error here and just continue on as if nothing happened.

Note that we were not getting an unrecoverable error, but converting a monitor read failure into an unrecoverable one. There are other instances, such as an error in a delete call, where we don't fail. Best to treat it as optimistic cleanup.
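As a sketch of the "optimistic cleanup" idea (the helper and its signature are hypothetical, not the actual cleanup code): individual delete failures are swallowed, because the worst outcome is a few stale, un-deleted update entries rather than anything critical to node operation.

```rust
use std::io;

// Hypothetical helper illustrating best-effort cleanup: errors from individual
// delete calls are ignored with `.ok()`, mirroring how a failed read of the old
// monitor is also tolerated here.
fn optimistic_cleanup<F>(mut delete_update: F, start: u64, end: u64)
where
    F: FnMut(u64) -> Result<(), io::Error>,
{
    for update_id in start..=end {
        // Best-effort: a failure only leaves a stale update behind.
        delete_update(update_id).ok();
    }
}

fn main() {
    // Example: every other delete "fails", but cleanup still completes.
    optimistic_cleanup(
        |id| {
            if id % 2 == 0 {
                Err(io::Error::new(io::ErrorKind::Other, "throttled"))
            } else {
                Ok(())
            }
        },
        1,
        5,
    );
}
```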
> we could ignore the error in this specific case I guess, but generally unrecoverable IO errors are fatal for us? So "as a policy" it might be generally preferable to stop as soon as we see one?

I don't think that is an ideal policy for all cases. I would hate it if my node panicked in production because a read call got throttled, when it was totally optional for the functioning of my node.
> Note that we were not getting an unrecoverable error but converting monitor read failure into unrecoverable.

Yeah, that's why I spoke of lower-case 'unrecoverable error' rather than UnrecoverableError. In any case, it's still unrecoverable for us at this point.

> I don't think that is an ideal policy for all cases. I would hate it if my node panics in production because a read call got throttled. And it was totally optional for functioning of my node.

If we 'properly' fail to read the monitor due to an IO error at this point, it would indicate something is seriously wrong and the user needs to take action ASAP, as much more serious issues could arise from broken IO down the line. That is really the purpose of UnrecoverableError, no? And, as Dom mentioned below, any recoverable errors should be handled in-line by the KVStore implementation.

I mean, sure, it's not strictly a bug here to continue on if we fail to read the monitor, but I'm not sure we want to trust the user to have proper log monitoring set up to catch that IO errors are happening before it's too late.
I agree that panicking here, though potentially inconvenient for the user, can definitely save them from a big mishap later down the line.

However, I'll also note one thing: the old_monitor is only read once throughout the life of a channel, namely when the channel is closed. So if IO errors are happening for a user, they should have been detected by this point in the life of the channel. A panic here, though potentially useful, will probably also be inconsequential in the grand scheme of things.
I might be nitpicking, but I thought an expectation of the KVStore contract is that the backend would do things like exponential back-off retries to deal with throttling. So a failure from any `impl KVStore` would likely indicate a serious problem.
In lightning/src/util/persist.rs:
+ // We could write this update, but it meets criteria of our design that calls for a full monitor write.
+ let monitor_update_status = self.persist_new_channel(funding_txo, monitor, monitor_update_call_id);
+
+ if let ChannelMonitorUpdateStatus::Completed = monitor_update_status {
+     let cleanup_range = if monitor.get_latest_update_id() == CLOSED_CHANNEL_UPDATE_ID {
+         match maybe_old_monitor {
+             Some(Ok((_, ref old_monitor))) => {
+                 let start = old_monitor.get_latest_update_id();
+                 // We never persist an update with update_id = CLOSED_CHANNEL_UPDATE_ID
+                 let end = cmp::min(
+                     start.saturating_add(self.maximum_pending_updates),
+                     CLOSED_CHANNEL_UPDATE_ID - 1,
+                 );
+                 Some((start, end))
+             }
+             _ => None
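To make the back-off expectation mentioned above concrete, here is a minimal, hypothetical sketch of the kind of retry wrapper a KVStore backend could apply internally before surfacing an error; nothing like this is part of the KVStore trait itself, and the helper name and parameters are purely illustrative.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical retry wrapper a backend might use so that transient throttling is
// absorbed internally and only genuinely serious failures reach the caller.
fn read_with_backoff<T, E>(
    mut read: impl FnMut() -> Result<T, E>,
    is_transient: impl Fn(&E) -> bool,
    max_attempts: u32,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(50);
    let mut attempt = 1;
    loop {
        match read() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_attempts && is_transient(&e) => {
                // Transient failure (e.g. throttling): wait and retry with an
                // exponentially increasing delay.
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let mut calls = 0;
    let result: Result<&str, &str> = read_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("throttled") } else { Ok("monitor bytes") }
        },
        |e| *e == "throttled",
        5,
    );
    assert_eq!(result, Ok("monitor bytes"));
}
```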
I don't have a super strong opinion about panicking early on the read failure, but it may be worth noting that it's possible the read failure hit a bad sector/flash segment and the write that we just did moved the data to a new segment/remapped sector and managed to fix the issue, so it's not like it's impossible that continuing could Just Work, even outside of some enterprise environment.

In any case, unless someone feels super strongly, I say we go ahead and merge this.
In summary, I don't think we should panic. In cleanup, we don't even fail if all the delete calls start failing. A read failure in "optimistic cleanup" is no different or worse than a delete failure.

A read failure in cleanup does not result in any mishap later down the line. If, later on, there is a read failure and it is critical for functioning, we should fail there instead of preempting it here (for example, on node restart).
> In summary, I don't think we should panic.

Yes, as mentioned above and elsewhere it feels a bit odd to me to just proceed on read failure, but it's also not really wrong or critical here.

LGTM
    SP::Target: SignerProvider + Sized
{
    // Cleans up monitor updates for given monitor in range `start..=end`.
    fn cleanup_in_range(&self, monitor_name: MonitorName, start: u64, end: u64) {
non-blocking nit: Fine for now, but if we touch this again we may want to change the semantics of cleanup_in_range. I already brought this up in the original PR, but usually start..end in Rust has the semantics start <= x < end (cf. https://doc.rust-lang.org/std/ops/struct.Range.html). So for a reader it may be a bit unexpected to have this work as start..=end. If we want to keep this version we could at least rename the variables to first/last to be a bit clearer and to show that this is unconventional behavior.
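For readers less familiar with the distinction being drawn here, a tiny illustration of Rust's range semantics (nothing LDK-specific):

```rust
fn main() {
    // `start..end` excludes `end`; `start..=end` includes it. A function taking
    // `start`/`end` but behaving like `start..=end` is the surprise being pointed
    // out, hence the suggestion to rename the parameters to first/last.
    let exclusive: Vec<u64> = (3..6).collect();
    let inclusive: Vec<u64> = (3..=6).collect();
    assert_eq!(exclusive, vec![3, 4, 5]);
    assert_eq!(inclusive, vec![3, 4, 5, 6]);
}
```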
I didn't want to complicate the semantics for calculating cleanup_range by changing the end calculation. It probably makes sense to rename it to last. (But not super critical for now, since it is a private function.)
Post-merge ACK

I also agree with not panicking here, as its potential usefulness is outweighed by the unnecessary preemptive failure it might create.

Other than this, the PR looks perfect! Thanks a lot for this great cleanup improvement!

Summary and notes:
- Not needing to clean up the monitor in the case of persist_new_channel makes sense, as we are doing a full persist.
- Not needing to read the old_monitor when update_id % maximum_pending_updates == 0 to determine the cleanup range is a great use of the deterministic length of the cleanup.
- Separating cleanup_in_range into its own function makes the code modular and the functionality usable separately, a good preemptive measure for future updates.
As a small follow-up, we can add a comment in the persist_new_channel fn to explain why we don't need to do an old_monitor read and clean up there.
Previously, we used to clean up monitor updates both at the consolidation threshold and on new block connects. With this change we will only clean up when our consolidation criterion is met. Also, we remove the monitor read from the cleanup logic in the case of update consolidation.

Note: in the case of a channel-closing monitor update, we still need to read the old monitor before persisting the new one in order to determine the cleanup range.

Closes #2706
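Pulling the description together, here is a hedged sketch of the overall decision flow after this change; the names are illustrative and the real logic is in MonitorUpdatingPersister::update_persisted_channel in lightning/src/util/persist.rs.

```rust
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

// What a persist call decides to do for a given update, per the description above.
#[derive(Debug, PartialEq)]
enum PersistAction {
    // Channel close: write the full monitor; the cleanup range requires reading
    // the old monitor first, so it is resolved separately.
    FullPersistAndCleanupFromOldMonitor,
    // Consolidation threshold hit: write the full monitor and clean a range of
    // deterministic length, no old-monitor read needed.
    FullPersistAndCleanup { start: u64, end: u64 },
    // Anything else (including updates driven by block connects): just write the
    // incremental update, no cleanup.
    IncrementalPersistOnly,
}

fn decide(update_id: u64, latest_update_id: u64, maximum_pending_updates: u64) -> PersistAction {
    if latest_update_id == CLOSED_CHANNEL_UPDATE_ID {
        PersistAction::FullPersistAndCleanupFromOldMonitor
    } else if maximum_pending_updates > 0 && update_id % maximum_pending_updates == 0 {
        PersistAction::FullPersistAndCleanup {
            start: update_id.saturating_sub(maximum_pending_updates),
            end: update_id.saturating_sub(1),
        }
    } else {
        PersistAction::IncrementalPersistOnly
    }
}

fn main() {
    assert_eq!(decide(300, 300, 100), PersistAction::FullPersistAndCleanup { start: 200, end: 299 });
    assert_eq!(decide(301, 301, 100), PersistAction::IncrementalPersistOnly);
    assert_eq!(decide(42, CLOSED_CHANNEL_UPDATE_ID, 100), PersistAction::FullPersistAndCleanupFromOldMonitor);
}
```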