Avoid disconnecting all peers if user code is slow #1269
let updates_available =
    channel_manager.await_persistable_update_timeout(Duration::from_millis(100));
if updates_available {
    let persist_start = Instant::now();
Considering the 100ms timeout, it may be simpler to use just one timer ending outside the `if` block.
I'm not sure I understand, are you saying just move this timer outside the if block?
Meant we could potentially combine the `ev_handle_start` and `persist_start` timers into a single timer.
We don't know how long `await_persistable_update_timeout` takes, though, and the goal here is (mostly) to measure how long it took as an indirect way to figure out whether we went to background on, e.g., iOS.
Hmm... isn't that what the timeout is for? Maybe I'm misunderstanding how it works.
I dropped all the addition stuff; it was actually incorrect because of a missing rebase anyway. Does the comment in the `if` block make sense?
Codecov Report
@@            Coverage Diff             @@
##             main    #1269      +/-   ##
==========================================
- Coverage   90.40%   90.39%   -0.02%
==========================================
  Files          70       70
  Lines       38118    38120       +2
==========================================
- Hits        34462    34458       -4
- Misses       3656     3662       +6
// processing was slow at the top of the loop. For example, the sample client
// may call Bitcoin Core RPCs during event handling, which very often takes
// more than a handful of seconds to complete, and shouldn't disconnect all our
// peers.
log_trace!(logger, "Awoke after more than double our ping timer, disconnecting peers.");
Update reference to "double" in glorified comment. 😛 Likewise in the preceding comment.
Actually, I just swapped the comparison back to 2xPING_TIMER, which I think is more appropriate.
Actually, never mind, this is a great opportunity to increase our ping timer while still being able to disconnect quickly if we get background'd. Will fix.
// Note that we have to take care to not get here just because user event
// processing was slow at the top of the loop. For example, the sample client
// may call Bitcoin Core RPCs during event handling, which very often takes
// more than a handful of seconds to complete, and shouldn't disconnect all our
// peers.
Is this comment relevant now that we don't time event processing?
Arguably yes, the point being that we time only the await, not the event processing.
In the sample client (and likely other downstream users), event processing may block on slow operations (e.g. Bitcoin Core RPCs) and ChannelManager persistence may take some time. This should be fine, except that we consider this a case of possible backgrounding and disconnect all of our peers when it happens. Instead, we here avoid considering event processing time in the time between PeerManager events.
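The idea of timing only the wait, and not the event handling, can be sketched roughly as follows. This is a standalone illustration rather than the code in this PR: `run_iteration`, `probably_backgrounded`, and the `threshold` parameter are hypothetical names, and the closures stand in for the real wait (e.g. `await_persistable_update_timeout`) and the user's event handler.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: returns true if a wait that should have been short
/// took longer than `threshold`, hinting that the process was suspended.
fn probably_backgrounded(wait_elapsed: Duration, threshold: Duration) -> bool {
    wait_elapsed > threshold
}

/// One iteration of a simplified background loop (an assumption-laden sketch).
/// Only the time spent in `wait` is measured; `handle_events` may be
/// arbitrarily slow without affecting the result.
fn run_iteration<W: FnOnce(), H: FnOnce()>(
    wait: W,
    handle_events: H,
    threshold: Duration,
) -> bool {
    // Start the clock immediately before the wait...
    let wait_start = Instant::now();
    wait();
    // ...and stop it immediately after, before any user code runs.
    let wait_elapsed = wait_start.elapsed();

    // Slow user code (e.g. blocking RPC calls) runs outside the timed window.
    handle_events();

    probably_backgrounded(wait_elapsed, threshold)
}
```

With this shape, a slow event handler alone never makes `run_iteration` return `true`; only a wait that overruns the threshold does, which is the signal treated as possible backgrounding.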
Because many lightning nodes can take quite some time to respond to pings, the five-second ping timer can sometimes cause spurious disconnects even though a peer is online. However, in part as a response to mobile users, where a connection may be lost as a result of only a short time with the app in a "paused" state, we had a rather aggressive ping time to ensure we would disconnect quickly. Since we now just use a fixed time for the "went to sleep" detection, we can somewhat increase the ping timer. We still want to be fairly aggressive to avoid sending HTLCs to a peer that is offline, but the tradeoff between spurious disconnections and stuck payments likely doesn't need to be quite as aggressive.
0b769f2 to 2d3a210
Squashed without diff from
In the sample client (and likely other downstream users), event
processing may block on slow operations (e.g. Bitcoin Core RPCs)
and ChannelManager persistence may take some time. This should be
fine, except that we consider this a case of possible backgrounding
and disconnect all of our peers when it happens.
Instead, we here avoid considering event processing time in the
time between PeerManager events.
This is one commit extracted from #1023.