Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid disconnecting all peers if user code is slow #1269

Conversation

TheBlueMatt
Copy link
Collaborator

In the sample client (and likely other downstream users), event
processing may block on slow operations (e.g. Bitcoin Core RPCs)
and ChannelManager persistence may take some time. This should be
fine, except that we consider this a case of possible backgrounding
and disconnect all of our peers when it happens.

Instead, we here avoid considering event processing time in the
time between PeerManager events.

This is one commit extracted from #1023.

@TheBlueMatt TheBlueMatt added this to the 0.0.105 milestone Jan 20, 2022
Comment on lines 231 to 234
let updates_available =
channel_manager.await_persistable_update_timeout(Duration::from_millis(100));
if updates_available {
let persist_start = Instant::now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the 100ms timeout, may be simpler just to use one timer ending outside the if block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand, are you saying just move this timer outside the if block?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant we could potentially combine ev_handle_start and persist_start timers into a single timer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know how long await_persistable_update_timeout takes, though, and the goal here is (mostly) to measure how long it took as an indirect way to figure out whether we went to background on, eg, iOS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... isn't that what the timeout is for? Maybe I'm misunderstanding how it works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped all the addition stuff, it was actually incorrect cause of lack of rebase anyway, does the comment in the if block make sense?

@codecov-commenter
Copy link

codecov-commenter commented Jan 20, 2022

Codecov Report

Merging #1269 (68de973) into main (d741fb1) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 68de973 differs from pull request most recent head 0b769f2. Consider uploading reports for the commit 0b769f2 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1269      +/-   ##
==========================================
- Coverage   90.40%   90.39%   -0.02%     
==========================================
  Files          70       70              
  Lines       38118    38120       +2     
==========================================
- Hits        34462    34458       -4     
- Misses       3656     3662       +6     
Impacted Files Coverage Δ
lightning-background-processor/src/lib.rs 93.04% <100.00%> (+0.06%) ⬆️
lightning/src/ln/functional_tests.rs 97.27% <0.00%> (-0.10%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d741fb1...0b769f2. Read the comment docs.

// processing was slow at the top of the loop. For example, the sample client
// may call Bitcoin Core RPCs during event handling, which very often takes
// more than a handful of seconds to complete, and shouldn't disconnect all our
// peers.
log_trace!(logger, "Awoke after more than double our ping timer, disconnecting peers.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update reference to "double" in glorified comment. 😛 Likewise in the preceding comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just swapped the comparison back to 2xPING_TIMER, which I think is more appropriate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, nevermind, this is a great opportunity to increase our ping timer while still being able to disconnect quickly if we get background'd. WIll fix.

Comment on lines +257 to +260
// Note that we have to take care to not get here just because user event
// processing was slow at the top of the loop. For example, the sample client
// may call Bitcoin Core RPCs during event handling, which very often takes
// more than a handful of seconds to complete, and shouldn't disconnect all our
// peers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment relevant now that we don't time event processing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably yes, the point being that we time only the await, not the event processing.

jkczyz
jkczyz previously approved these changes Jan 21, 2022
In the sample client (and likely other downstream users), event
processing may block on slow operations (e.g. Bitcoin Core RPCs)
and ChannelManager persistence may take some time. This should be
fine, except that we consider this a case of possible backgrounding
and disconnect all of our peers when it happens.

Instead, we here avoid considering event processing time in the
time between PeerManager events.
Because many lightning nodes can take quite some time to respond to
pings, the five second ping timer can sometimes cause spurious
disconnects even though a peer is online. However, in part as a
response to mobile users where a connection may be lost as result
of only a short time with the app in a "paused" state, we had a
rather aggressive ping time to ensure we would disconnect quickly.

However, since we now just used a fixed time for the "went to
sleep" detection, we can somewhat increase the ping timer. We still
want to be fairly aggressive to avoid sending HTLCs to a peer that
is offline, but the tradeoff between spurious disconnections and
stuck payments is likely doesn't need to be quite as aggressive.
@TheBlueMatt TheBlueMatt force-pushed the 022-01-no-disconnect-on-slow-persist branch from 0b769f2 to 2d3a210 Compare January 21, 2022 00:37
@TheBlueMatt
Copy link
Collaborator Author

Squashed without diff from 0b769f2a7 to 2d3a21089.

@valentinewallace valentinewallace merged commit b1bdba5 into lightningdevkit:main Jan 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants