Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lnd take ages to start up #1648

Closed
slush0 opened this issue Jul 29, 2018 · 18 comments
Closed

lnd take ages to start up #1648

slush0 opened this issue Jul 29, 2018 · 18 comments

Comments

@slush0
Copy link

slush0 commented Jul 29, 2018

Background

lnd needs around 10-15 minutes after startup to actually connecting to peers. There's no unexpected activity (network or high CPU load). Then it always start talking to peer, but it looks like it is stuck on some timeouts (maybe waiting to timeouts to unreachable peers, serially?)

After this time, node normally start up with 300+ peers.

I started to notice this kind of behavior after it started maintaining high amount of peers, but it might be a coincidence.

Your environment

{
    "identity_pubkey": "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469",
    "alias": "tady je slushovo",
    "num_pending_channels": 4,
    "num_active_channels": 292,
    "num_peers": 285,
    "block_height": 534252,
    "block_hash": "00000000000000000015b1681f2db980643fcfe6c36bc330c56e20731bed7010",
    "synced_to_chain": true,
    "testnet": false,
    "chains": [
        "bitcoin"
    ],
    "uris": [
        "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469@109.239.79.181:9735"
    ],
    "best_header_timestamp": "1532860554",
    "version": "0.4.2-beta commit=92b0b10dc75de87be3a9f895c8dfc5a84a2aec7a"
}

Linux server01 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux

  • bitcoind 0.16.1

Steps to reproduce

Any restart of lnd.

Expected behaviour

After lnd restart, lnd should open connections to peers and start communicating.

Actual behaviour

Log is filling with lines like this, but no other activity can be seen.

2018-07-29 10:41:01.408 [INF] NTFN: New spend subscription: utxo=5f2274c650d5838f06c3dd213b0fb366462e31accdc82e8f0d7fd4cb91c603ec:0
2018-07-29 10:41:01.408 [INF] CNCT: Close observer for ChannelPoint(58374cb32884e7b03c643540c49217a28415bdd2523d4548b940d07a49214a56:0) active
2018-07-29 10:41:01.536 [INF] NTFN: New spend subscription: utxo=26d7408e94ab5f75d470fdde3790544f3510d7b126e94d223e983e963bd980ee:0
2018-07-29 10:41:01.536 [INF] CNCT: Close observer for ChannelPoint(5f2274c650d5838f06c3dd213b0fb366462e31accdc82e8f0d7fd4cb91c603ec:0) active
2018-07-29 10:41:01.610 [INF] NTFN: New spend subscription: utxo=a813a909de85fb573a8bb9c747114eb3c5f98903b1994235ee086a2edc24629a:1
2018-07-29 10:41:01.610 [INF] CNCT: Close observer for ChannelPoint(26d7408e94ab5f75d470fdde3790544f3510d7b126e94d223e983e963bd980ee:0) active
2018-07-29 10:41:01.710 [INF] NTFN: New spend subscription: utxo=6af2f741ba31a9e6e5ab1073e795f7145e8cda3dd5628b11ee5dc1192c8506b0:0
2018-07-29 10:41:01.710 [INF] CNCT: Close observer for ChannelPoint(a813a909de85fb573a8bb9c747114eb3c5f98903b1994235ee086a2edc24629a:1) active
2018-07-29 10:41:01.811 [INF] NTFN: New spend subscription: utxo=e2911f11feef0653be37c4ec1be4b799ca4cfdd302b6927d2776eb4f550e5d56:0
2018-07-29 10:41:01.811 [INF] CNCT: Close observer for ChannelPoint(6af2f741ba31a9e6e5ab1073e795f7145e8cda3dd5628b11ee5dc1192c8506b0:0) active

getinfo:

$ lncli getinfo
{
    "identity_pubkey": "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469",
    "alias": "tady je slushovo",
    "num_pending_channels": 4,
    "num_active_channels": 0,
    "num_peers": 0,
    "block_height": 534252,
    "block_hash": "00000000000000000015b1681f2db980643fcfe6c36bc330c56e20731bed7010",
    "synced_to_chain": true,
    "testnet": false,
    "chains": [
        "bitcoin"
    ],
    "uris": [
        "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469@109.239.79.181:9735"
    ],
    "best_header_timestamp": "1532860554",
    "version": "0.4.2-beta commit=92b0b10dc75de87be3a9f895c8dfc5a84a2aec7a"
}
@slush0
Copy link
Author

slush0 commented Jul 29, 2018

I just wrote simple reconnecting script using list of inactive channels and ConnectPeer gRPC call (as well as lncli connect <host> just hangs out for many unreachable IPs for really long time (minutes).

I think this may be relevant to this issue, especially if lnd doesn't do connection logic in parallel (?).

Some reasonable timeout for both RPC call and startup reconnection would be useful.

@slush0
Copy link
Author

slush0 commented Jul 29, 2018

Okay, I've found perm=True parameter of ConnectPeer, that solved timing out on RPC part. Also after script finished, amount of active channels jumped by 20. So obviously there were some reachable nodes, but lnd didn't try to connect to them aggressively enough.

@halfik
Copy link

halfik commented Jul 29, 2018

There is some problem with reconnecting to peers. Dunno why but it reconnects to some and never to others. This peer is a good example: 02ad6fb8d693dc1e4569bcedefadf5f72a931ae027dc0f0c544b34c1c6f3b9a02b

I lose connection to him at least daily and it never reconnect.

@Roasbeef
Copy link
Member

Related to #1569 fixed by #1628. The issue here is a series of RPC calls that are currently synchronous, but should actually be asynchronous.

@Roasbeef
Copy link
Member

Also after script finished, amount of active channels jumped by 20. So obviously there were some reachable nodes, but lnd didn't try to connect to them aggressively enough.

We'll try to connect to them, but then exponentially back off if a connection fails. There's a flaw right now though, where we won't reset the back off once the peer is know to be "stable".

@slush0
Copy link
Author

slush0 commented Aug 1, 2018

I'm running on latest master (commit=ad7f87ef18af2e26ec80b9d6d5be7086515fea8a) and the problem supposedly fixed by #1628 didn't improve. 10+ minutes of uptime and still 0 peers.

@halseth
Copy link
Contributor

halseth commented Aug 1, 2018

Sounds like an issue with peer bootstrapping, which is not altered by that PR.

@halseth halseth reopened this Aug 1, 2018
@Roasbeef
Copy link
Member

Roasbeef commented Aug 1, 2018

Oh yeah peer bootstrapping is #1658.

@Roasbeef
Copy link
Member

Roasbeef commented Aug 1, 2018

What do you see in the logs immediately after start up?

@Roasbeef
Copy link
Member

Roasbeef commented Aug 1, 2018

If you check out #1658, is the start up snappier? Basically right now we'll block the server on finishing the handshake with each peer, this instead makes all that async so we can continue to seek out and accept connections from other peers.

@Roasbeef
Copy link
Member

Now that #1658 this should be resolved. Let us know if yu're having any further issues in this area, and we'll re-open the issue! Thanks.

@slush0
Copy link
Author

slush0 commented Sep 11, 2018

I don't think this issue has been fully resolved. I was unable to try new version until today:

2018-09-11 16:35:23.934 [INF] LTND: Version 0.5.0-beta commit=1941353fb28755a170793e43595601d75c8f3dda

Now there're hundreds lines like (see timestamps) ended with RPC thread starting. So it already does something for one minute, even before touching P2P network :

2018-09-11 16:36:27.719 [INF] RPCS: RPC server listening on 127.0.0.1:10009
2018-09-11 16:36:27.722 [INF] RPCS: gRPC proxy started at 127.0.0.1:10010
2018-09-11 16:36:27.726 [INF] LTND: Waiting for chain backend to finish sync, start_height=540965
2018-09-11 16:36:27.735 [INF] LTND: Chain backend is fully synced (end_height=540965)!
2018-09-11 16:36:27.931 [INF] HSWC: Starting HTLC Switch

Then there's another flood of messages, we're almost 4 minutes of uptime:

2018-09-11 16:39:17.009 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.009 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.009 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:44:22.473 [INF] CNCT: ChannelArbitrator(af69b7957595c970dcfb82a0acef9d0c211fe6b1bd0ed5c986ea324a6fbba2df:1): starting state=StateDefault
2018-09-11 16:44:22.796 [INF] CNCT: ChannelArbitrator(cc766b5f44058117c403b7fdb2150b21948d6fbbc66db81c899814c55a6eb027:0): starting state=StateDefault
2018-09-11 16:44:23.340 [INF] CNCT: ChannelArbitrator(694b1352952104fc9d6a56dddb6f56281419f845b263b84ef63fe6f80a41c460:0): starting state=StateDefault
2018-09-11 16:44:23.855 [INF] CNCT: ChannelArbitrator(3e9b0a57d2489d583328457ed7011f43ce5ee111b79c3b02dd79c4732cd57d41:1): starting state=StateDefault
2018-09-11 16:44:24.154 [INF] CNCT: ChannelArbitrator(2145f0b05c7b7a5adffff567bf897355e777d2545ad0f05d524486e4736cacb6:0): starting state=StateDefault
2018-09-11 16:44:24.797 [INF] CNCT: ChannelArbitrator(4e54c28f07cbb45022beb07ebdad5f3e70658f90136918e149024a3f1ba93978:0): starting state=StateWaitingFullResolution
2018-09-11 16:44:25.490 [INF] CNCT: ChannelArbitrator(4e54c28f07cbb45022beb07ebdad5f3e70658f90136918e149024a3f1ba93978:0): still awaiting contract resolution
2018-09-11 16:44:25.491 [INF] CNCT: ChannelArbitrator(4e54c28f07cbb45022beb07ebdad5f3e70658f90136918e149024a3f1ba93978:0): relaunching 1 contract resolvers
2018-09-11 16:44:25.491 [INF] CNCT: ChannelArbitrator(c08b00315a4a2eeef269af76f3d4beae05358361d18877e8d30309f8d84d6e64:1): starting state=StateDefault

Listening on 9735 starts after 10 minutes of startup, and immediately we're receiving inbound connections.

2018-09-11 16:45:03.658 [INF] CMGR: Server listening on [::]:9735
2018-09-11 16:45:03.658 [ERR] CMGR: Can't accept connection: unable to accept connection from 92.53.89.122:42322: EOF
2018-09-11 16:45:03.659 [ERR] CMGR: Can't accept connection: unable to accept connection from 92.53.89.122:45200: EOF
2018-09-11 16:45:03.659 [INF] SRVR: New inbound connection from 87.123.103.103:60428
2018-09-11 16:45:03.659 [INF] SRVR: Finalizing connection to 0268bfc4460681df5c1358429ac091094be05cf73aa6fbcda9fd4f3ccaf162c1c9, inbound=true

So I expect the issue is not related to P2P timeouts, because it happen even before the node tries to talk to them.

@halseth
Copy link
Contributor

halseth commented Sep 25, 2018

Possibly related to chain rescans at startup, which we are working on improving. Will keep the issue open to track this!

@halseth halseth reopened this Sep 25, 2018
@wpaulino
Copy link
Contributor

All the chain rescan are async now, so this would probably point to something else slowing down startup.

@cfromknecht
Copy link
Contributor

Just seems like txindex is off. Performance improvements for nodes w/o are incoming

@Roasbeef
Copy link
Member

Roasbeef commented Oct 26, 2018

Should be resolved by the upcoming revival of the height hint cache!

@Roasbeef
Copy link
Member

Roasbeef commented Nov 1, 2018

Hi @slush0, we just merged the height hint cache? Can you give the current master a spin now? The first restart will still take some time, the second one however, should be much faster as the rescan information will now be cached.

@Roasbeef
Copy link
Member

Also we recently merged #2211, which should speed up the initial rescan if a channel was closed while your node was offline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants