lnd take ages to start up #1648

slush0 · 2018-07-29T10:52:21Z

Background

lnd needs around 10-15 minutes after startup to actually connecting to peers. There's no unexpected activity (network or high CPU load). Then it always start talking to peer, but it looks like it is stuck on some timeouts (maybe waiting to timeouts to unreachable peers, serially?)

After this time, node normally start up with 300+ peers.

I started to notice this kind of behavior after it started maintaining high amount of peers, but it might be a coincidence.

Your environment

{
    "identity_pubkey": "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469",
    "alias": "tady je slushovo",
    "num_pending_channels": 4,
    "num_active_channels": 292,
    "num_peers": 285,
    "block_height": 534252,
    "block_hash": "00000000000000000015b1681f2db980643fcfe6c36bc330c56e20731bed7010",
    "synced_to_chain": true,
    "testnet": false,
    "chains": [
        "bitcoin"
    ],
    "uris": [
        "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469@109.239.79.181:9735"
    ],
    "best_header_timestamp": "1532860554",
    "version": "0.4.2-beta commit=92b0b10dc75de87be3a9f895c8dfc5a84a2aec7a"
}

Linux server01 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux

bitcoind 0.16.1

Steps to reproduce

Any restart of lnd.

Expected behaviour

After lnd restart, lnd should open connections to peers and start communicating.

Actual behaviour

Log is filling with lines like this, but no other activity can be seen.

2018-07-29 10:41:01.408 [INF] NTFN: New spend subscription: utxo=5f2274c650d5838f06c3dd213b0fb366462e31accdc82e8f0d7fd4cb91c603ec:0
2018-07-29 10:41:01.408 [INF] CNCT: Close observer for ChannelPoint(58374cb32884e7b03c643540c49217a28415bdd2523d4548b940d07a49214a56:0) active
2018-07-29 10:41:01.536 [INF] NTFN: New spend subscription: utxo=26d7408e94ab5f75d470fdde3790544f3510d7b126e94d223e983e963bd980ee:0
2018-07-29 10:41:01.536 [INF] CNCT: Close observer for ChannelPoint(5f2274c650d5838f06c3dd213b0fb366462e31accdc82e8f0d7fd4cb91c603ec:0) active
2018-07-29 10:41:01.610 [INF] NTFN: New spend subscription: utxo=a813a909de85fb573a8bb9c747114eb3c5f98903b1994235ee086a2edc24629a:1
2018-07-29 10:41:01.610 [INF] CNCT: Close observer for ChannelPoint(26d7408e94ab5f75d470fdde3790544f3510d7b126e94d223e983e963bd980ee:0) active
2018-07-29 10:41:01.710 [INF] NTFN: New spend subscription: utxo=6af2f741ba31a9e6e5ab1073e795f7145e8cda3dd5628b11ee5dc1192c8506b0:0
2018-07-29 10:41:01.710 [INF] CNCT: Close observer for ChannelPoint(a813a909de85fb573a8bb9c747114eb3c5f98903b1994235ee086a2edc24629a:1) active
2018-07-29 10:41:01.811 [INF] NTFN: New spend subscription: utxo=e2911f11feef0653be37c4ec1be4b799ca4cfdd302b6927d2776eb4f550e5d56:0
2018-07-29 10:41:01.811 [INF] CNCT: Close observer for ChannelPoint(6af2f741ba31a9e6e5ab1073e795f7145e8cda3dd5628b11ee5dc1192c8506b0:0) active

getinfo:

$ lncli getinfo
{
    "identity_pubkey": "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469",
    "alias": "tady je slushovo",
    "num_pending_channels": 4,
    "num_active_channels": 0,
    "num_peers": 0,
    "block_height": 534252,
    "block_hash": "00000000000000000015b1681f2db980643fcfe6c36bc330c56e20731bed7010",
    "synced_to_chain": true,
    "testnet": false,
    "chains": [
        "bitcoin"
    ],
    "uris": [
        "02cdf83ef8e45908b1092125d25c68dcec7751ca8d39f557775cd842e5bc127469@109.239.79.181:9735"
    ],
    "best_header_timestamp": "1532860554",
    "version": "0.4.2-beta commit=92b0b10dc75de87be3a9f895c8dfc5a84a2aec7a"
}

The text was updated successfully, but these errors were encountered:

slush0 · 2018-07-29T13:38:12Z

I just wrote simple reconnecting script using list of inactive channels and ConnectPeer gRPC call (as well as lncli connect <host> just hangs out for many unreachable IPs for really long time (minutes).

I think this may be relevant to this issue, especially if lnd doesn't do connection logic in parallel (?).

Some reasonable timeout for both RPC call and startup reconnection would be useful.

slush0 · 2018-07-29T14:04:22Z

Okay, I've found perm=True parameter of ConnectPeer, that solved timing out on RPC part. Also after script finished, amount of active channels jumped by 20. So obviously there were some reachable nodes, but lnd didn't try to connect to them aggressively enough.

halfik · 2018-07-29T21:22:09Z

There is some problem with reconnecting to peers. Dunno why but it reconnects to some and never to others. This peer is a good example: 02ad6fb8d693dc1e4569bcedefadf5f72a931ae027dc0f0c544b34c1c6f3b9a02b

I lose connection to him at least daily and it never reconnect.

Roasbeef · 2018-07-30T02:03:07Z

Related to #1569 fixed by #1628. The issue here is a series of RPC calls that are currently synchronous, but should actually be asynchronous.

Roasbeef · 2018-07-30T02:05:30Z

Also after script finished, amount of active channels jumped by 20. So obviously there were some reachable nodes, but lnd didn't try to connect to them aggressively enough.

We'll try to connect to them, but then exponentially back off if a connection fails. There's a flaw right now though, where we won't reset the back off once the peer is know to be "stable".

slush0 · 2018-08-01T09:49:14Z

I'm running on latest master (commit=ad7f87ef18af2e26ec80b9d6d5be7086515fea8a) and the problem supposedly fixed by #1628 didn't improve. 10+ minutes of uptime and still 0 peers.

halseth · 2018-08-01T11:00:10Z

Sounds like an issue with peer bootstrapping, which is not altered by that PR.

Roasbeef · 2018-08-01T20:21:02Z

Oh yeah peer bootstrapping is #1658.

Roasbeef · 2018-08-01T20:21:21Z

What do you see in the logs immediately after start up?

Roasbeef · 2018-08-01T20:24:10Z

If you check out #1658, is the start up snappier? Basically right now we'll block the server on finishing the handshake with each peer, this instead makes all that async so we can continue to seek out and accept connections from other peers.

Roasbeef · 2018-08-16T03:51:17Z

Now that #1658 this should be resolved. Let us know if yu're having any further issues in this area, and we'll re-open the issue! Thanks.

slush0 · 2018-09-11T16:58:45Z

I don't think this issue has been fully resolved. I was unable to try new version until today:

2018-09-11 16:35:23.934 [INF] LTND: Version 0.5.0-beta commit=1941353fb28755a170793e43595601d75c8f3dda

Now there're hundreds lines like (see timestamps) ended with RPC thread starting. So it already does something for one minute, even before touching P2P network :

2018-09-11 16:36:27.719 [INF] RPCS: RPC server listening on 127.0.0.1:10009
2018-09-11 16:36:27.722 [INF] RPCS: gRPC proxy started at 127.0.0.1:10010
2018-09-11 16:36:27.726 [INF] LTND: Waiting for chain backend to finish sync, start_height=540965
2018-09-11 16:36:27.735 [INF] LTND: Chain backend is fully synced (end_height=540965)!
2018-09-11 16:36:27.931 [INF] HSWC: Starting HTLC Switch

Then there's another flood of messages, we're almost 4 minutes of uptime:

2018-09-11 16:39:17.009 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.009 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.009 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription
2018-09-11 16:39:17.010 [INF] NTFN: New block epoch subscription

2018-09-11 16:44:22.473 [INF] CNCT: ChannelArbitrator(af69b7957595c970dcfb82a0acef9d0c211fe6b1bd0ed5c986ea324a6fbba2df:1): starting state=StateDefault
2018-09-11 16:44:22.796 [INF] CNCT: ChannelArbitrator(cc766b5f44058117c403b7fdb2150b21948d6fbbc66db81c899814c55a6eb027:0): starting state=StateDefault
2018-09-11 16:44:23.340 [INF] CNCT: ChannelArbitrator(694b1352952104fc9d6a56dddb6f56281419f845b263b84ef63fe6f80a41c460:0): starting state=StateDefault
2018-09-11 16:44:23.855 [INF] CNCT: ChannelArbitrator(3e9b0a57d2489d583328457ed7011f43ce5ee111b79c3b02dd79c4732cd57d41:1): starting state=StateDefault
2018-09-11 16:44:24.154 [INF] CNCT: ChannelArbitrator(2145f0b05c7b7a5adffff567bf897355e777d2545ad0f05d524486e4736cacb6:0): starting state=StateDefault
2018-09-11 16:44:24.797 [INF] CNCT: ChannelArbitrator(4e54c28f07cbb45022beb07ebdad5f3e70658f90136918e149024a3f1ba93978:0): starting state=StateWaitingFullResolution
2018-09-11 16:44:25.490 [INF] CNCT: ChannelArbitrator(4e54c28f07cbb45022beb07ebdad5f3e70658f90136918e149024a3f1ba93978:0): still awaiting contract resolution
2018-09-11 16:44:25.491 [INF] CNCT: ChannelArbitrator(4e54c28f07cbb45022beb07ebdad5f3e70658f90136918e149024a3f1ba93978:0): relaunching 1 contract resolvers
2018-09-11 16:44:25.491 [INF] CNCT: ChannelArbitrator(c08b00315a4a2eeef269af76f3d4beae05358361d18877e8d30309f8d84d6e64:1): starting state=StateDefault

Listening on 9735 starts after 10 minutes of startup, and immediately we're receiving inbound connections.

2018-09-11 16:45:03.658 [INF] CMGR: Server listening on [::]:9735
2018-09-11 16:45:03.658 [ERR] CMGR: Can't accept connection: unable to accept connection from 92.53.89.122:42322: EOF
2018-09-11 16:45:03.659 [ERR] CMGR: Can't accept connection: unable to accept connection from 92.53.89.122:45200: EOF
2018-09-11 16:45:03.659 [INF] SRVR: New inbound connection from 87.123.103.103:60428
2018-09-11 16:45:03.659 [INF] SRVR: Finalizing connection to 0268bfc4460681df5c1358429ac091094be05cf73aa6fbcda9fd4f3ccaf162c1c9, inbound=true

So I expect the issue is not related to P2P timeouts, because it happen even before the node tries to talk to them.

halseth · 2018-09-25T11:16:18Z

Possibly related to chain rescans at startup, which we are working on improving. Will keep the issue open to track this!

wpaulino · 2018-09-26T20:28:28Z

All the chain rescan are async now, so this would probably point to something else slowing down startup.

cfromknecht · 2018-09-27T01:14:45Z

Just seems like txindex is off. Performance improvements for nodes w/o are incoming

Roasbeef · 2018-10-26T00:21:37Z

Should be resolved by the upcoming revival of the height hint cache!

Roasbeef · 2018-11-01T02:07:09Z

Hi @slush0, we just merged the height hint cache? Can you give the current master a spin now? The first restart will still take some time, the second one however, should be much faster as the rescan information will now be cached.

Roasbeef · 2018-12-12T00:26:45Z

Also we recently merged #2211, which should speed up the initial rescan if a channel was closed while your node was offline.

wpaulino mentioned this issue Jul 30, 2018

chainntnfs: handle historical confs and spends asynchronously #1628

Merged

Roasbeef closed this as completed in #1628 Aug 1, 2018

halseth reopened this Aug 1, 2018

Roasbeef closed this as completed Aug 16, 2018

halseth reopened this Sep 25, 2018

Roasbeef closed this as completed Dec 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lnd take ages to start up #1648

lnd take ages to start up #1648

slush0 commented Jul 29, 2018

slush0 commented Jul 29, 2018

slush0 commented Jul 29, 2018

halfik commented Jul 29, 2018

Roasbeef commented Jul 30, 2018

Roasbeef commented Jul 30, 2018

slush0 commented Aug 1, 2018 •

edited

Loading

halseth commented Aug 1, 2018

Roasbeef commented Aug 1, 2018

Roasbeef commented Aug 1, 2018

Roasbeef commented Aug 1, 2018

Roasbeef commented Aug 16, 2018

slush0 commented Sep 11, 2018

halseth commented Sep 25, 2018

wpaulino commented Sep 26, 2018

cfromknecht commented Sep 27, 2018

Roasbeef commented Oct 26, 2018 •

edited

Loading

Roasbeef commented Nov 1, 2018

Roasbeef commented Dec 12, 2018

lnd take ages to start up #1648

lnd take ages to start up #1648

Comments

slush0 commented Jul 29, 2018

Background

Your environment

Steps to reproduce

Expected behaviour

Actual behaviour

slush0 commented Jul 29, 2018

slush0 commented Jul 29, 2018

halfik commented Jul 29, 2018

Roasbeef commented Jul 30, 2018

Roasbeef commented Jul 30, 2018

slush0 commented Aug 1, 2018 • edited Loading

halseth commented Aug 1, 2018

Roasbeef commented Aug 1, 2018

Roasbeef commented Aug 1, 2018

Roasbeef commented Aug 1, 2018

Roasbeef commented Aug 16, 2018

slush0 commented Sep 11, 2018

halseth commented Sep 25, 2018

wpaulino commented Sep 26, 2018

cfromknecht commented Sep 27, 2018

Roasbeef commented Oct 26, 2018 • edited Loading

Roasbeef commented Nov 1, 2018

Roasbeef commented Dec 12, 2018

slush0 commented Aug 1, 2018 •

edited

Loading

Roasbeef commented Oct 26, 2018 •

edited

Loading