Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node fails to restart: failed to find any peer in table (interopnet and testnet) #2062

Closed
jimpick opened this issue Jun 18, 2020 · 13 comments
Closed

Comments

@jimpick
Copy link
Contributor

jimpick commented Jun 18, 2020

Describe the bug

I've been running this node on the interopnet for a day or so, and now it fails to restart:

2020-06-18T04:53:00.437Z	�[34mINFO�[0m	main	lotus/daemon.go:153	lotus repo: /home/ubuntu/.lotus
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-5294475db5237a2e83c3e52fd6c2b03859a1831d45ed08c4f35dbf9a803165a9.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-0-0-sha256_hasher-ecd683648512ab1765faa2a5f14bab48f676e633467f0aa8aad4b55dcb0652bb.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-3ea05428c9d11689f23529cde32fd30aabd50f7d2c93657c1d3650bca3e8ea9e.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-0-559e581f022bb4e4ec6e719e563bf0e026ad6de42e56c18714a2c692b1b88d7e.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-0170db1f394b35d995252228ee359194b13199d259380541dc529fb0099096b0.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-50c7368dea9593ed0989e70974d28024efa9d156d585b7eea1be22b2e753f331.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-7d739b8cf60f1b0709eeebee7730e297683552e4b69cab6984ec0285663c5781.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-2-b62098629d07946e9028127e70295ed996fe3ed25b0f9f88eb610a0ab4385a3c.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-0-0-sha256_hasher-6babf46ce344ae495d558e7770a585b2382d54f225af8ed0397b8be7c3fcd472.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-0-0-0cfb4f178bbb71cf2ecfcd42accce558b27199ab4fb59cb78f2483fe21ef36d9.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-0-0-sha256_hasher-032d3138d22506ec0082ed72b2dcba18df18477904e35bafee82b3793b06832f.vk is ok
2020-06-18T04:53:00.438Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-2-sha256_hasher-96f1b4a04c5c51e4759bbf224bbc2ef5a42c7100f16ec0637123f16a845ddfb2.vk is ok
2020-06-18T04:53:00.441Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-2-2627e4006b67f99cef990c0a47d5426cb7ab0a0ad58fc1061547bf2d28b09def.vk is ok
2020-06-18T04:53:00.441Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:128	Parameter file /var/tmp/filecoin-proof-parameters/v27-proof-of-spacetime-fallback-merkletree-poseidon_hasher-8-8-0-0377ded656c6f524f1618760bffe4e0a1c51d5a70c4509eedae8a27555733edc.vk is ok
2020-06-18T04:53:00.441Z	�[34mINFO�[0m	build	[email protected]/paramfetch.go:147	parameter and key-fetching complete
2020-06-18T04:53:00.492Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	All 24 tables opened in 35ms

2020-06-18T04:53:00.494Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Replaying file id: 1 at offset: 55150118

2020-06-18T04:53:00.494Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Replay took: 3.16¬µs

2020-06-18T04:53:00.503Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	All 0 tables opened in 0s

2020-06-18T04:53:00.505Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Replaying file id: 0 at offset: 0

2020-06-18T04:53:00.505Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Replay took: 2.563¬µs

2020-06-18T04:53:00.506Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	All 1 tables opened in 0s

2020-06-18T04:53:00.508Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Replaying file id: 0 at offset: 13106

2020-06-18T04:53:00.508Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Replay took: 1.809¬µs

2020-06-18T04:53:00.512Z	�[34mINFO�[0m	dht/RtRefreshManager	rtrefresh/rt_refresh_manager.go:265	starting refreshing cpl 0 with key � ��6k���������������������������� (routing table size was 0)
2020-06-18T04:53:00.512Z	�[33mWARN�[0m	dht/RtRefreshManager	rtrefresh/rt_refresh_manager.go:128	failed when refreshing routing table2 errors occurred:
	* failed to query for self, err=failed to run refresh DHT query for key=�$��� Ï—b<‘`W*–‰~˜FdÃ≤√�ö˙Nk9^^�”�43Θ, err=failed to find any peer in table
	* failed to refresh cpl=0, err=failed to run refresh DHT query for key=� ��6k����������������������������, err=failed to find any peer in table


2020-06-18T04:53:00.512Z	�[34mINFO�[0m	p2pnode	lp2p/addrs.go:114	Swarm listening at: [/ip4/127.0.0.1/tcp/40065 /ip4/172.31.10.49/tcp/40065 /ip6/::1/tcp/43247]
2020-06-18T04:53:00.513Z	�[34mINFO�[0m	dht/RtRefreshManager	rtrefresh/rt_refresh_manager.go:265	starting refreshing cpl 0 with key � ��¬$���������������������������� (routing table size was 0)
2020-06-18T04:53:00.513Z	�[33mWARN�[0m	dht/RtRefreshManager	rtrefresh/rt_refresh_manager.go:191	failed when refreshing routing table	{"error": "2 errors occurred:\n\t* failed to query for self, err=failed to run refresh DHT query for key=\u0000$\u0008\u0001\u0012 \ufffd\ufffdb<\ufffd`W*\ufffd\ufffd~\ufffdFdÃ≤\ufffd\u0003\ufffd\ufffdNk9^^\u0006\ufffd\u001f43\ufffd\ufffd, err=failed to find any peer in table\n\t* failed to refresh cpl=0, err=failed to run refresh DHT query for key=\u0012 \u0000\u0000\ufffd$\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000, err=failed to find any peer in table\n\n"}
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	sub	sub/incoming.go:390	error from message subscription: context canceled
2020-06-18T04:53:00.938Z	�[34mINFO�[0m	pubsub	[email protected]/pubsub.go:578	pubsub processloop shutting down
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	sub	sub/incoming.go:392	quitting HandleIncomingMessages loop
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	pubsub	[email protected]/comm.go:74	opening new stream to peer: context canceledQmTd6UvR47vUidRNZ1ZKXHrAFhqTJAD27rKL9XYghEKgKX
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	sub	sub/incoming.go:47	quitting HandleIncomingBlocks loop
2020-06-18T04:53:00.938Z	�[34mINFO�[0m	chain	chain/sync_manager.go:277	sync scheduler shutting down
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	blockstore	[email protected]/log.go:175	Cache rebuild closed by context finishing: context canceled
2020-06-18T04:53:00.938Z	�[34mINFO�[0m	basichost	basic/natmgr.go:96	DiscoverNAT error:context canceled
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	net/identify	identify/peer_loop.go:89	failed to send Identify Delta	{"peer": "QmTd6UvR47vUidRNZ1ZKXHrAFhqTJAD27rKL9XYghEKgKX", "error": "failed to send push on delta message: failed to open push stream: context canceled"}
2020-06-18T04:53:00.938Z	�[33mWARN�[0m	net/identify	identify/peer_loop.go:89	failed to send Identify Delta	{"peer": "QmTd6UvR47vUidRNZ1ZKXHrAFhqTJAD27rKL9XYghEKgKX", "error": "failed to send push on delta message: failed to open push stream: context canceled"}
2020-06-18T04:53:00.939Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Got compaction priority: {level:0 score:1.73 dropPrefix:[]}
2020-06-18T04:53:00.941Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Got compaction priority: {level:0 score:1.73 dropPrefix:[]}
2020-06-18T04:53:00.942Z	�[34mINFO�[0m	badger	[email protected]/logger.go:46	Got compaction priority: {level:0 score:1.73 dropPrefix:[]}
2020-06-18T04:53:00.947Z	�[33mWARN�[0m	main	lotus/main.go:81	initializing node:
    main.glob..func2
        /home/ubuntu/lotus/cmd/lotus/daemon.go:236
  - starting node:
    github.com/filecoin-project/lotus/node.New
        /home/ubuntu/lotus/node/builder.go:497
  - failed to find any peer in table

Version (run lotus --version):

$ ./lotus --version
lotus version 0.4.0+git.bead3bf4.dirty
@jimpick
Copy link
Contributor Author

jimpick commented Jun 18, 2020

Looks similar to #2021

@christest123456
Copy link

Have you solved the problem? if yes, please share it, thanks

@jimpick
Copy link
Contributor Author

jimpick commented Jun 23, 2020

It happened again, this time on testnet...

2020-06-23T06:26:05.554Z        WARN    main    lotus/main.go:81        initializing node:
    main.glob..func2
        /home/ubuntu/lotus/cmd/lotus/daemon.go:236
  - starting node:
    github.com/filecoin-project/lotus/node.New
        /home/ubuntu/lotus/node/builder.go:497
  - failed to find any peer in table

commit ffa7be8

I'm going to leave the machine as-is until somebody on the team can inspect it.

@jimpick
Copy link
Contributor Author

jimpick commented Jun 23, 2020

I'll investigate a bit more in a few hours. I've also made an EBS snapshot.

@jimpick
Copy link
Contributor Author

jimpick commented Jun 24, 2020

I set up a new node to take the place of the one that had this happen, so I could investigate. But it just happened on the new node too. I suspect that because I'm using the node for doing a lot of DHT lookups as I scan all the miners, that's triggering something bad.

@christest123456
Copy link

for reference: ipfs/kubo#1941

@jimpick
Copy link
Contributor Author

jimpick commented Jun 24, 2020

I tried to figure out where in the code it was refusing to start up. Here's a quick workaround to get my node to start up again:

Screenshot 2020-06-23 23 48 20

I patched it into here:

https://github.com/filecoin-project/go-fil-markets/blob/e4d94036546be42d02cdc591f665d733bf4e4260/storagemarket/impl/client.go#L94

At first glance, it appears that it might be trying to make DHT queries before the DHT is ready.

@shmil11
Copy link

shmil11 commented Jun 24, 2020

It was executed as it was but failed.

@jimpick jimpick changed the title Node fails to restart: failed to find any peer in table (interopnet) Node fails to restart: failed to find any peer in table (interopnet and testnet) Jun 25, 2020
@jsign
Copy link
Contributor

jsign commented Jun 29, 2020

I can confirm this is happening. One of my Lotus nodes fails to restart indefinitely.

@jimpick
Copy link
Contributor Author

jimpick commented Jun 30, 2020

If the DHT is reporting no peers in the table, try lotus net peers or lotus-storage-miner net peers to see if it has any peers ... if not, try to add some with lotus net connect ... or lotus-storage-miner net connect .... Of course, if the node won't start, you can't do that. The quick patch against go-fil-markets above might help in that case... but this bug really should be fixed.

@jimpick
Copy link
Contributor Author

jimpick commented Jun 30, 2020

I think what is possibly happening is that the local list of peers in the DHT is getting zeroed out somehow, and then when the node restarts, the go-fil-markets module is trying to restart deals, but the DHT hasn’t been repopulated yet, and go-fil-markets can’t deal with the error so it just exits.

@arajasek
Copy link
Contributor

arajasek commented Jul 2, 2020

@jimpick Yeah, that sounds about right. In general, restartDeals shouldn't cause the node to fail to startup if it fails to get a deal going again (it could just make a note to try again later or something).

That certainly seems to be what's caused this node to go kaput, I wonder if the miner for the deal in question is just unavailable.

@jimpick
Copy link
Contributor Author

jimpick commented Jul 2, 2020

My node is starting again ... it appears #2239 is a working fix, so I'll close this now.

@jimpick jimpick closed this as completed Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants