Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChainMetadataService will not clear out stale metadata #5037

Closed
SWvheerden opened this issue Dec 9, 2022 · 0 comments · Fixed by #5039
Closed

ChainMetadataService will not clear out stale metadata #5037

SWvheerden opened this issue Dec 9, 2022 · 0 comments · Fixed by #5039
Labels
A-base_node Area - The Tari base node executable and libraries C-bug Category - fixes a bug, typically associated with an issue.

Comments

@SWvheerden
Copy link
Collaborator

The service relies on a vec called peer_chain_metadata.
This vec stores all the received metadata pings and pong replies from nodes.
This vec is never cleared of stale information. A node will only be removed when it's disconnected or banned, if this does not happen, a node will keep and make logical decisions based on an old node's metadata forever.

We need to try and implement this vec as some TTL cache so that old information gets removed automatically.

I encountered the following behaviour on my node

  • a Node sent my node chain metadata that was much higher than my current
  • Before my node could sync to that node, that node banned my local node
  • My node kept trying to sync to that node as it had the highest pow chain stored in the chain metadata
  • Syncing kept failing with Connectivity Error: ConnectionFailed: Identity protocol failed: IoError: Broken pipe (os error 32)
  • My node was stuck at a much lower height, as it could never ban that node or sync to that node
@SWvheerden SWvheerden added the A-base_node Area - The Tari base node executable and libraries label Dec 9, 2022
@SWvheerden SWvheerden added the C-bug Category - fixes a bug, typically associated with an issue. label Dec 9, 2022
stringhandler pushed a commit that referenced this issue Dec 12, 2022
…5039)

Description
---
- Removes "caching" of peer metadata
- Send only one peer chain metadata at a time to the listening state
- increase chain metadata event channel to 20 (so that 20 of the last chain metadata received can be read by listening)

Motivation and Context
---
Fixes #5037 and #5030 (needs to be confirmed since this case is not easy to reproduce)

What I've observed is that the peer is banned twice. The first time, for 30 minutes (between 04:07:23.802123900 and 2022-12-09 04:41:26.365429300) then again permanently (bad metadata sig). It's clear in the logs that the pong is received before being banned, the node is banned (chain metadata service clears the peer's metadata) but the message is already in the domain pipeline so another one is received. The chain metadata service now keeps the peer's chain metadata and sends it to the listening state every time because it is not cleared.

TL;DR classic race condition.

Order of events:

- peer is banned for 30 minutes
- peer ban expires
- we receive a ping from the peer
- at almost the same time peer is banned (see logs in [this comment](#5030 (comment)))
- chain metadata service clears the peer from peer_chain_metadata
- the ping/pong is received (already in the pipeline from before the ban)
- the chain metadata is added to the vec, and is never cleared (because the peer is not banned again)
- the peer stays in the list despite being banned and NOT connected, so the header sync continues to try to connect to it


How Has This Been Tested?
---
Manually: rewind-blockchain and sync, enters sync mode timeously
Removed some tests that test removed functions
Added a new unit test for determine_sync_state
@stringhandler stringhandler moved this from Done to In Review in Tari Esme Testnet Dec 12, 2022
@stringhandler stringhandler moved this from In Review to Done in Tari Esme Testnet Feb 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-base_node Area - The Tari base node executable and libraries C-bug Category - fixes a bug, typically associated with an issue.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant