[Merged by Bors] - Handle early blocks #2155
Conversation
This suggests that we're being too harsh on unknown-block attestations, but perhaps I'm missing something. Do you have thoughts on this @divagant-martian? :)
This is supposed to be handled by caching attestations. Each attestation with an unknown block should get cached and a request via RPC to find the block should take place. We should only penalize after the peer fails to send the block via the RPC. #635 - @divagant-martian was working on this
I agree it would help us here (and many other places), but for this specific scenario I think queuing blocks is a more efficient solution in the long term. No need for an RPC call to re-get the block and no need to use memory to cache attestations.
~~FYI I haven't seen this pass CI yet, but it works locally. Not sure why.~~
Looks good!
Great work on the tests particularly, I know how tricksy they were, but they're really valuable ❤️
Just a few typo corrections and then we're ready to merge I think!
```rust
if let Some(duration_till_slot) = slot_clock.duration_to_slot(block_slot) {
    // Check to ensure this won't over-fill the queue.
    if queued_block_roots.len() > MAXIMUM_QUEUED_BLOCKS {
```
Should this be `>=`?
Off by one! Nice catch 🙏
Co-authored-by: Michael Sproul <[email protected]>
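(For context, the off-by-one: with `>`, a queue already holding `MAXIMUM_QUEUED_BLOCKS` entries would still accept one more before the guard fired. A minimal sketch of the corrected check, with names as in the diff above and an illustrative early-return:)

```rust
if let Some(duration_till_slot) = slot_clock.duration_to_slot(block_slot) {
    // Check to ensure this won't over-fill the queue.
    //
    // `>=` rejects the insert once the cap is reached, whereas `>` only
    // rejected it once the cap was already exceeded by one.
    if queued_block_roots.len() >= MAXIMUM_QUEUED_BLOCKS {
        // Illustrative: drop the early block rather than queue it.
        return;
    }
    // ... queue the block to be imported at the start of its slot ...
}
```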
All comments addressed! Thank you!
👌
bors r+
## Issue Addressed

NA

## Problem this PR addresses

There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:

1. Gossip block 0xabc arrives ~200ms early.
    - It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
    - However, it is not imported to our database since the block is early.
2. Attestations for 0xabc arrive, but the block was not imported.
    - The peer that sent the attestation is down-voted.
    - Each unknown-block attestation causes a score loss of 1; the peer is banned at -100.
    - When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via RPC).

## Potential solutions

I can think of three solutions to this:

1. Wait for attestation-queuing (#635) to arrive and solve this.
    - Easy, but not an immediate fix.
    - Whilst this would work, I don't think it's a perfect solution for this particular issue; rather, (3) is better.
2. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`.
    - Easy.
    - ~~I have implemented this, for now.~~
3. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot.
    - More difficult.
    - Feels like the best solution; I will try to implement this.

**This PR takes approach (3).**

## Changes included

- Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them (a minimal sketch of this pattern follows below).
- Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event.
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic).
- Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attestations, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester.
- To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect.
- The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed *any* failures due to races on my machine or on CI yet.
- To assist with testing I allowed for directly setting the time on the `ManualSlotClock`.
- I gave the `beacon_processor::Worker` a `Toolbox` for two reasons: (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.
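To make the approach concrete, here is a minimal sketch (not the actual Lighthouse code) of a `DelayQueue`-backed block queue with an explicit `Stream` implementation, combining the first and third bullets above. It assumes tokio-util 0.6 with the `time` feature, matching the linked docs (where `poll_expired` yields a `Result`); `QueuedBlock`, the channel wiring, and the cap value are illustrative stand-ins.

```rust
use std::collections::HashSet;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;

use futures::Stream;
use tokio::sync::mpsc;
use tokio_util::time::DelayQueue;

/// Illustrative cap, mirroring the `MAXIMUM_QUEUED_BLOCKS` check above.
const MAXIMUM_QUEUED_BLOCKS: usize = 16;

/// Stand-in for a gossip-verified block that arrived early.
pub struct QueuedBlock {
    pub root: [u8; 32],
}

/// Holds early blocks until the start of their slot, then yields them.
pub struct BlockDelayQueue {
    /// Blocks wait here until their delay elapses.
    delay_queue: DelayQueue<QueuedBlock>,
    /// Roots currently queued, used to enforce the cap and deduplicate.
    queued_block_roots: HashSet<[u8; 32]>,
    /// Early blocks are sent in along with their duration-to-slot.
    early_blocks_rx: mpsc::UnboundedReceiver<(QueuedBlock, Duration)>,
}

impl Stream for BlockDelayQueue {
    /// A block whose slot has arrived and which is ready for import.
    type Item = QueuedBlock;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();

        // Drain newly-arrived early blocks into the delay queue.
        while let Poll::Ready(Some((block, duration_till_slot))) =
            this.early_blocks_rx.poll_recv(cx)
        {
            // Check to ensure this won't over-fill the queue.
            if this.queued_block_roots.len() >= MAXIMUM_QUEUED_BLOCKS {
                continue; // Drop the block rather than grow without bound.
            }
            if this.queued_block_roots.insert(block.root) {
                this.delay_queue.insert(block, duration_till_slot);
            }
        }

        // Yield any block whose duration-to-slot has now elapsed.
        match this.delay_queue.poll_expired(cx) {
            Poll::Ready(Some(Ok(expired))) => {
                let block = expired.into_inner();
                this.queued_block_roots.remove(&block.root);
                Poll::Ready(Some(block))
            }
            // An empty delay queue is not end-of-stream: more early
            // blocks may arrive later, so report `Pending` instead.
            _ => Poll::Pending,
        }
    }
}
```

Implementing `Stream` by hand like this makes the polling order explicit and deterministic, which is the debuggability argument made above for replacing `tokio::select!`.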
Pull request successfully merged into unstable. Build succeeded:
This reverts commit 8608dc1.