
[Merged by Bors] - Handle early blocks #2155

Closed · wants to merge 39 commits

Conversation

@paulhauner (Member) commented Jan 14, 2021

Issue Addressed

NA

Problem this PR addresses

There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:

  1. Gossip block 0xabc arrives ~200ms early.
    • It is propagated across the network, since it falls within MAXIMUM_GOSSIP_CLOCK_DISPARITY (https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
    • However, it is not imported into our database, since the block is early.
  2. Attestations for 0xabc arrive, but the block was not imported.
    • The peer that sent each attestation is down-voted.
      • Each unknown-block attestation causes a score loss of 1, and the peer is banned at -100.
      • When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missing block can be obtained via RPC).
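To see how fast the ban happens, here is a minimal sketch of the arithmetic. The constants are taken from the description above, not read from the Lighthouse source, and the function name is illustrative:

```rust
// Sketch of the peer-scoring arithmetic described above; constants come
// from this PR's description, not from the Lighthouse codebase.
const UNKNOWN_BLOCK_ATTESTATION_PENALTY: i64 = 1;
const BAN_THRESHOLD: i64 = -100;

/// Number of unknown-block attestations needed to drive a peer's score
/// from `start_score` down to the ban threshold.
fn attestations_until_ban(start_score: i64) -> i64 {
    (start_score - BAN_THRESHOLD) / UNKNOWN_BLOCK_ATTESTATION_PENALTY
}

fn main() {
    // A fresh peer (score 0) is banned after only 100 unknown-block
    // attestations; a busy attestation subnet can deliver that many
    // in well under a slot.
    assert_eq!(attestations_until_ban(0), 100);
}
```

Since an attestation subnet can carry hundreds of attestations per slot, the ban lands long before an RPC round-trip could fetch the missing block.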

Potential solutions

I can think of three solutions to this:

  1. Wait for attestation-queuing (Maintain Attestations who reference unknown blocks #635) to arrive and solve this.
    • Easy.
    • Not an immediate fix.
    • Whilst this would work, I don't think it's a perfect solution for this particular issue; (3) is better.
  2. Allow importing blocks with a tolerance of MAXIMUM_GOSSIP_CLOCK_DISPARITY.
    • Easy.
    • ~~I have implemented this, for now.~~
  3. If a block is verified for gossip propagation (i.e., signature verified) and it's within MAXIMUM_GOSSIP_CLOCK_DISPARITY, then queue it to be processed at the start of the appropriate slot.
    • More difficult.
    • Feels like the best solution, so I will try to implement this.

This PR takes approach (3).
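Approach (3) reduces to a three-way decision at gossip-verification time. A minimal sketch of that decision, assuming hypothetical names (`BlockDisposition`, `disposition`) rather than Lighthouse's actual API, and the spec's 500ms value for the disparity:

```rust
use std::time::Duration;

// Hypothetical sketch of approach (3); names are illustrative, not
// Lighthouse's real types. 500ms is the spec's suggested disparity.
const MAXIMUM_GOSSIP_CLOCK_DISPARITY: Duration = Duration::from_millis(500);

#[derive(Debug, PartialEq)]
enum BlockDisposition {
    /// The slot has started: import immediately.
    ImportNow,
    /// Early, but within tolerance: queue until the slot starts.
    QueueUntilSlot(Duration),
    /// Too far in the future: gossip would not have propagated it anyway.
    TooEarly,
}

/// `time_until_slot` is how long until the block's slot starts
/// (`None` once the slot has already started).
fn disposition(time_until_slot: Option<Duration>) -> BlockDisposition {
    match time_until_slot {
        None => BlockDisposition::ImportNow,
        Some(d) if d <= MAXIMUM_GOSSIP_CLOCK_DISPARITY => {
            BlockDisposition::QueueUntilSlot(d)
        }
        Some(_) => BlockDisposition::TooEarly,
    }
}

fn main() {
    // The ~200ms-early block from the problem description is queued
    // for ~200ms rather than dropped.
    let d = disposition(Some(Duration::from_millis(200)));
    assert_eq!(d, BlockDisposition::QueueUntilSlot(Duration::from_millis(200)));
}
```

The key property is the middle arm: a signature-verified early block is held, not discarded, so attestations referencing it no longer look like unknown-block attestations once the slot starts.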

Changes included

  • Implement the block_delay_queue, based upon a DelayQueue (https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html), which can store blocks until it's time to import them.
  • Add a new DelayedImportBlock variant to the beacon_processor::WorkEvent enum to handle this new event.
  • In the BeaconProcessor, refactor a tokio::select! to a struct with an explicit Stream implementation. I experienced some issues with tokio::select! in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that tokio::select! randomly chooses futures to poll, whereas now we're deterministic).
  • Add a testing framework to the beacon_processor module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attestations, slashings, exits) since it was super easy to copy-paste the code from the http_api tester.
    • To implement these tests I added the concept of an optional work_journal_tx to the BeaconProcessor which will emit a log of events. I used this in the tests to ensure that things were happening as I expected.
    • The tests are a little racy, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise; I haven't observed any failures due to races on my machine or on CI yet.
    • To assist with testing I allowed for directly setting the time on the ManualSlotClock.
  • I gave the beacon_processor::Worker a Toolbox for two reasons: (a) it avoids changing tons of function signatures when you want to pass a new object to the worker, and (b) it seemed cute.
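The select!-to-Stream refactor above is essentially about deterministic poll order. Stripped of the async machinery, the idea can be sketched synchronously; everything here is an illustrative stand-in, not the real BeaconProcessor internals:

```rust
// Illustrative only: a synchronous stand-in for the explicit `Stream`
// implementation. `tokio::select!` polls its branches in a random order,
// whereas an explicit poll method fixes the order once and for all.
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum Event {
    DelayedBlockReady(&'static str),
    Work(&'static str),
}

struct EventSources {
    ready_blocks: VecDeque<&'static str>,
    work_events: VecDeque<&'static str>,
}

impl EventSources {
    /// Deterministic priority: blocks whose slot has arrived are drained
    /// before ordinary work events, every time.
    fn next_event(&mut self) -> Option<Event> {
        if let Some(root) = self.ready_blocks.pop_front() {
            return Some(Event::DelayedBlockReady(root));
        }
        self.work_events.pop_front().map(Event::Work)
    }
}

fn main() {
    let mut sources = EventSources {
        ready_blocks: VecDeque::from(["0xabc"]),
        work_events: VecDeque::from(["attestation"]),
    };
    // The queued block is always handled before the pending work event.
    assert_eq!(sources.next_event(), Some(Event::DelayedBlockReady("0xabc")));
    assert_eq!(sources.next_event(), Some(Event::Work("attestation")));
    assert_eq!(sources.next_event(), None);
}
```

A fixed order like this is also much easier to assert on in tests than a macro that may poll branches in any order, which is presumably part of why the explicit form was easier to debug.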

@paulhauner paulhauner added the work-in-progress PR is a work-in-progress label Jan 14, 2021
@paulhauner paulhauner changed the base branch from stable to unstable January 14, 2021 03:59
@paulhauner (Member Author)

> Each unknown-block attestation causes a score loss of 1, the peer is banned at -100.
>
> When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).

This suggests that we're being too harsh on unknown-block attestations, but perhaps I'm missing something. Do you have thoughts on this, @divagant-martian? :)

@AgeManning (Member)

This is supposed to be handled by caching attestations.

Each attestation with an unknown block should get cached, and a request via RPC to find the block should take place. We should only penalize after the peer fails to send the block via RPC (#635; @divagant-martian was working on this).

@paulhauner (Member Author)

> This is supposed to be handled by caching attestations.

I agree it would help us here (and many other places), but for this specific scenario I think queuing blocks is a more efficient solution in the long term: no need for an RPC call to re-fetch the block, and no need to use memory to cache attestations.

@paulhauner paulhauner self-assigned this Jan 19, 2021
@paulhauner paulhauner marked this pull request as ready for review February 23, 2021 07:38
@paulhauner paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Feb 23, 2021
@paulhauner (Member Author) commented Feb 23, 2021

~~FYI I haven't seen this pass CI yet, but it works locally.~~

Consider that statement retracted.

@michaelsproul (Member) left a comment

Looks good!

Great work on the tests particularly, I know how tricksy they were, but they're really valuable ❤️

Just a few typo corrections and then we're ready to merge I think!


if let Some(duration_till_slot) = slot_clock.duration_to_slot(block_slot) {
// Check to ensure this won't over-fill the queue.
if queued_block_roots.len() > MAXIMUM_QUEUED_BLOCKS {
Member

Should this be >=?

Member Author

Off by one! Nice catch 🙏
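To make the fix concrete, here is a minimal sketch of the corrected guard. The constant's value is arbitrary; the excerpt above doesn't show Lighthouse's real MAXIMUM_QUEUED_BLOCKS, and `can_queue` is a hypothetical helper:

```rust
// Demonstrates the off-by-one the reviewer caught: with `>` the queue
// admits one block beyond the limit; with `>=` it is capped exactly.
// The constant is arbitrary here, not Lighthouse's real value.
const MAXIMUM_QUEUED_BLOCKS: usize = 4;

/// Returns whether another block may be queued (the corrected `>=` guard).
fn can_queue(queued_block_roots_len: usize) -> bool {
    !(queued_block_roots_len >= MAXIMUM_QUEUED_BLOCKS)
}

fn main() {
    assert!(can_queue(3)); // one slot left: accept
    assert!(!can_queue(4)); // already full: the early block must be dropped
}
```

With the original `>`, `can_queue(4)` would have returned true and the queue would have grown to five entries before the guard fired.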

@paulhauner (Member Author) commented Feb 24, 2021

All comments addressed! Thank you!

@michaelsproul (Member) left a comment

👌

@paulhauner (Member Author)

bors r+

@paulhauner paulhauner added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Feb 24, 2021
bors bot pushed a commit that referenced this pull request Feb 24, 2021
@bors bors bot changed the title Handle early blocks [Merged by Bors] - Handle early blocks Feb 24, 2021
@bors bors bot closed this Feb 24, 2021
michaelsproul pushed a commit that referenced this pull request Mar 10, 2021
michaelsproul added a commit that referenced this pull request Mar 10, 2021
@paulhauner paulhauner deleted the block-clock-disp branch March 17, 2021 06:49