
[Merged by Bors] - Handle early blocks #2155

Closed · wants to merge 39 commits

Conversation

@paulhauner (Member) commented Jan 14, 2021

Issue Addressed

NA

Problem this PR addresses

There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:

  1. Gossip block 0xabc arrives ~200ms early.
    • It is propagated across the network, since it falls within MAXIMUM_GOSSIP_CLOCK_DISPARITY (https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
    • However, it is not imported into our database, since the block is early.
  2. Attestations for 0xabc arrive, but the block was not imported.
    • The peer that sent each attestation is down-voted.
      • Each unknown-block attestation causes a score loss of 1, and the peer is banned at -100.
      • When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missing block can be obtained via RPC).
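To see how fast the ban happens, here is a minimal sketch of the arithmetic. The constants are taken from the description above, not read from the Lighthouse source, and the function name is illustrative:

```rust
// Sketch of the peer-scoring arithmetic described above; constants come
// from this PR's description, not from the Lighthouse codebase.
const UNKNOWN_BLOCK_ATTESTATION_PENALTY: i64 = 1;
const BAN_THRESHOLD: i64 = -100;

/// Number of unknown-block attestations needed to drive a peer's score
/// from `start_score` down to the ban threshold.
fn attestations_until_ban(start_score: i64) -> i64 {
    (start_score - BAN_THRESHOLD) / UNKNOWN_BLOCK_ATTESTATION_PENALTY
}

fn main() {
    // A fresh peer (score 0) is banned after only 100 unknown-block
    // attestations; a busy attestation subnet can deliver that many
    // in well under a slot.
    assert_eq!(attestations_until_ban(0), 100);
}
```

Since an attestation subnet can carry hundreds of attestations per slot, the ban lands long before an RPC round-trip could fetch the missing block.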

Potential solutions

I can think of three solutions to this:

  1. Wait for attestation-queuing (Maintain Attestations who reference unknown blocks #635) to arrive and solve this.
    • Easy.
    • Not an immediate fix.
    • Whilst this would work, I don't think it's a perfect solution for this particular issue; (3) is better.
  2. Allow importing blocks with a tolerance of MAXIMUM_GOSSIP_CLOCK_DISPARITY.
    • Easy.
    • ~~I have implemented this, for now.~~
  3. If a block is verified for gossip propagation (i.e., signature verified) and it's within MAXIMUM_GOSSIP_CLOCK_DISPARITY, then queue it to be processed at the start of the appropriate slot.
    • More difficult.
    • Feels like the best solution, so I will try to implement this.

This PR takes approach (3).
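Approach (3) reduces to a three-way decision at gossip-verification time. A minimal sketch of that decision, assuming hypothetical names (`BlockDisposition`, `disposition`) rather than Lighthouse's actual API, and the spec's 500ms value for the disparity:

```rust
use std::time::Duration;

// Hypothetical sketch of approach (3); names are illustrative, not
// Lighthouse's real types. 500ms is the spec's suggested disparity.
const MAXIMUM_GOSSIP_CLOCK_DISPARITY: Duration = Duration::from_millis(500);

#[derive(Debug, PartialEq)]
enum BlockDisposition {
    /// The slot has started: import immediately.
    ImportNow,
    /// Early, but within tolerance: queue until the slot starts.
    QueueUntilSlot(Duration),
    /// Too far in the future: gossip would not have propagated it anyway.
    TooEarly,
}

/// `time_until_slot` is how long until the block's slot starts
/// (`None` once the slot has already started).
fn disposition(time_until_slot: Option<Duration>) -> BlockDisposition {
    match time_until_slot {
        None => BlockDisposition::ImportNow,
        Some(d) if d <= MAXIMUM_GOSSIP_CLOCK_DISPARITY => {
            BlockDisposition::QueueUntilSlot(d)
        }
        Some(_) => BlockDisposition::TooEarly,
    }
}

fn main() {
    // The ~200ms-early block from the problem description is queued
    // for ~200ms rather than dropped.
    let d = disposition(Some(Duration::from_millis(200)));
    assert_eq!(d, BlockDisposition::QueueUntilSlot(Duration::from_millis(200)));
}
```

The key property is the middle arm: a signature-verified early block is held, not discarded, so attestations referencing it no longer look like unknown-block attestations once the slot starts.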

Changes included

  • Implement the block_delay_queue, based upon a DelayQueue (https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html), which can store blocks until it's time to import them.
  • Add a new DelayedImportBlock variant to the beacon_processor::WorkEvent enum to handle this new event.
  • In the BeaconProcessor, refactor a tokio::select! to a struct with an explicit Stream implementation. I experienced some issues with tokio::select! in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that tokio::select! randomly chooses futures to poll, whereas now we're deterministic).
  • Add a testing framework to the beacon_processor module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attestations, slashings, exits) since it was super easy to copy-paste the code from the http_api tester.
    • To implement these tests I added the concept of an optional work_journal_tx to the BeaconProcessor which will emit a log of events. I used this in the tests to ensure that things were happening as I expected.
    • The tests are a little racy, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise; I haven't observed any failures due to races on my machine or on CI yet.
    • To assist with testing I allowed for directly setting the time on the ManualSlotClock.
  • I gave the beacon_processor::Worker a Toolbox for two reasons: (a) it avoids changing tons of function signatures when you want to pass a new object to the worker, and (b) it seemed cute.
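The select!-to-Stream refactor above is essentially about deterministic poll order. Stripped of the async machinery, the idea can be sketched synchronously; everything here is an illustrative stand-in, not the real BeaconProcessor internals:

```rust
// Illustrative only: a synchronous stand-in for the explicit `Stream`
// implementation. `tokio::select!` polls its branches in a random order,
// whereas an explicit poll method fixes the order once and for all.
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum Event {
    DelayedBlockReady(&'static str),
    Work(&'static str),
}

struct EventSources {
    ready_blocks: VecDeque<&'static str>,
    work_events: VecDeque<&'static str>,
}

impl EventSources {
    /// Deterministic priority: blocks whose slot has arrived are drained
    /// before ordinary work events, every time.
    fn next_event(&mut self) -> Option<Event> {
        if let Some(root) = self.ready_blocks.pop_front() {
            return Some(Event::DelayedBlockReady(root));
        }
        self.work_events.pop_front().map(Event::Work)
    }
}

fn main() {
    let mut sources = EventSources {
        ready_blocks: VecDeque::from(["0xabc"]),
        work_events: VecDeque::from(["attestation"]),
    };
    // The queued block is always handled before the pending work event.
    assert_eq!(sources.next_event(), Some(Event::DelayedBlockReady("0xabc")));
    assert_eq!(sources.next_event(), Some(Event::Work("attestation")));
    assert_eq!(sources.next_event(), None);
}
```

A fixed order like this is also much easier to assert on in tests than a macro that may poll branches in any order, which is presumably part of why the explicit form was easier to debug.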

@paulhauner paulhauner added the work-in-progress PR is a work-in-progress label Jan 14, 2021
@paulhauner paulhauner changed the base branch from stable to unstable January 14, 2021 03:59
@paulhauner (Member Author)

> Each unknown-block attestation causes a score loss of 1, the peer is banned at -100.
>
> When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).

This suggests that we're being too harsh on unknown-block attestations, but perhaps I'm missing something. Do you have thoughts on this, @divagant-martian? :)

@AgeManning (Member)

This is supposed to be handled by caching attestations.

Each attestation with an unknown block should get cached, and a request via RPC to find the block should take place. We should only penalize after the peer fails to send the block via RPC (#635; @divagant-martian was working on this).

@paulhauner (Member Author)

> This is supposed to be handled by caching attestations.

I agree it would help us here (and many other places), but for this specific scenario I think queuing blocks is a more efficient solution in the long term: no need for an RPC call to re-fetch the block, and no need to use memory to cache attestations.

@paulhauner paulhauner self-assigned this Jan 19, 2021
@paulhauner paulhauner marked this pull request as ready for review February 23, 2021 07:38
@paulhauner paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Feb 23, 2021
@paulhauner (Member Author) commented Feb 23, 2021

~~FYI I haven't seen this pass CI yet, but it works locally.~~

Consider that statement retracted.

@michaelsproul (Member) left a comment

Looks good!

Great work on the tests particularly, I know how tricksy they were, but they're really valuable ❤️

Just a few typo corrections and then we're ready to merge I think!


if let Some(duration_till_slot) = slot_clock.duration_to_slot(block_slot) {
// Check to ensure this won't over-fill the queue.
if queued_block_roots.len() > MAXIMUM_QUEUED_BLOCKS {
Member

Should this be >=?

Member Author

Off by one! Nice catch 🙏
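To make the fix concrete, here is a minimal sketch of the corrected guard. The constant's value is arbitrary; the excerpt above doesn't show Lighthouse's real MAXIMUM_QUEUED_BLOCKS, and `can_queue` is a hypothetical helper:

```rust
// Demonstrates the off-by-one the reviewer caught: with `>` the queue
// admits one block beyond the limit; with `>=` it is capped exactly.
// The constant is arbitrary here, not Lighthouse's real value.
const MAXIMUM_QUEUED_BLOCKS: usize = 4;

/// Returns whether another block may be queued (the corrected `>=` guard).
fn can_queue(queued_block_roots_len: usize) -> bool {
    !(queued_block_roots_len >= MAXIMUM_QUEUED_BLOCKS)
}

fn main() {
    assert!(can_queue(3)); // one slot left: accept
    assert!(!can_queue(4)); // already full: the early block must be dropped
}
```

With the original `>`, `can_queue(4)` would have returned true and the queue would have grown to five entries before the guard fired.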

@paulhauner (Member Author) commented Feb 24, 2021

All comments addressed! Thank you!

@michaelsproul (Member) left a comment

👌

@paulhauner (Member Author)

bors r+

@paulhauner paulhauner added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Feb 24, 2021
bors bot pushed a commit that referenced this pull request Feb 24, 2021
@bors bors bot changed the title Handle early blocks [Merged by Bors] - Handle early blocks Feb 24, 2021
@bors bors bot closed this Feb 24, 2021
michaelsproul pushed a commit that referenced this pull request Mar 10, 2021
michaelsproul added a commit that referenced this pull request Mar 10, 2021
@paulhauner paulhauner deleted the block-clock-disp branch March 17, 2021 06:49