
Pause sync when execution layer is offline #3094

Closed · pawanjay176 wants to merge 38 commits

pawanjay176 (Member)

Issue Addressed

Resolves #3032

Proposed Changes

This PR fixes 2 issues with post-merge sync:

  1. Execution engine offline errors were treated like other BeaconChainErrors in range sync, which caused peers to get downscored and kicked through potentially no fault of their own. This PR pauses RangeSync while the EE is offline and only resumes it once the EE is back online. It does no leftover block processing until the EE is back online, and hence doesn't disconnect any connected peers. It polls the EE every 5 seconds to check whether it's online and resumes the chains once it is (see the sketch after this list).

  2. If the EE goes offline while the BN is fully synced, processing parent chain lookups returns EE errors, which causes the parent chain to be added to the failed_chains list. Any subsequent request for the same parent chain is then ignored. As a result, even a single-slot EE outage causes a full epoch of beacon chain outage, because failed chains aren't processed again until range sync kicks in after an epoch. To fix this, we don't add the parent lookup to failed_chains on EE offline errors.
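For illustration, a minimal sketch of the 5-second polling behaviour described in point 1, assuming tokio; the free-standing function and the closure names (engine_is_online, resume_chains) are hypothetical stand-ins, not this PR's actual API.

```rust
use std::time::Duration;

/// Poll the execution engine until it is reachable again, then resume range sync.
/// Sketch only: the real check and resume logic live inside the sync manager.
async fn wait_for_execution_engine<F, G>(mut engine_is_online: F, mut resume_chains: G)
where
    F: FnMut() -> bool,
    G: FnMut(),
{
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        if engine_is_online() {
            // The EE is reachable again: resume the paused chains and stop polling.
            resume_chains();
            return;
        }
        // The EE is still offline: keep range sync paused and check again in 5 seconds.
    }
}
```

The key property is that nothing is processed and no peers are penalised while the loop is waiting; the only side effect is the eventual resume call.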

Additional info

Currently rebased on #3036 until it gets merged to unstable.

@paulhauner added the work-in-progress and bellatrix labels on Mar 21, 2022
@pawanjay176 force-pushed the merge-sync branch 2 times, most recently from 194426c to 1f45341 on March 30, 2022
@divagant-martian (Collaborator) left a comment:

So far I've only skimmed the range part and I think there are some problems. I made a test in case you want to include it or check what I did: divagant-martian@73dcdca

beacon_node/network/src/sync/manager.rs (outdated review thread, resolved)
beacon_node/network/src/sync/range_sync/batch.rs (outdated review thread, resolved)
beacon_node/network/src/sync/range_sync/range.rs (outdated review thread, resolved)
@divagant-martian (Collaborator) commented on Apr 6, 2022:

I've now skimmed the parent lookup part and have a couple of questions.
I wasn't really able to tell the difference between waiting_blocks and waiting_execution, since both seem to hold a copy of the blocks currently being processed.
Also, until now we would only send a parent chain for processing when a block was already known, but here we seem to do it much more often, for example when a single block lookup matches the chain at its first block. We need to match the chain against the db at the last block, so I'm not sure why we do this.

@divagant-martian added the under-review label on May 22, 2022
Comment on lines 109 to 110
    /// The batch is waiting for the execution layer to resume validation.
    WaitingOnExecution,
Collaborator:

I still don't know why clippy doesn't pick this up, but BatchState::AwaitingExecution is not created anywhere. When we get stall_execution = true we use BatchState::AwaitingDownload without counting an additional failed attempt. I think this is the way to go, and the new state isn't necessary.

pawanjay176 (Member, Author):

Sorry, I forgot to remove this one. I think clippy didn't pick it up because the enum variant was checked in match statements.
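For illustration, a rough sketch of the retry path described in the review comment above: a processing failure caused only by the EE being offline resets the batch to AwaitingDownload without counting a failed attempt. The stall_execution flag, the attempt counter, and the retry threshold are assumptions for the sketch, not this PR's exact code.

```rust
/// Simplified subset of the batch states discussed in this thread.
enum BatchState {
    AwaitingDownload,
    Failed,
}

struct Batch {
    state: BatchState,
    failed_processing_attempts: u8,
}

impl Batch {
    /// Called when processing a downloaded batch fails.
    fn processing_failed(&mut self, stall_execution: bool) {
        if !stall_execution {
            // A genuine failure: count it so repeatedly bad batches eventually fail.
            self.failed_processing_attempts += 1;
        }
        // Hypothetical threshold of 3 retries for the sketch.
        self.state = if self.failed_processing_attempts > 3 {
            BatchState::Failed
        } else {
            // EE-offline failures (and early genuine failures) go back to the
            // download queue without penalising the batch or its peers.
            BatchState::AwaitingDownload
        };
    }
}
```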

Comment on lines 766 to 781
pub fn execution_stalled(&mut self) {
    self.state = ChainSyncingState::ExecutionStalled;
}

pub fn execution_resumed(
    &mut self,
    network: &mut SyncNetworkContext<T::EthSpec>,
) -> ProcessingResult {
    if let ChainSyncingState::ExecutionStalled = self.state {
        self.state = ChainSyncingState::Syncing;
        return self.request_batches(network);
    }
    Err(RemoveChain::WrongBatchState(
        "Invalid batch state".to_string(),
    ))
}
@divagant-martian (Collaborator) commented on May 24, 2022:

I didn't find any references to the execution_stalled function. I would expect it to also be called for the other chains, because the failure is identified at the chain level but right now isn't propagated up. So if one chain gets the error, it stalls, and on resume all chains are resumed; the chains that were not stalled find themselves in a wrong state and ask to be removed. I think this is hard to notice in tests / masked, since sync is relatively self-resilient.

pawanjay176 (Member, Author):

Nice catch. My thinking was that each chain would get stalled once it got an offline execution error. But I didn't consider that a chain might go into an invalid state if we resume before it has gone into the stalled state.

@@ -71,6 +74,7 @@ impl SyncState {
    SyncState::BackFillSyncing { .. } => false,
    SyncState::Synced => false,
    SyncState::Stalled => false,
    SyncState::WaitingOnExecution => false,
Collaborator:

This affects some things, like the work we do on the beacon processor. Here we are known to be not synced, but we're also not sending batches for processing. I'm not sure of the right value here @paulhauner.

@divagant-martian (Collaborator) left a comment:

I still haven't finished checking everything, but will continue as we progress.
Overall my main concern is how we tell the rest (other chains in range, backfill, parent lookups) that they should stop processing. Maybe I'm overlooking something, but I would guess a shared state would be more reliable and easier to handle than having a new state in each form of sync we have.

Ok(())
}
SyncState::Stalled | SyncState::WaitingOnExecution => Err(
warp_utils::reject::not_synced("sync is stalled".to_string()),
Collaborator:

We should differentiate these two. The Display impl for SyncState might be enough for this error

beacon_node/network/src/sync/range_sync/batch.rs (outdated review thread, resolved)
@@ -109,6 +109,8 @@ pub enum ChainSyncingState {
    Stopped,
    /// The chain is undergoing syncing.
    Syncing,
    /// The chain sync is stalled because the execution layer is offline.
Collaborator:

After reading the rest of the code, I'm thinking this is not needed. The Stopped chain state already prevents batches from being sent for processing and prevents requesting additional batches. To inform all chains of the engine's general state, without worrying about propagating it up or down, I think we could keep the engine state in some form of shared mutable state (an Arc and a lock?). That way newly created chains can rely on that state.
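For illustration, a minimal sketch of that shared-state idea using std's Arc and RwLock; the SharedEngineState type and its methods are hypothetical, not anything in this PR.

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, PartialEq, Eq)]
enum EngineState {
    Online,
    Offline,
}

/// A cloneable handle that every chain, including newly created ones, can hold and
/// consult before sending batches for processing.
#[derive(Clone)]
struct SharedEngineState(Arc<RwLock<EngineState>>);

impl SharedEngineState {
    fn new() -> Self {
        SharedEngineState(Arc::new(RwLock::new(EngineState::Online)))
    }

    /// Called by whichever component detects the EE going down or coming back up.
    fn set(&self, state: EngineState) {
        *self.0.write().expect("engine state lock poisoned") = state;
    }

    /// Checked by a chain before it sends a batch for processing.
    fn is_online(&self) -> bool {
        *self.0.read().expect("engine state lock poisoned") == EngineState::Online
    }
}
```

The commit message further down this thread describes the design that eventually landed: the engine state behind a RwLock plus a watch channel for change notifications.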

beacon_node/network/src/sync/range_sync/chain.rs (outdated review thread, resolved)
Comment on lines 738 to 741
BatchState::Failed
| BatchState::AwaitingDownload
| BatchState::Processing(_)
| BatchState::AwaitingValidation(_) => {
Collaborator:

Why are these grouped now?

pawanjay176 (Member, Author):

Damn, I think this got deleted in some bad merge. Thanks!

Comment on lines 94 to 96
if let BlockLookupStatus::WaitingOnExecution = self.status() {
    return;
}
Collaborator:

I think we can still request blocks, taking care to limit how many of them we get.

Comment on lines 237 to 239
if let BlockLookupStatus::WaitingOnExecution = self.status() {
    return;
}
Collaborator:

Aren't we dropping the block we just requested?

Comment on lines 415 to 418
) -> BlockLookupStatus {
    if let BlockLookupStatus::WaitingOnExecution = self.status() {
        return self.status();
    }
Collaborator:

If processing succeeded here, I think we still want to handle that case.

Comment on lines 151 to 154
struct WaitingOnExecution {
    range: bool,
    block_lookup: bool,
}
Collaborator:

In what scenario would one be stalled and the other not, beyond the window in which one is aware of it but the other isn't yet?

pawanjay176 (Member, Author):

I separated them so that we don't try to resume RangeSync when range sync had never actually started and stalled in the first place. Basically this scenario:

  1. The node is fully synced or undergoing historical sync, and execution goes offline.
  2. ParentLookup gets paused on execution.
  3. Execution comes back online, the ParentLookup status is set to Activated, and execution_ready is sent to RangeSync even though syncing is stopped.

Maybe we can handle this better in RangeSync so that there's no invalid state if ChainState == Idle?

Collaborator:

I think this would work, so that we handle two separate states: the sync state, which imo stays the same, and the EE online status. A stopped chain should not do anything even if the EE is back online (this should be equivalent to the era before the EL/CL division: the chain assumes it can send chains for processing, but doesn't). I see it similarly to the rest of the syncing forms: if we don't see that we can send blocks/batches/chains for processing, we don't. As far as I remember, every sync form should already handle the case in which we store blocks but don't send them for some reason, except maybe single block lookups.

bors bot pushed a commit that referenced this pull request Jun 7, 2022
## Issue Addressed
Currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails and heavily penalizes peers when we could simply retry. Inspired by a proposal I made to #3094.

## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt.
This also removes some annoying logs.

## Additional Info
We still keep a counter of ignored attempts, just in case.
bors bot pushed a commit that referenced this pull request Jun 19, 2022
## Issue Addressed

Partly resolves #3032

## Proposed Changes

Extracts some of the functionality of #3094 into a separate PR as the original PR requires a bit more work.
Do not unnecessarily penalize peers when we fail to validate received execution payloads because our execution layer is offline.
@paulhauner added the v2.5.0 label on Jul 20, 2022
@paulhauner (Member):

I'm dropping the v2.5.0 tag after a chat with @AgeManning. I think it's totally fine that this isn't in v2.5.0 and I wouldn't like to rush this or delay the release ☺️

@pawanjay176 (Member, Author):

Closing this in favor of #3428

bors bot pushed a commit that referenced this pull request Aug 24, 2022
## Issue Addressed

#3032

## Proposed Changes

Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html), which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc. (a minimal sketch of this pattern follows the list).
- Sync waits for state changes concurrently with normal messages.
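For illustration, a minimal sketch of this notification pattern with tokio::sync::watch; the EngineState enum and the EngineStateNotifier and sync_loop names are stand-ins, not the actual types introduced by this commit.

```rust
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum EngineState {
    Online,
    Offline,
}

/// Notifier side: only publishes a value when the state actually changes. A real
/// implementation would guard the check-then-send behind a lock, as described above.
struct EngineStateNotifier {
    tx: watch::Sender<EngineState>,
}

impl EngineStateNotifier {
    fn set(&self, new_state: EngineState) {
        if *self.tx.borrow() != new_state {
            // Receivers only ever observe the latest value, so ordering is not a concern.
            let _ = self.tx.send(new_state);
        }
    }
}

/// Sync side: react to engine-state changes. In the real sync manager this await
/// sits inside a `tokio::select!` alongside the normal message queue.
async fn sync_loop(mut rx: watch::Receiver<EngineState>) {
    while rx.changed().await.is_ok() {
        match *rx.borrow() {
            EngineState::Online => { /* resume: flush ready-for-processing batches */ }
            EngineState::Offline => { /* pause: stop sending batches for processing */ }
        }
    }
}
```

The channel would be created once with watch::channel(EngineState::Online), the sender kept by the execution-layer code and the receiver handed to sync.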

#### Pause Sync
Sync has four components, pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving access to it only if the ee is responsive (see the sketch after this list). I find this the simplest and most powerful approach, since we don't need to deal with new sync states, and chain segments that are added while the ee is offline follow the same logic without needing to synchronize a shared state among them. Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments, so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.
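For illustration, a rough sketch of the "guard the channel" idea for range sync; GuardedBeaconProcessorSend, WorkEvent and the field names are assumptions for the sketch, not the real types.

```rust
use tokio::sync::mpsc;

#[derive(Clone, Copy, PartialEq, Eq)]
enum EngineState {
    Online,
    Offline,
}

/// Stand-in for whatever work item the beacon processor actually consumes.
struct WorkEvent;

/// Wraps the beacon-processor sender and only hands it out while the execution
/// engine is responsive, so chains simply cannot submit batches during an outage.
struct GuardedBeaconProcessorSend {
    engine_state: EngineState,
    beacon_processor_tx: mpsc::Sender<WorkEvent>,
}

impl GuardedBeaconProcessorSend {
    fn processor_send(&self) -> Option<&mpsc::Sender<WorkEvent>> {
        if self.engine_state == EngineState::Online {
            Some(&self.beacon_processor_tx)
        } else {
            // EE offline: the caller keeps its ready batch and retries on resume.
            None
        }
    }
}
```

Because the pause is passive (access is simply denied), chains created during the outage need no extra state and the tracked chain segments survive untouched until resume.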

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pausing range sync does is not sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.

## Additional Info

**QUESTION**: Originally I made this notify and change on the synced state, but @pawanjay176, in talks with @paulhauner, concluded we only need to track online/offline states. The upcheck function mentions extra checks to keep a very up-to-date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review whether the extra check can be removed.

Next gen of #3094

Will work best with #3439 

Co-authored-by: Pawan Dhananjay <[email protected]>
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Labels: bellatrix, under-review, v3.1.0