Pause sync when execution layer is offline #3094
Conversation
Force-pushed from 194426c to 1f45341
So far I've only skimmed the range part and I think there are some problems; I made a test in case you want to include it / check what I did: divagant-martian@73dcdca

I have now also skimmed the parent lookup part and have a couple of questions.
```rust
/// The batch is waiting for the execution layer to resume validation.
WaitingOnExecution,
```
I still don't know why clippy does not pick this up, but `BatchState::AwaitingExecution` is not created anywhere. When we get a `stall_execution = true` we are using `BatchState::AwaitingDownload` without counting an additional failed attempt. I think this is the way to go and the new state is not necessary.
Sorry, forgot to remove this one. I think clippy didn't pick this up because the enum variant was checked in match statements.
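The transition described above can be sketched in isolation. This is a hypothetical, simplified model (the names mirror but are not the actual Lighthouse types): on an EE-offline failure the batch goes back to `AwaitingDownload` without incrementing its failed-attempt counter, so the peer is not penalized.

```rust
#[derive(Debug, PartialEq)]
enum BatchState {
    AwaitingDownload,
    Processing,
    Failed,
}

struct Batch {
    state: BatchState,
    failed_attempts: u8,
}

impl Batch {
    /// Processing failed because the EE is offline: the peer is not at
    /// fault, so retry the download without counting an attempt.
    fn execution_stalled(&mut self) {
        self.state = BatchState::AwaitingDownload;
        // `failed_attempts` deliberately left untouched
    }

    /// Processing failed for a reason attributable to the peer.
    fn processing_failed(&mut self) {
        self.failed_attempts += 1;
        self.state = if self.failed_attempts >= 3 {
            BatchState::Failed
        } else {
            BatchState::AwaitingDownload
        };
    }
}

fn main() {
    let mut batch = Batch { state: BatchState::Processing, failed_attempts: 0 };
    batch.execution_stalled();
    assert_eq!(batch.state, BatchState::AwaitingDownload);
    assert_eq!(batch.failed_attempts, 0); // no attempt counted
    println!("ok");
}
```

The retry threshold of 3 is illustrative only; the real batch state machine lives in range sync's `batch.rs`.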
```rust
pub fn execution_stalled(&mut self) {
    self.state = ChainSyncingState::ExecutionStalled;
}

pub fn execution_resumed(
    &mut self,
    network: &mut SyncNetworkContext<T::EthSpec>,
) -> ProcessingResult {
    if let ChainSyncingState::ExecutionStalled = self.state {
        self.state = ChainSyncingState::Syncing;
        return self.request_batches(network);
    }
    Err(RemoveChain::WrongBatchState(
        "Invalid batch state".to_string(),
    ))
}
```
I didn't find references to the `execution_stalled` function. I would expect this to be called for other chains, because the failure is identified at the chain level but right now not propagated up. So, if one chain gets the error, it stalls, and on resume all chains are resumed; they find themselves in a wrong state and ask to be removed. I think this is hard to notice in tests / masked, since sync is relatively self-resilient.
Nice catch. My thinking was that each chain would get stalled once it got an offline execution error. But I didn't consider that they might go into an invalid state if we resume before they go into waiting.
```rust
@@ -71,6 +74,7 @@ impl SyncState {
            SyncState::BackFillSyncing { .. } => false,
            SyncState::Synced => false,
            SyncState::Stalled => false,
            SyncState::WaitingOnExecution => false,
```
This affects some things, like work we do on the beacon processor. Here we are known to be not-synced but are not sending batches for processing. I'm not sure of the right value here @paulhauner
I still have not finished checking everything, but will continue as we progress.
Overall my main concern is how we tell the rest (other chains in range, backfill, parent lookups) that they should stop processing. Maybe I'm overlooking something, but I would guess a shared state would be more reliable / easier to handle vs having a new state in each form of sync we have
beacon_node/http_api/src/lib.rs (outdated)
```rust
            Ok(())
        }
        SyncState::Stalled | SyncState::WaitingOnExecution => Err(
            warp_utils::reject::not_synced("sync is stalled".to_string()),
```
We should differentiate these two. The `Display` impl for `SyncState` might be enough for this error.
```rust
@@ -109,6 +109,8 @@ pub enum ChainSyncingState {
    Stopped,
    /// The chain is undergoing syncing.
    Syncing,
    /// The chain sync is stalled because the execution layer is offline.
```
I'm thinking, after reading the rest of the code, that this is not needed. The `Stopped` chain state already handles preventing batches from being sent for processing, and preventing requesting additional batches. To inform all chains of the general state, without worrying about propagating it up or down, I think we could have the engine's state in some form of shared mutable state (an `Arc` and a lock?). That way newly created chains can rely on that state.
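The shared-state idea suggested above could look roughly like this. A minimal std-only sketch (names like `EngineState` and `can_send_for_processing` are illustrative, not the actual Lighthouse API): every chain holds a clone of an `Arc<RwLock<_>>` handle and consults it before sending batches for processing, so newly created chains see the current EE status without any message propagation.

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, Debug, PartialEq)]
enum EngineState {
    Online,
    Offline,
}

// Each syncing chain (including newly created ones) holds a clone of this
// handle and reads the current EE state before sending batches.
type SharedEngineState = Arc<RwLock<EngineState>>;

fn can_send_for_processing(engine: &SharedEngineState) -> bool {
    *engine.read().unwrap() == EngineState::Online
}

fn main() {
    let engine: SharedEngineState = Arc::new(RwLock::new(EngineState::Online));
    let chain_view = Arc::clone(&engine); // a newly created chain gets a clone
    assert!(can_send_for_processing(&chain_view));

    *engine.write().unwrap() = EngineState::Offline; // EE goes offline
    assert!(!can_send_for_processing(&chain_view)); // all chains observe it
    println!("ok");
}
```

The follow-up PR ended up with a related but refined design: the state lives behind a `RwLock` and changes are broadcast over a watch channel rather than polled.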
```rust
BatchState::Failed
| BatchState::AwaitingDownload
| BatchState::Processing(_)
| BatchState::AwaitingValidation(_) => {
```
Why are these grouped now?
damn I think this got deleted in some bad merge. Thanks!
```rust
if let BlockLookupStatus::WaitingOnExecution = self.status() {
    return;
}
```
I think we can still request blocks, taking care of limiting how many of those we get
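The suggestion above (keep requesting blocks, but bound how many we hold while the EE is offline) could be sketched as a simple capped queue. This is purely illustrative; `MAX_PENDING_LOOKUPS` and `PendingLookups` are hypothetical names, not from the PR.

```rust
use std::collections::VecDeque;

const MAX_PENDING_LOOKUPS: usize = 16; // illustrative cap, not from the PR

struct PendingLookups {
    queue: VecDeque<[u8; 32]>, // block roots held until the EE recovers
}

impl PendingLookups {
    /// Accept a new lookup, evicting the oldest if the cap is reached so
    /// memory stays bounded during a long EE outage.
    fn push(&mut self, root: [u8; 32]) {
        if self.queue.len() == MAX_PENDING_LOOKUPS {
            self.queue.pop_front();
        }
        self.queue.push_back(root);
    }
}

fn main() {
    let mut pending = PendingLookups { queue: VecDeque::new() };
    for i in 0..20u8 {
        pending.push([i; 32]);
    }
    assert_eq!(pending.queue.len(), MAX_PENDING_LOOKUPS);
    assert_eq!(pending.queue.front(), Some(&[4u8; 32])); // oldest 4 evicted
    println!("ok");
}
```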
```rust
if let BlockLookupStatus::WaitingOnExecution = self.status() {
    return;
}
```
aren't we dropping the block we just requested?
```rust
) -> BlockLookupStatus {
    if let BlockLookupStatus::WaitingOnExecution = self.status() {
        return self.status();
    }
```
if processing succeeded here I think we still want to handle that case
```rust
struct WaitingOnExecution {
    range: bool,
    block_lookup: bool,
}
```
In what scenario would it happen that one is stalled and the other is not, beyond the window in which one is aware of it but the other is not?
I separated them so that we don't resume sync on `RangeSync` if range sync had not even started and then subsequently stalled. Basically this scenario:
- Node is fully synced/undergoing historical sync; now execution goes offline
- `ParentLookup` gets paused on execution
- Execution comes back online, `ParentLookup` status is set to `Activated` and `execution_ready` is sent to `RangeSync` even though syncing is stopped.

Maybe we can handle this better in `RangeSync` such that there's no invalid state if `ChainState == Idle`?
I think this would work, so that we handle two separate states. One is the sync state, which imo stays the same, and the other is the EE online status. A stopped chain should not do anything even if the EE is back online (this should be equivalent to the era pre EL/CL division: basically the chain assumes it can send batches for processing, but doesn't). I see it as similar to the rest of the syncing forms: if we don't see that we can send blocks/batches/chains for processing, we don't. As far as I remember, every sync form should already handle the case in which we store blocks but don't send them for some reason, except maybe single block lookups.
## Issue Addressed
Currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalizes peers, when we can simply retry. Inspired by a proposal I made to #3094

## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt. Also removes some annoying logs.

## Additional Info
We still get a counter on ignored attempts, just in case.
I'm dropping the
Closing this in favor of #3428
## Issue Addressed
#3032

## Proposed Changes
Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html), which ensures only the last value is kept in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.

#### Pause Sync
Sync has four components; pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving chains access to it only if the ee is responsive. I find this the simplest and most powerful approach, since we don't need to deal with new sync states, and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among them. Another advantage of passive pause vs active pause is that we can still keep track of actively advertised chain segments, so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pausing range does is not sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.

## Additional Info
**QUESTION**: Originally I made this notify and change on synced state, but @pawanjay176, in talks with @paulhauner, concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up-to-date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review whether the extra check can be removed.

Next gen of #3094
Will work best with #3439

Co-authored-by: Pawan Dhananjay <[email protected]>
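The latest-value semantics of the watch channel described above (rapid online/offline flaps coalesce, and the receiver only ever sees the most recent state) can be illustrated with a minimal std-only model. This is a hand-rolled sketch of the behavior, not the actual tokio implementation or the PR's `State` struct.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq)]
enum EngineState {
    Online,
    Offline,
}

// A minimal latest-value channel mimicking tokio::sync::watch semantics:
// only the newest value is stored, tagged with a version counter.
struct Watch {
    inner: Mutex<(u64, EngineState)>, // (version, latest value)
    cond: Condvar,
}

impl Watch {
    fn send(&self, value: EngineState) {
        let mut guard = self.inner.lock().unwrap();
        if guard.1 != value {
            // notify only on actual state changes, as the PR's `State` does
            guard.0 += 1;
            guard.1 = value;
            self.cond.notify_all();
        }
    }

    /// Block until the version advances past `seen`, then return the latest.
    fn wait_newer(&self, seen: u64) -> (u64, EngineState) {
        let mut guard = self.inner.lock().unwrap();
        while guard.0 <= seen {
            guard = self.cond.wait(guard).unwrap();
        }
        *guard
    }
}

fn main() {
    let watch = Arc::new(Watch {
        inner: Mutex::new((0, EngineState::Online)),
        cond: Condvar::new(),
    });

    let w = Arc::clone(&watch);
    let sender = thread::spawn(move || {
        w.send(EngineState::Offline);
        w.send(EngineState::Online);
        w.send(EngineState::Offline); // rapid flaps get coalesced
    });
    sender.join().unwrap();

    let (version, state) = watch.wait_newer(0);
    assert_eq!(state, EngineState::Offline); // only the latest value is seen
    assert_eq!(version, 3);
    println!("ok");
}
```

In the actual PR, sync `select!`s over this kind of receiver alongside its normal message queue, which is how "sync waits for state changes concurrently with normal messages" is achieved.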
## Issue Addressed
Resolves #3032

## Proposed Changes
This PR fixes 2 issues with post-merge sync:
- Execution engine offline errors were treated similarly to other `BeaconChainError`s in range sync, which caused peers to get downscored and kicked for potentially no fault of theirs. This PR pauses `RangeSync` when the EE is offline and only resumes after the EE is back online. It doesn't do any leftover block processing until the EE is back online, and hence doesn't disconnect from any connected peers. It polls the EE to check if it's online every 5 seconds and resumes the chains once it is back online.
- If the EE goes offline when the BN is fully synced, processing parent chain lookups returns EE errors, which causes the parent chain to get added to the `failed_chains` list. Any subsequent request for the same parent chain gets ignored. Here, even a single-slot EE outage causes a full epoch of beacon chain outage, because failed chains aren't processed until range sync kicks in after an epoch. To fix this, we don't add the parent lookup to the `failed_chains` list on EE offline errors.

## Additional Info
Currently rebased on #3036 until it gets merged to unstable.