Pause sync when execution layer is offline #3094
Conversation
Force-pushed from 194426c to 1f45341
So far I've only skimmed the range part and I think there are some problems; I made a test in case you want to include it / check what I did: divagant-martian@73dcdca

I have now also skimmed the parent lookup part and have a couple of questions.
```rust
/// The batch is waiting for the execution layer to resume validation.
WaitingOnExecution,
```
I still don't know why clippy does not pick this up, but `BatchState::AwaitingExecution` is not created anywhere. When we get a `stall_execution = true` we are using `BatchState::AwaitingDownload` without counting an additional failed attempt. I think this is the way to go and the new state is not necessary.
Sorry, forgot to remove this one. I think clippy didn't pick this up because the enum variant was checked in match statements.
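The transition described above can be sketched in isolation. This is a hypothetical, simplified model (the names mirror but are not the actual Lighthouse types): on an EE-offline failure the batch goes back to `AwaitingDownload` without incrementing its failed-attempt counter, so the peer is not penalized.

```rust
#[derive(Debug, PartialEq)]
enum BatchState {
    AwaitingDownload,
    Processing,
    Failed,
}

struct Batch {
    state: BatchState,
    failed_attempts: u8,
}

impl Batch {
    /// Processing failed because the EE is offline: the peer is not at
    /// fault, so retry the download without counting an attempt.
    fn execution_stalled(&mut self) {
        self.state = BatchState::AwaitingDownload;
        // `failed_attempts` deliberately left untouched
    }

    /// Processing failed for a reason attributable to the peer.
    fn processing_failed(&mut self) {
        self.failed_attempts += 1;
        self.state = if self.failed_attempts >= 3 {
            BatchState::Failed
        } else {
            BatchState::AwaitingDownload
        };
    }
}

fn main() {
    let mut batch = Batch { state: BatchState::Processing, failed_attempts: 0 };
    batch.execution_stalled();
    assert_eq!(batch.state, BatchState::AwaitingDownload);
    assert_eq!(batch.failed_attempts, 0); // no attempt counted
    println!("ok");
}
```

The retry threshold of 3 is illustrative only; the real batch state machine lives in range sync's `batch.rs`.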
```rust
pub fn execution_stalled(&mut self) {
    self.state = ChainSyncingState::ExecutionStalled;
}

pub fn execution_resumed(
    &mut self,
    network: &mut SyncNetworkContext<T::EthSpec>,
) -> ProcessingResult {
    if let ChainSyncingState::ExecutionStalled = self.state {
        self.state = ChainSyncingState::Syncing;
        return self.request_batches(network);
    }
    Err(RemoveChain::WrongBatchState(
        "Invalid batch state".to_string(),
    ))
}
```
I didn't find references to the `execution_stalled` function. I would expect this to be called for other chains, because the failure is identified at the chain level but right now not propagated up. So, if one chain gets the error, it stalls, and on resume all chains are resumed; they find themselves in a wrong state and ask to be removed. I think this is hard to notice in tests / masked, since sync is relatively self-resilient.
Nice catch. My thinking was that each chain would get stalled once it got an offline execution error. But I didn't consider that they might go into an invalid state if we resume before they go into waiting.
```rust
@@ -71,6 +74,7 @@ impl SyncState {
            SyncState::BackFillSyncing { .. } => false,
            SyncState::Synced => false,
            SyncState::Stalled => false,
            SyncState::WaitingOnExecution => false,
```
This affects some things, like work we do on the beacon processor. Here we are known to be not-synced but are not sending batches for processing. I'm not sure of the right value here @paulhauner
I still have not finished checking everything, but will continue as we progress.
Overall my main concern is how we tell the rest (other chains in range, backfill, parent lookups) that they should stop processing. Maybe I'm overlooking something, but I would guess a shared state would be more reliable / easier to handle vs having a new state in each form of sync we have
beacon_node/http_api/src/lib.rs (outdated)
```rust
            Ok(())
        }
        SyncState::Stalled | SyncState::WaitingOnExecution => Err(
            warp_utils::reject::not_synced("sync is stalled".to_string()),
```
We should differentiate these two. The `Display` impl for `SyncState` might be enough for this error.
```rust
@@ -109,6 +109,8 @@ pub enum ChainSyncingState {
    Stopped,
    /// The chain is undergoing syncing.
    Syncing,
    /// The chain sync is stalled because the execution layer is offline.
```
I'm thinking, after reading the rest of the code, that this is not needed. The `Stopped` chain state already handles preventing batches from being sent for processing, and preventing requesting additional batches. To inform all chains of the general state, without worrying about propagating it up or down, I think we could have the engine's state in some form of shared mutable state (an `Arc` and a lock?). That way newly created chains can rely on that state.
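The shared-state idea suggested above could look roughly like this. A minimal std-only sketch (names like `EngineState` and `can_send_for_processing` are illustrative, not the actual Lighthouse API): every chain holds a clone of an `Arc<RwLock<_>>` handle and consults it before sending batches for processing, so newly created chains see the current EE status without any message propagation.

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, Debug, PartialEq)]
enum EngineState {
    Online,
    Offline,
}

// Each syncing chain (including newly created ones) holds a clone of this
// handle and reads the current EE state before sending batches.
type SharedEngineState = Arc<RwLock<EngineState>>;

fn can_send_for_processing(engine: &SharedEngineState) -> bool {
    *engine.read().unwrap() == EngineState::Online
}

fn main() {
    let engine: SharedEngineState = Arc::new(RwLock::new(EngineState::Online));
    let chain_view = Arc::clone(&engine); // a newly created chain gets a clone
    assert!(can_send_for_processing(&chain_view));

    *engine.write().unwrap() = EngineState::Offline; // EE goes offline
    assert!(!can_send_for_processing(&chain_view)); // all chains observe it
    println!("ok");
}
```

The follow-up PR ended up with a related but refined design: the state lives behind a `RwLock` and changes are broadcast over a watch channel rather than polled.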
```rust
BatchState::Failed
| BatchState::AwaitingDownload
| BatchState::Processing(_)
| BatchState::AwaitingValidation(_) => {
```
Why are these grouped now?
damn I think this got deleted in some bad merge. Thanks!
```rust
if let BlockLookupStatus::WaitingOnExecution = self.status() {
    return;
}
```
I think we can still request blocks, taking care of limiting how many of those we get
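The suggestion above (keep requesting blocks, but bound how many we hold while the EE is offline) could be sketched as a simple capped queue. This is purely illustrative; `MAX_PENDING_LOOKUPS` and `PendingLookups` are hypothetical names, not from the PR.

```rust
use std::collections::VecDeque;

const MAX_PENDING_LOOKUPS: usize = 16; // illustrative cap, not from the PR

struct PendingLookups {
    queue: VecDeque<[u8; 32]>, // block roots held until the EE recovers
}

impl PendingLookups {
    /// Accept a new lookup, evicting the oldest if the cap is reached so
    /// memory stays bounded during a long EE outage.
    fn push(&mut self, root: [u8; 32]) {
        if self.queue.len() == MAX_PENDING_LOOKUPS {
            self.queue.pop_front();
        }
        self.queue.push_back(root);
    }
}

fn main() {
    let mut pending = PendingLookups { queue: VecDeque::new() };
    for i in 0..20u8 {
        pending.push([i; 32]);
    }
    assert_eq!(pending.queue.len(), MAX_PENDING_LOOKUPS);
    assert_eq!(pending.queue.front(), Some(&[4u8; 32])); // oldest 4 evicted
    println!("ok");
}
```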
```rust
if let BlockLookupStatus::WaitingOnExecution = self.status() {
    return;
}
```
aren't we dropping the block we just requested?
```rust
) -> BlockLookupStatus {
    if let BlockLookupStatus::WaitingOnExecution = self.status() {
        return self.status();
    }
```
if processing succeeded here I think we still want to handle that case
```rust
struct WaitingOnExecution {
    range: bool,
    block_lookup: bool,
}
```
In what scenario would it happen that one is stalled and the other is not, beyond the window in which one is aware of it but the other is not?
I separated them so that we don't resume sync on `RangeSync` if range sync had not even started and then subsequently stalled. Basically this scenario:
- Node is fully synced/undergoing historical sync; now execution goes offline
- `ParentLookup` gets paused on execution
- Execution comes back online, `ParentLookup` status is set to `Activated` and `execution_ready` is sent to `RangeSync` even though syncing is stopped.

Maybe we can handle this better in `RangeSync` such that there's no invalid state if `ChainState == Idle`?
I think this would work, so that we handle two separate states. One is the sync state, which imo stays the same, and the other is the EE online status. A stopped chain should not do anything even if the EE is back online (this should be equivalent to the era pre EL/CL division: basically the chain assumes it can send batches for processing, but doesn't). I see it as similar to the rest of the syncing forms: if we don't see that we can send blocks/batches/chains for processing, we don't. As far as I remember, every sync form should already handle the case in which we store blocks but don't send them for some reason, except maybe single block lookups.
## Issue Addressed
Currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalizes peers, when we can simply retry. Inspired by a proposal I made to #3094

## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt. Also removes some annoying logs.

## Additional Info
We still get a counter on ignored attempts, just in case.
I'm dropping the
Closing this in favor of #3428
## Issue Addressed
#3032

## Proposed Changes
Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html), which ensures only the last value is kept in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.

#### Pause Sync
Sync has four components; pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving chains access to it only if the ee is responsive. I find this the simplest and most powerful approach, since we don't need to deal with new sync states, and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among them. Another advantage of passive pause vs active pause is that we can still keep track of actively advertised chain segments, so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pausing range does is not sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.

## Additional Info
**QUESTION**: Originally I made this notify and change on synced state, but @pawanjay176, in talks with @paulhauner, concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up-to-date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review whether the extra check can be removed.

Next gen of #3094
Will work best with #3439

Co-authored-by: Pawan Dhananjay <[email protected]>
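The latest-value semantics of the watch channel described above (rapid online/offline flaps coalesce, and the receiver only ever sees the most recent state) can be illustrated with a minimal std-only model. This is a hand-rolled sketch of the behavior, not the actual tokio implementation or the PR's `State` struct.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq)]
enum EngineState {
    Online,
    Offline,
}

// A minimal latest-value channel mimicking tokio::sync::watch semantics:
// only the newest value is stored, tagged with a version counter.
struct Watch {
    inner: Mutex<(u64, EngineState)>, // (version, latest value)
    cond: Condvar,
}

impl Watch {
    fn send(&self, value: EngineState) {
        let mut guard = self.inner.lock().unwrap();
        if guard.1 != value {
            // notify only on actual state changes, as the PR's `State` does
            guard.0 += 1;
            guard.1 = value;
            self.cond.notify_all();
        }
    }

    /// Block until the version advances past `seen`, then return the latest.
    fn wait_newer(&self, seen: u64) -> (u64, EngineState) {
        let mut guard = self.inner.lock().unwrap();
        while guard.0 <= seen {
            guard = self.cond.wait(guard).unwrap();
        }
        *guard
    }
}

fn main() {
    let watch = Arc::new(Watch {
        inner: Mutex::new((0, EngineState::Online)),
        cond: Condvar::new(),
    });

    let w = Arc::clone(&watch);
    let sender = thread::spawn(move || {
        w.send(EngineState::Offline);
        w.send(EngineState::Online);
        w.send(EngineState::Offline); // rapid flaps get coalesced
    });
    sender.join().unwrap();

    let (version, state) = watch.wait_newer(0);
    assert_eq!(state, EngineState::Offline); // only the latest value is seen
    assert_eq!(version, 3);
    println!("ok");
}
```

In the actual PR, sync `select!`s over this kind of receiver alongside its normal message queue, which is how "sync waits for state changes concurrently with normal messages" is achieved.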
## Issue Addressed
Resolves #3032

## Proposed Changes
This PR fixes 2 issues with post-merge sync:
- Execution engine offline errors were treated similarly to other `BeaconChainError`s in range sync, which caused peers to get downscored and kicked for potentially no fault of theirs. This PR pauses `RangeSync` when the EE is offline and only resumes after the EE is back online. It doesn't do any leftover block processing until the EE is back online, and hence doesn't disconnect from any connected peers. It polls the EE to check if it's online every 5 seconds and resumes the chains once it is back online.
- If the EE goes offline when the BN is fully synced, processing parent chain lookups returns EE errors, which causes the parent chain to get added to the `failed_chains` list. Any subsequent request for the same parent chain gets ignored. Here, even a single-slot EE outage causes a full epoch of beacon chain outage, because failed chains aren't processed until range sync kicks in after an epoch. To fix this, we don't add the parent lookup to the `failed_chains` list on EE offline errors.

## Additional Info
Currently rebased on #3036 until it gets merged to unstable.