Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(en): gracefully shutdown en waiting for reorg detector #270

Merged

Conversation

montekki
Copy link
Contributor

@montekki montekki commented Oct 19, 2023

This PR changes the shutdown order of the external node to a more graceful one.

Why?

Currently sutting down the EN because of reorgs and inconsistency checks may be raceful and a shutting down consistency checker may bring down the reorg detector with it which would prevent the reorg detector from doing it's job in the legitimate cases of reorgs where the node actually needs to be rolled back to some previous state.

What happens currently?

The concerns have been focused around the logic of consistency checker that used to panic when an inconsistency was found. That is no longer the case (anyhow::bail! used to be panic):

Ok(false) => {
anyhow::bail!("Batch {} is inconsistent with L1", batch_number.0);
}

however, this error would still unconditionally bring the node down because currently the handler for ConsistencyChecker is a part of wait_tasks that would resolve if any of the tasks has resolved:

tokio::select! {
_ = wait_for_tasks(task_handles, particular_crypto_alerts, graceful_shutdown, tasks_allowed_to_finish) => {},
_ = sigint_receiver => {
tracing::info!("Stop signal received, shutting down");

The proposed logic works as follows:

The handle to reorg detector is fused and on the node shutdown, after all the components are gracefully shutdown we manually check the result of the reorg detector. This logic handles both situations:

  1. node is shutting down
  2. reorg detector has found a reorg and is performing a rollback and shutdown.

@montekki montekki requested a review from a team as a code owner October 19, 2023 20:55
@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (ec41692) 35.61% compared to head (a50031f) 35.62%.
Report is 9 commits behind head on main.

❗ Current head a50031f differs from pull request most recent head f8d670d. Consider uploading reports for the commit f8d670d to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #270   +/-   ##
=======================================
  Coverage   35.61%   35.62%           
=======================================
  Files         520      520           
  Lines       28352    28348    -4     
=======================================
  Hits        10098    10098           
+ Misses      18254    18250    -4     
Files Coverage Δ
core/bin/external_node/src/config/mod.rs 21.42% <ø> (ø)
core/bin/external_node/src/main.rs 0.00% <ø> (ø)
core/lib/zksync_core/src/reorg_detector/mod.rs 46.15% <0.00%> (ø)

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

popzxc
popzxc previously approved these changes Oct 20, 2023
core/bin/external_node/src/main.rs Show resolved Hide resolved
@montekki montekki changed the title fix: gracefully shutdown en waiting for reorg detector fix(en): gracefully shutdown en waiting for reorg detector Oct 20, 2023
popzxc
popzxc previously approved these changes Oct 20, 2023
Deniallugo
Deniallugo previously approved these changes Oct 20, 2023
@montekki montekki dismissed stale reviews from Deniallugo and popzxc via 3ad154c October 20, 2023 09:50
@montekki montekki requested a review from a team as a code owner October 20, 2023 09:50
Deniallugo
Deniallugo previously approved these changes Oct 20, 2023
core/bin/external_node/src/main.rs Outdated Show resolved Hide resolved
core/bin/external_node/src/main.rs Outdated Show resolved Hide resolved
Deniallugo
Deniallugo previously approved these changes Oct 20, 2023
@RomanBrodetski
Copy link
Collaborator

Taking a step back - why do we need such an (arguably) complicated logic? It's partly because we need to run the reorg detector in parallel. We have a similar case with Circuit Breaker on the main node. Is there any potential to extract some abstraction?

@popzxc
Copy link
Member

popzxc commented Oct 25, 2023

@montekki I think it's a good solution for now. A proper task management entity is in the plans too.

@montekki montekki enabled auto-merge October 26, 2023 10:00
@montekki montekki added this pull request to the merge queue Oct 26, 2023
Merged via the queue into main with commit f048485 Oct 26, 2023
@montekki montekki deleted the fvs-pla-514-reorg-detector-and-consistency-checker-are-in-a-race branch October 26, 2023 10:36
github-merge-queue bot pushed a commit that referenced this pull request Oct 26, 2023
🤖 I have created a release *beep* *boop*
---


##
[16.2.0](core-v16.1.0...core-v16.2.0)
(2023-10-26)


### Features

* **basic_witness_producer_input:** Add Basic Witness Producer Input
component ([#156](#156))
([3cd24c9](3cd24c9))
* **core:** adding pubdata to statekeeper and merkle tree
([#259](#259))
([1659c84](1659c84))


### Bug Fixes

* **db:** Fix root cause of RocksDB misbehavior
([#301](#301))
([d6c30ab](d6c30ab))
* **en:** gracefully shutdown en waiting for reorg detector
([#270](#270))
([f048485](f048485))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants