Add debugging fields to archiver errors #3377

Closed
wants to merge 2 commits into from

teor2345
Member

We had a report of archiver errors on a timekeeper:

I noticed a timekeeper in an endless restart loop and always throwing this panic
archival-node-1 | 2025-02-07T17:05:49.143589Z INFO Consensus: sc_consensus_subspace::archiver: Resuming archiver from last archived block last_archived_block_number=2358
archival-node-1 | thread 'tokio-runtime-worker' panicked at /code/crates/sc-consensus-subspace/src/archiver.rs:713:14:
archival-node-1 | Incorrect parameters for archiver: InvalidBlockSmallSize { block_bytes: 874, archived_block_bytes: 1751094 }
archival-node-1 | note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Things I did:

  • I switched cores for the timekeeper thread to run on (no luck),
  • I then removed the timekeeper flags and tried to start it but no luck.
  • I then removed the timekeeper flags and wiped the node and it started.

Looking at the code, there's nowhere that changes the segment header (last archived block) or block data. So this seems like corruption in the segment or block stores due to overclocking. This PR adds debugging fields to archiver instantiation errors, to help diagnose similar errors in future.

It also adds a PartialEq implementation for Archiver, which simplifies existing tests. I plan to use it in future archiver object mapping or retrieval tests. (Unfortunately this impl can't be cfg(test), because it's used in integration tests.)
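As a rough illustration of the kind of impl described above, here is a minimal sketch. `ArchiverSketch` and all of its fields are hypothetical stand-ins, not the real `Archiver` fields; the only point is that a hand-written `PartialEq` can compare logical state and skip fields that do not matter to tests.

```rust
/// Hypothetical, simplified stand-in for the real `Archiver`.
/// The field names are assumptions made for illustration only.
#[derive(Debug, Clone)]
pub struct ArchiverSketch {
    /// Bytes buffered for the next segment
    buffer: Vec<u8>,
    /// Number of the last archived block
    last_archived_block_number: u32,
    /// Scratch space, assumed to be irrelevant to logical equality
    scratch: Vec<u8>,
}

// Equality, mainly for use in integration tests.
// Only the logical state is compared; `scratch` is deliberately ignored.
impl PartialEq for ArchiverSketch {
    fn eq(&self, other: &Self) -> bool {
        self.buffer == other.buffer
            && self.last_archived_block_number == other.last_archived_block_number
    }
}
```

Because the impl is written by hand rather than derived, adding a new field to the struct does not change equality until the impl is updated, which is something a reviewer would need to keep in mind.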

Finally, this PR changes an incorrect (but harmless) BlockNumber type in that error to u32.

teor2345 added the "improvement" label (it is already working, but can be better) Feb 10, 2025
teor2345 self-assigned this Feb 10, 2025
teor2345 requested a review from nazar-pc as a code owner February 10, 2025 06:18
@@ -247,6 +262,21 @@ pub struct Archiver {
    last_archived_block: LastArchivedBlock,
}

// Equality, mainly for use in integration tests
Member

Semantically this doesn't really make a lot of sense IMO, which is why tests did not use it. There should be no need and it does not make sense to compare two archiver instances.

Comment on lines +205 to +216
    #[error(
        "Invalid last archived block, its size {block_bytes} bytes is the same as encoded \
        block, archived in segment: {prev_segment_index:?} {prev_segment_header_hash:?}"
    )]
    InvalidLastArchivedBlock {
        /// Already archived block size, which is equal to the full block size
        block_bytes: u32,
        /// The segment index for the segment that already archived the block
        prev_segment_index: SegmentIndex,
        /// The segment header hash for the segment that already archived the block
        prev_segment_header_hash: Blake3Hash,
    },
Member

Not sure how useful this actually is, we already have this information on higher level.

I'd probably replace "Incorrect parameters for archiver" panic with an error message that also logs the whole last_segment_header, which will have the same effect.
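One way to read this suggestion is sketched below: the low-level error stays small, and the caller wraps it together with the whole header before reporting it, instead of panicking. Every name here (`SegmentHeaderSketch`, `ArchiverInitError`, `check_block`, `initialize`) is a hypothetical stand-in, and the size check is an assumption about what triggers `InvalidBlockSmallSize`; this is not the crate's actual API.

```rust
use std::fmt;

/// Hypothetical, simplified stand-in for a segment header
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SegmentHeaderSketch {
    pub segment_index: u64,
    pub last_archived_block_number: u32,
    pub archived_block_bytes: u32,
}

/// Low-level instantiation error, modelled on the variant seen in the report
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InstantiationError {
    InvalidBlockSmallSize { block_bytes: u32, archived_block_bytes: u32 },
}

/// Higher-level error that carries the whole last segment header as context
#[derive(Debug)]
pub struct ArchiverInitError {
    pub error: InstantiationError,
    pub last_segment_header: SegmentHeaderSketch,
}

impl fmt::Display for ArchiverInitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Incorrect parameters for archiver: {:?} (last segment header: {:?})",
            self.error, self.last_segment_header
        )
    }
}

impl std::error::Error for ArchiverInitError {}

/// Assumed check: the encoded block must be at least as large as the part already archived
pub fn check_block(
    header: &SegmentHeaderSketch,
    block_bytes: u32,
) -> Result<(), InstantiationError> {
    if block_bytes < header.archived_block_bytes {
        Err(InstantiationError::InvalidBlockSmallSize {
            block_bytes,
            archived_block_bytes: header.archived_block_bytes,
        })
    } else {
        Ok(())
    }
}

/// Returns an error with full context instead of panicking
pub fn initialize(
    header: SegmentHeaderSketch,
    block_bytes: u32,
) -> Result<(), ArchiverInitError> {
    let result = check_block(&header, block_bytes);
    result.map_err(|error| ArchiverInitError {
        error,
        last_segment_header: header,
    })
}
```

With this shape the low-level error variants need no extra fields, and the caller decides how much context to log.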

@nazar-pc
Member

Also, the test seems to fail because of the prev_segment_header_hash check, where the expected hash was initialized to zero bytes. But the thing is, we do not actually care about the hash, we care about the bytes. The previous version of the test explicitly checked just the bytes, which partially acted as self-documentation: it clearly described the things that are relevant to the test and ignored those that are not.
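A check that looks only at the relevant field could be sketched as follows. The enum is a hypothetical, simplified copy of the variants discussed in this PR (with `u64` and `[u8; 32]` standing in for `SegmentIndex` and `Blake3Hash`), and `is_invalid_last_archived_block` is an illustrative helper, not something from the codebase.

```rust
/// Hypothetical, simplified copy of the error variants discussed in this PR
#[derive(Debug)]
pub enum ArchiverInstantiationErrorSketch {
    InvalidLastArchivedBlock {
        block_bytes: u32,
        prev_segment_index: u64,
        prev_segment_header_hash: [u8; 32],
    },
    InvalidBlockSmallSize {
        block_bytes: u32,
        archived_block_bytes: u32,
    },
}

/// Returns true if the error is `InvalidLastArchivedBlock` with the expected size.
/// The `..` in the pattern ignores the segment index and header hash on purpose.
pub fn is_invalid_last_archived_block(
    err: &ArchiverInstantiationErrorSketch,
    expected_bytes: u32,
) -> bool {
    matches!(
        err,
        ArchiverInstantiationErrorSketch::InvalidLastArchivedBlock { block_bytes, .. }
            if *block_bytes == expected_bytes
    )
}
```

A test written this way keeps passing when debugging fields are added or their values change, because it never names them.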

@teor2345
Member Author

Since you’ve asked me to rewrite almost the whole PR, I think it would be less confusing to close this one, and open another one if I get time.

teor2345 closed this Feb 10, 2025
teor2345 deleted the archiver-debug branch February 11, 2025 22:36