Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New historic state cache #6475

Merged

Conversation

michaelsproul
Copy link
Member

@michaelsproul michaelsproul commented Oct 9, 2024

Proposed Changes

Attempt to unify the new HDiffBuffer cache with the old "historic state cache" by creating a cache that holds either or both kinds of states, and lazily converts between the formats as required.

Additional Info

Performance testing on mainnet with --hierarchy-exponents 5,7,11 shows much improved performance for sequential queries. Most queries after the first are served in 400-700ms.

@michaelsproul michaelsproul added work-in-progress PR is a work-in-progress optimization Something to make Lighthouse run more efficiently. tree-states Upcoming state and database overhaul labels Oct 9, 2024
@michaelsproul
Copy link
Member Author

This is working quite well for the sequential query case. The tricky part is managing when to build caches. Previously we were re-building them for every state that required replay, which was obviously sub-optimal. Now we ensure that any state in the historic state cache has all caches built, which makes replay faster at the expense of slowing down the loading of snapshot/diff states.

On the test machine the loads for snapshot/diff states are now around 6-7s (ouch), but each subsequent state is more like 0.5-1s (on average 700ms).

I thought about moving the cache build from when the snapshot/diff state is loaded to the next slot, which would keep the total amortised time the same, but reduce the worst-case time for the diff state by several seconds (maybe from 6-7s down to 3s). It seems cache building only takes 1.8-2.5s in the worst case, so there's still some time going missing for the snapshot state.

@michaelsproul
Copy link
Member Author

Block replay is quick now with built caches, although it's a bit slower at epoch boundaries:

Oct 11 06:27:33.922 DEBG Replayed blocks, replay_time_ms: 10, target_slot: 8999423, service: freezer_db, module: store::hot_cold_store:1865
Oct 11 06:27:35.536 DEBG Replayed blocks, replay_time_ms: 836, target_slot: 8999424, service: freezer_db, module: store::hot_cold_store:1865

In this commit (52a9066) I made it so that we replay 1 block in preference to applying a diff for states that would normally be constructed by diffing. This may be the wrong choice, as applying a diff should be ~150ms, and converting HDiffBuffer -> state should be only ~400ms or so, for 550ms total. Although this doesn't mesh with the benchmark results from prior to this commit, where diff states were taking around 4-6s! Maybe cloning the several hundred MB of hdiff buffer is taking more time than we think (could be good to measure this). Another advantage of favouring replay where possible (i.e. for sequential access) is that it minimises memory fragmentation.

@michaelsproul michaelsproul added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Oct 14, 2024
@@ -1992,6 +1992,8 @@ pub fn scrape_for_metrics<T: BeaconChainTypes>(beacon_chain: &BeaconChain<T>) {
.canonical_head
.fork_choice_read_lock()
.scrape_for_metrics();

beacon_chain.store.register_metrics();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a regression from a looong time ago.

@michaelsproul
Copy link
Member Author

I think this is ready for review.

Some further optimisations are described here, but I think current performance is now very good.

beacon_node/store/src/historic_state_cache.rs Show resolved Hide resolved
beacon_node/store/src/historic_state_cache.rs Show resolved Hide resolved
beacon_node/store/src/hdiff.rs Outdated Show resolved Hide resolved
beacon_node/store/src/historic_state_cache.rs Show resolved Hide resolved
Comment on lines +1777 to +1782
metrics::inc_counter(&metrics::STORE_BEACON_HISTORIC_STATE_CACHE_MISS);

return self.load_cold_state_by_slot_using_replay(cached_state, slot);
}

metrics::inc_counter(&metrics::STORE_BEACON_HISTORIC_STATE_CACHE_MISS);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two misses are different. Technically we got a hit as the cache is allowing us to skip some work. Maybe we can have a metric STORE_BEACON_HISTORIC_STATE_CACHE_RESULT with a label:

  • hit_exact
  • hit_replay
  • miss

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think about this some more. Will merge as-is for now

@michaelsproul michaelsproul merged commit ab9c275 into sigp:tree-states-archive Oct 16, 2024
22 of 23 checks passed
@michaelsproul michaelsproul deleted the new-historic-state-cache branch October 16, 2024 23:19
mergify bot pushed a commit that referenced this pull request Nov 18, 2024
* Start extracting freezer changes for tree-states

* Remove unused config args

* Add comments

* Remove unwraps

* Subjective more clear implementation

* Clean up hdiff

* Update xdelta3

* Tree states archive metrics (#6040)

* Add store cache size metrics

* Add compress timer metrics

* Add diff apply compute timer metrics

* Add diff buffer cache hit metrics

* Add hdiff buffer load times

* Add blocks replayed metric

* Move metrics to store

* Future proof some metrics

---------

Co-authored-by: Michael Sproul <[email protected]>

* Port and clean up forwards iterator changes

* Add and polish hierarchy-config flag

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Cleaner errors

* Fix beacon_chain test compilation

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Patch a few more freezer block roots

* Fix genesis block root bug

* Fix test failing due to pending updates

* Beacon chain tests passing

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix doc lint

* Implement DB schema upgrade for hierarchical state diffs (#6193)

* DB upgrade

* Add flag

* Delete RestorePointHash

* Update docs

* Update docs

* Implement hierarchical state diffs config migration (#6245)

* Implement hierarchical state diffs config migration

* Review PR

* Remove TODO

* Set CURRENT_SCHEMA_VERSION correctly

* Fix genesis state loading

* Re-delete some PartialBeaconState stuff

---------

Co-authored-by: Michael Sproul <[email protected]>

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix test compilation

* Update schema downgrade test

* Fix tests

* Fix null anchor migration

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix tree states upgrade migration (#6328)

* Towards crash safety

* Fix compilation

* Move cold summaries and state roots to new columns

* Rename StateRoots chunked field

* Update prune states

* Clean hdiff CLI flag and metrics

* Fix "staged reconstruction"

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix alloy issues

* Fix staged reconstruction logic

* Prevent weird slot drift

* Remove "allow" flag

* Update CLI help

* Remove FIXME about downgrade

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Remove some unnecessary error variants

* Fix new test

* Tree states archive - review comments and metrics (#6386)

* Review PR comments and metrics

* Comments

* Add anchor metrics

* drop prev comment

* Update metadata.rs

* Apply suggestions from code review

---------

Co-authored-by: Michael Sproul <[email protected]>

* Update beacon_node/store/src/hot_cold_store.rs

Co-authored-by: Lion - dapplion <[email protected]>

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Clarify comment and remove anchor_slot garbage

* Simplify database anchor (#6397)

* Simplify database anchor

* Update beacon_node/store/src/reconstruct.rs

* Add migration for anchor

* Fix and simplify light_client store tests

* Fix incompatible config test

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* More metrics

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* New historic state cache (#6475)

* New historic state cache

* Add more metrics

* State cache hit rate metrics

* Fix store metrics

* More logs and metrics

* Fix logger

* Ensure cached states have built caches :O

* Replay blocks in preference to diffing

* Two separate caches

* Distribute cache build time to next slot

* Re-plumb historic-state-cache flag

* Clean up metrics

* Update book

* Update beacon_node/store/src/hdiff.rs

Co-authored-by: Lion - dapplion <[email protected]>

* Update beacon_node/store/src/historic_state_cache.rs

Co-authored-by: Lion - dapplion <[email protected]>

---------

Co-authored-by: Lion - dapplion <[email protected]>

* Update database docs

* Update diagram

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Update lockbud to work with bindgen/etc

* Correct pkg name for Debian

* Remove vestigial epochs_per_state_diff

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Markdown lint

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Address Jimmy's review comments

* Simplify ReplayFrom case

* Fix and document genesis_state_root

* Typo

Co-authored-by: Jimmy Chen <[email protected]>

* Merge branch 'unstable' into tree-states-archive

* Compute diff of validators list manually (#6556)

* Split hdiff computation

* Dedicated logic for historical roots and summaries

* Benchmark against real states

* Mutated source?

* Version the hdiff

* Add lighthouse DB config for hierarchy exponents

* Tidy up hierarchy exponents flag

* Apply suggestions from code review

Co-authored-by: Michael Sproul <[email protected]>

* Address PR review

* Remove hardcoded paths in benchmarks

* Delete unused function in benches

* lint

---------

Co-authored-by: Michael Sproul <[email protected]>

* Test hdiff binary format stability (#6585)

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Add deprecation warning for SPRP

* Update xdelta to get rid of duplicate deps

* Document test
chong-he pushed a commit to chong-he/lighthouse that referenced this pull request Nov 26, 2024
* Start extracting freezer changes for tree-states

* Remove unused config args

* Add comments

* Remove unwraps

* Subjective more clear implementation

* Clean up hdiff

* Update xdelta3

* Tree states archive metrics (sigp#6040)

* Add store cache size metrics

* Add compress timer metrics

* Add diff apply compute timer metrics

* Add diff buffer cache hit metrics

* Add hdiff buffer load times

* Add blocks replayed metric

* Move metrics to store

* Future proof some metrics

---------

Co-authored-by: Michael Sproul <[email protected]>

* Port and clean up forwards iterator changes

* Add and polish hierarchy-config flag

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Cleaner errors

* Fix beacon_chain test compilation

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Patch a few more freezer block roots

* Fix genesis block root bug

* Fix test failing due to pending updates

* Beacon chain tests passing

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix doc lint

* Implement DB schema upgrade for hierarchical state diffs (sigp#6193)

* DB upgrade

* Add flag

* Delete RestorePointHash

* Update docs

* Update docs

* Implement hierarchical state diffs config migration (sigp#6245)

* Implement hierarchical state diffs config migration

* Review PR

* Remove TODO

* Set CURRENT_SCHEMA_VERSION correctly

* Fix genesis state loading

* Re-delete some PartialBeaconState stuff

---------

Co-authored-by: Michael Sproul <[email protected]>

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix test compilation

* Update schema downgrade test

* Fix tests

* Fix null anchor migration

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix tree states upgrade migration (sigp#6328)

* Towards crash safety

* Fix compilation

* Move cold summaries and state roots to new columns

* Rename StateRoots chunked field

* Update prune states

* Clean hdiff CLI flag and metrics

* Fix "staged reconstruction"

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Fix alloy issues

* Fix staged reconstruction logic

* Prevent weird slot drift

* Remove "allow" flag

* Update CLI help

* Remove FIXME about downgrade

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Remove some unnecessary error variants

* Fix new test

* Tree states archive - review comments and metrics (sigp#6386)

* Review PR comments and metrics

* Comments

* Add anchor metrics

* drop prev comment

* Update metadata.rs

* Apply suggestions from code review

---------

Co-authored-by: Michael Sproul <[email protected]>

* Update beacon_node/store/src/hot_cold_store.rs

Co-authored-by: Lion - dapplion <[email protected]>

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Clarify comment and remove anchor_slot garbage

* Simplify database anchor (sigp#6397)

* Simplify database anchor

* Update beacon_node/store/src/reconstruct.rs

* Add migration for anchor

* Fix and simplify light_client store tests

* Fix incompatible config test

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* More metrics

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* New historic state cache (sigp#6475)

* New historic state cache

* Add more metrics

* State cache hit rate metrics

* Fix store metrics

* More logs and metrics

* Fix logger

* Ensure cached states have built caches :O

* Replay blocks in preference to diffing

* Two separate caches

* Distribute cache build time to next slot

* Re-plumb historic-state-cache flag

* Clean up metrics

* Update book

* Update beacon_node/store/src/hdiff.rs

Co-authored-by: Lion - dapplion <[email protected]>

* Update beacon_node/store/src/historic_state_cache.rs

Co-authored-by: Lion - dapplion <[email protected]>

---------

Co-authored-by: Lion - dapplion <[email protected]>

* Update database docs

* Update diagram

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Update lockbud to work with bindgen/etc

* Correct pkg name for Debian

* Remove vestigial epochs_per_state_diff

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Markdown lint

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Address Jimmy's review comments

* Simplify ReplayFrom case

* Fix and document genesis_state_root

* Typo

Co-authored-by: Jimmy Chen <[email protected]>

* Merge branch 'unstable' into tree-states-archive

* Compute diff of validators list manually (sigp#6556)

* Split hdiff computation

* Dedicated logic for historical roots and summaries

* Benchmark against real states

* Mutated source?

* Version the hdiff

* Add lighthouse DB config for hierarchy exponents

* Tidy up hierarchy exponents flag

* Apply suggestions from code review

Co-authored-by: Michael Sproul <[email protected]>

* Address PR review

* Remove hardcoded paths in benchmarks

* Delete unused function in benches

* lint

---------

Co-authored-by: Michael Sproul <[email protected]>

* Test hdiff binary format stability (sigp#6585)

* Merge remote-tracking branch 'origin/unstable' into tree-states-archive

* Add deprecation warning for SPRP

* Update xdelta to get rid of duplicate deps

* Document test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization Something to make Lighthouse run more efficiently. ready-for-review The code is ready for review tree-states Upcoming state and database overhaul
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants