Skip to content

Commit

Permalink
Use read_recursive locks in database (#2417)
Browse files Browse the repository at this point in the history
## Issue Addressed

Closes #2245

## Proposed Changes

Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.

## Additional Info

* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
  • Loading branch information
michaelsproul committed Jul 12, 2021
1 parent b3c7e59 commit 371c216
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions beacon_node/store/src/hot_cold_store.rs
Original file line number Diff line number Diff line change
Expand Up @@ -721,7 +721,7 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
let high_restore_point_idx = low_restore_point_idx + 1;

// Acquire the read lock, so that the split can't change while this is happening.
let split = self.split.read();
let split = self.split.read_recursive();

let low_restore_point = self.load_restore_point_by_index(low_restore_point_idx)?;
// If the slot of the high point lies outside the freezer, use the split state
Expand Down Expand Up @@ -883,7 +883,7 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>

/// Fetch a copy of the current split slot from memory.
pub fn get_split_slot(&self) -> Slot {
self.split.read().slot
self.split.read_recursive().slot
}

/// Fetch the slot of the most recently stored restore point.
Expand Down Expand Up @@ -1056,7 +1056,7 @@ pub fn migrate_database<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>>(
// 0. Check that the migration is sensible.
// The new frozen head must increase the current split slot, and lie on an epoch
// boundary (in order for the hot state summary scheme to work).
let current_split_slot = store.split.read().slot;
let current_split_slot = store.split.read_recursive().slot;

if frozen_head.slot() < current_split_slot {
return Err(HotColdDBError::FreezeSlotError {
Expand Down

0 comments on commit 371c216

Please sign in to comment.