Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checksumming state store implementation #16257

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

tgross
Copy link
Member

@tgross tgross commented Feb 24, 2023

Nomad's in-memory state store is subject to corruption when an object that's
read from memdb is mutated. Depending on memdb's (undocumented) isolation
guarantees, there may be a narrow number of cases where an object that's been
inserted can be safely mutated in the same transaction before the transaction is
committed. But in general the rule is that you always need to deep copy the
object before mutating. This class of bug is hard to track down, leading to
"spooky action at a distance" when servers have diverging in-memory states, and
the potential existence of these bugs undermines our confidence in understanding
the system.

This changeset extends the FSM interface wrapper started by the event stream
work to allow checksumming. When enabled, the wrapper checksums each object on
insert and writes it to a table of checksums. On read, the wrapper checksums the
object again and looks for a matching previously-recorded checksum. If there's
no match, the object has been mutated outside of an insert.

Checksumming slows down the state store quite a bit (4x latency for
typical RPCs with multiple queries like JobRegister), so this will only ever
be intended for testing and developer debugging.

The process of going through the whole code base to detect these problems is
going to land in several PRs, as we frequently corrupt the state store in test
code. This PR gets the nomad/state package tests all passing with the
checksumming state store.


Notes to reviewers:

  • I've got this in draft for now as I'd like to test out a different hash that @lgfa29 brought up. But this should be good enough to discuss the overall design.
  • I know at a glance this PR looks huge but a lot of the changed lines are swapping *txn pointers for for Txn interfaces. 😀
  • I'm seeing a few tests outside the nomad/state package are currently failing because they're grabbing the state.TestStateStore. Will fix that on Monday.
  • My very rough working branch where I'm pulling changesets out of is statestore-checksumming.

Nomad's in-memory state store is subject to corruption when an object that's
read from memdb is mutated. Depending on memdb's (undocumented) isolation
guarantees, there may be a narrow number of cases where an object that's been
inserted can be safely mutated in the same transaction before the transaction is
committed. But in general the rule is that you always need to deep copy the
object before mutating. This class of bug is hard to track down, leading to
"spooky action at a distance" when servers have diverging in-memory states, and
the potential existence of these bugs undermines our confidence in understanding
the system.

This changeset extends the FSM interface wrapper started by the event stream
work to allow checksumming. When enabled, the wrapper checksums each object on
insert and writes it to a table of checksums. On read, the wrapper checksums the
object again and looks for a matching previously-recorded checksum. If there's
no match, the object has been mutated outside of an insert.

The process of going through the whole code base to detect these problems is
going to land in several PRs, as we frequently corrupt the state store in test
code. Checksumming slows down the state store quite a bit (4x latency for
typical RPCs with multiple queries like `JobRegister`), so this will only ever
be intended for testing and developer debugging.
Get the `nomad/state` package tests all passing with the checksumming state
store. This work will be broken across multiple PRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows theme/testing Test related issues type/enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant