checksumming state store implementation #16257
Draft: tgross wants to merge 2 commits into main from checksumming-state-store
Nomad's in-memory state store is subject to corruption when an object that's read from memdb is mutated. Depending on memdb's (undocumented) isolation guarantees, there may be a narrow number of cases where an object that's been inserted can be safely mutated in the same transaction before the transaction is committed. But in general the rule is that you always need to deep copy the object before mutating. This class of bug is hard to track down, leading to "spooky action at a distance" when servers have diverging in-memory states, and the potential existence of these bugs undermines our confidence in understanding the system.

This changeset extends the FSM interface wrapper started by the event stream work to allow checksumming. When enabled, the wrapper checksums each object on insert and writes it to a table of checksums. On read, the wrapper checksums the object again and looks for a matching previously-recorded checksum. If there's no match, the object has been mutated outside of an insert.

The process of going through the whole code base to detect these problems is going to land in several PRs, as we frequently corrupt the state store in test code.

Checksumming slows down the state store quite a bit (4x latency for typical RPCs with multiple queries like `JobRegister`), so this will only ever be intended for testing and developer debugging.
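The checksum-on-insert, verify-on-read idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in this changeset: the `Job` type, `checksumStore`, `Insert`, and `Get` are hypothetical names, the store is a plain map rather than memdb, and the checksum table is keyed by object ID for simplicity.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Job is a hypothetical stand-in for an object kept in the state store.
type Job struct {
	ID       string
	Priority int
}

// Copy returns a copy that is safe to mutate. Job has no reference
// fields, so a plain struct copy is already a deep copy here.
func (j *Job) Copy() *Job {
	c := *j
	return &c
}

// checksumStore is a hypothetical wrapper. Like memdb, it hands out the
// stored pointers directly; it also records a checksum at insert time.
type checksumStore struct {
	objects   map[string]*Job
	checksums map[string][sha256.Size]byte
}

func newChecksumStore() *checksumStore {
	return &checksumStore{
		objects:   map[string]*Job{},
		checksums: map[string][sha256.Size]byte{},
	}
}

// checksum hashes the JSON encoding of the object.
func checksum(j *Job) ([sha256.Size]byte, error) {
	buf, err := json.Marshal(j)
	if err != nil {
		return [sha256.Size]byte{}, err
	}
	return sha256.Sum256(buf), nil
}

// Insert stores the object and records its checksum.
func (s *checksumStore) Insert(j *Job) error {
	sum, err := checksum(j)
	if err != nil {
		return err
	}
	s.objects[j.ID] = j
	s.checksums[j.ID] = sum
	return nil
}

// Get re-checksums the stored object and compares it against the value
// recorded at insert. A mismatch means someone mutated it in place.
func (s *checksumStore) Get(id string) (*Job, error) {
	j, ok := s.objects[id]
	if !ok {
		return nil, fmt.Errorf("job %q not found", id)
	}
	sum, err := checksum(j)
	if err != nil {
		return nil, err
	}
	if sum != s.checksums[id] {
		return nil, fmt.Errorf("job %q was mutated outside of an insert", id)
	}
	return j, nil
}

func main() {
	s := newChecksumStore()
	for _, id := range []string{"buggy", "correct"} {
		if err := s.Insert(&Job{ID: id, Priority: 50}); err != nil {
			panic(err)
		}
	}

	// Buggy pattern: mutate the pointer returned by the store.
	j, _ := s.Get("buggy")
	j.Priority = 100
	_, err := s.Get("buggy")
	fmt.Println("in-place mutation:", err)

	// Correct pattern: copy, mutate the copy, then insert it.
	k, _ := s.Get("correct")
	k = k.Copy()
	k.Priority = 100
	if err := s.Insert(k); err != nil {
		panic(err)
	}
	_, err = s.Get("correct")
	fmt.Println("copy then insert:", err)
}
```

In this sketch the first read after an in-place mutation returns an error, while the copy-then-insert path reads back cleanly. Re-hashing on every read is also where the extra latency would come from, which fits keeping this kind of check to tests and debugging.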
Get the `nomad/state` package tests all passing with the checksumming state store. This work will be broken across multiple PRs.
107b110
to
c42e111
This was referenced Feb 27, 2023
This was referenced Nov 6, 2024
Labels
stage/needs-rebase
This PR needs to be rebased on main before it can be backported to pick up new BPA workflows
theme/testing
Test related issues
type/enhancement
Checksumming slows down the state store quite a bit (4x latency for
typical RPCs with multiple queries like `JobRegister`), so this will only ever
be intended for testing and developer debugging.

The process of going through the whole code base to detect these problems is
going to land in several PRs, as we frequently corrupt the state store in test
code. This PR gets the `nomad/state` package tests all passing with the
checksumming state store.
Notes to reviewers:
- … `*txn` pointers for `Txn` interfaces. 😀
- … `nomad/state` package are currently failing because they're grabbing the `state.TestStateStore`. Will fix that on Monday.