feat: add adr-001 for node key refactoring #608
# ADR-001: Node Key Refactoring

## Changelog

- 2022-10-31: First draft

## Status

Proposed

## Context

The original key format of IAVL nodes is the hash of the node. It does not take advantage of data locality on disk: because hash values are effectively random, nodes end up scattered across random disk locations, and finding a node can require scanning the disk, which is very inefficient.

The current design uses `orphans` to manage node removal; they make it possible to delete the nodes removed at a specific version from disk through the `DeleteVersion` API. Orphans must be tracked on every tree update and require extra storage, yet there are only two real use cases for `DeleteVersion`:

1. Rollback of the tree to a previous version
2. Removal of unnecessary old nodes

## Decision

- Use a sequenced integer ID as the node key, encoded in `bigendian(nodeKey)` format.
- Remove the `leftHash` and `rightHash` fields, and store a `hash` field instead.
- Remove the `orphans` from the tree.

Theoretically, we could also remove the `version` field from the node structure, but doing so would break the ics23 proof mechanism, so we will revisit it later.

New node structure:

```go
type Node struct {
	key   []byte
	value []byte
	hash  []byte // keep this field in the storage

	nodeKey      int64 // new field, used as the node key
	leftNodeKey  int64 // new field, needs to be stored in the storage
	rightNodeKey int64 // new field, needs to be stored in the storage

	version       int64
	size          int64
	leftNode      *Node
	rightNode     *Node
	subtreeHeight int8
	persisted     bool
}
```

New tree structure:

```go
type MutableTree struct {
	*ImmutableTree
	lastSaved                *ImmutableTree
	nonce                    int64 // new field to track the current ID
	versions                 map[int64]bool
	allRootLoaded            bool
	unsavedFastNodeAdditions map[string]*fastnode.Node
	unsavedFastNodeRemovals  map[string]interface{}
	ndb                      *nodeDB
	skipFastStorageUpgrade   bool

	mtx sync.Mutex
}
```

### Migration

We can migrate nodes one by one by iterating over the versions:

- Iterate the versions in order, and get the root node of each version.
- Iterate the tree from that root and assign a `nodeKey` to each node whose version equals the current version.
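The per-version assignment step can be sketched roughly as follows (the simplified `Node` and the preorder traversal order are assumptions; in IAVL a node's children never have a newer version than the node itself, so the recursion can stop at the first version mismatch):

```go
package main

import "fmt"

// Node is a simplified stand-in for the IAVL node, keeping only what
// the sketch needs.
type Node struct {
	version     int64
	nodeKey     int64
	left, right *Node
}

// assignKeys walks the tree in preorder and gives a fresh sequenced
// nodeKey to every node created at exactly `version`; nodes of older
// versions already received keys when their own version was migrated.
func assignKeys(n *Node, version int64, next *int64) {
	if n == nil || n.version != version {
		return
	}
	n.nodeKey = *next
	*next++
	assignKeys(n.left, version, next)
	assignKeys(n.right, version, next)
}

func main() {
	root := &Node{version: 2, left: &Node{version: 1}, right: &Node{version: 2}}
	next := int64(10)
	assignKeys(root, 2, &next)
	// Only version-2 nodes get keys; the version-1 child keeps its old key.
	fmt.Println(root.nodeKey, root.right.nodeKey, root.left.nodeKey) // 10 11 0
}
```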

We will implement the `Import` functionality for the original version.

### Pruning

We introduce a new way to prune old versions.

For example, suppose a user wants to prune the previous 500 versions every 1000 blocks:

- Assume pruning has completed up to the `n`th version, and the last nonce of the `n`th version is `x`.
- Iterate the tree from the root node of the `n+501`th version and pick only the nodes whose `nodeKey` lies in the range `[first nonce of the (n+1)th version, last nonce of the (n+500)th version]`.
- We can delete the missing nodes instantly, or re-assign the `nodeKey`s of the picked nodes in order starting from `x+1`. Re-assignment should be done after stopping the node, but it can improve data locality.

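The range check at the heart of this traversal might look like the following (all names are hypothetical; this only illustrates the nonce-interval test described above):

```go
package main

import "fmt"

// keepRange holds the nonce interval of nodes that survive pruning:
// [first nonce of version n+1, last nonce of version n+500].
type keepRange struct {
	first, last int64
}

// shouldKeep reports whether a node with the given nodeKey falls inside
// the retained interval while walking from the n+501th root.
func (r keepRange) shouldKeep(nodeKey int64) bool {
	return nodeKey >= r.first && nodeKey <= r.last
}

func main() {
	r := keepRange{first: 1001, last: 6000}
	fmt.Println(r.shouldKeep(1500)) // true: inside the retained window
	fmt.Println(r.shouldKeep(900))  // false: belongs to an already-pruned version
}
```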
## Consequences

### Positive

Using a sequenced integer ID, keys are written in monotonically increasing order, which suits the ordered key layout of the underlying key-value stores (whether B-tree or LSM-based, where semi-sorted writes mean fewer compactions) and leads to performance improvements. It also reduces the node size in storage.

Removing orphans also provides performance improvements, saving both memory and storage, and makes it easy to roll back the tree: because we keep sequenced segment IDs for each version, we can simply remove all nodes whose `nodeKey` is greater than the specified integer value.

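Under this scheme a rollback reduces to a range delete. A minimal sketch, assuming a flat in-memory map standing in for the real node database:

```go
package main

import "fmt"

// rollback deletes every node whose nodeKey is greater than lastNonce,
// the last nonce of the version being rolled back to. The map is a
// hypothetical stand-in for the real nodeDB.
func rollback(store map[int64][]byte, lastNonce int64) {
	for k := range store {
		if k > lastNonce {
			delete(store, k)
		}
	}
}

func main() {
	store := map[int64][]byte{1: nil, 2: nil, 3: nil, 4: nil}
	rollback(store, 2) // keep everything up to nonce 2
	fmt.Println(len(store)) // 2
}
```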
|
||
### Negative | ||
It requires extra storage per node, because `leftNodeKey` and `rightNodeKey` must be kept in order to iterate the tree. In exchange, we can delete the `leftHash` and `rightHash` fields and reduce the key size; per the review discussion, this saves roughly 16 bytes in the node body and 20 bytes in the node key.

Because orphans are removed, it is no longer possible to delete the old nodes of one specific version.

## References

- https://github.com/cosmos/iavl/issues/548
- https://github.com/cosmos/iavl/issues/137
- https://github.com/cosmos/iavl/issues/571

---

**Review discussion (data locality):**

> Does "data locality" make any significant difference on SSDs? Moreover, the data layout will depend on the underlying DB structure (goleveldb, rocksdb, ...), and most likely it will be different from what we think. Do we have a benchmark proving the statement?

> See #597 (comment); maybe it doesn't fully reflect the decisions above, since it only refactored the nodeKey as a global nonce.

> Data locality makes a difference in any LSM design, because an LSM key-value store performs compactions under the hood to remove duplicated entries.

> Data locality is the central problem for all computational work, including CPU caches, page tables, and RAM.

> "Data locality" may be a bit misleading here: even with the new key format we don't really have locality in most cases, because a node's children likely belong to different versions and are therefore still stored far apart. For me, the biggest advantage of the new format is that the key is monotonically increasing, so it should work better with kv-dbs, most of which store data in some kind of ordered tree and have optimizations for inserting ordered keys, for example rocksdb's FIFO compaction style and lmdb's APPEND mode. We could even simply dump the nodes into plain files, and they would still be queryable with some minimal metadata. EDIT: "data locality" does make some sense here, because a parent node's version is >= its children's; if we visit a recently added or updated key, the nodes on its path are also recent and therefore physically closer under the new key format. Assuming a recently updated key is more likely to be accessed again, this definitely improves locality.