
Make runtime_host operate over the entire trie rather than only over the storage entries of the trie #639

Merged · Jun 3, 2023 · 62 commits

Conversation


tomaka commented May 30, 2023

Context

The runtime_host module calls the Wasm runtime. When the Wasm runtime wants to access a storage entry, a StorageGet is generated, and when it iterates over keys, a StorageNextKey is generated.
In parallel with the execution, the runtime_host maintains a "trie cache" that contains the structure of the storage trie. When the runtime writes to the storage, the runtime_host updates the structure contained in this cache and invalidates the hashes of the trie nodes that are cached.
At the end of the operation, the runtime_host walks this trie structure and recalculates all the hashes that have been invalidated.
The user can choose not to provide a cache, in which case the runtime_host iterates over the entire trie (using a StoragePrefixKeys) in order to populate one.
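To make this concrete, here is a minimal, self-contained sketch of that request/response flow. The `Request` and `Response` enums and the `BTreeMap`-backed answers are illustrative stand-ins, not the actual runtime_host types:

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Illustrative model of the pre-PR interface (stand-in types, not smoldot's
// actual `runtime_host` API): execution surfaces as a sequence of requests
// that the API user must answer before resuming.
enum Request {
    /// The runtime reads one storage value.
    StorageGet(Vec<u8>),
    /// The runtime asks for the storage key immediately following `key`.
    StorageNextKey(Vec<u8>),
    /// Emitted when no trie cache was provided: list every key under the
    /// prefix. With an empty prefix, that means every key of the trie.
    StoragePrefixKeys(Vec<u8>),
}

enum Response {
    Value(Option<Vec<u8>>),
    NextKey(Option<Vec<u8>>),
    Keys(Vec<Vec<u8>>),
}

/// How a full node holding the whole state in memory could answer each
/// request. A light client cannot answer `StoragePrefixKeys` at all, which is
/// the second problem listed below.
fn answer(req: &Request, state: &BTreeMap<Vec<u8>, Vec<u8>>) -> Response {
    match req {
        Request::StorageGet(key) => Response::Value(state.get(key).cloned()),
        Request::StorageNextKey(key) => Response::NextKey(
            state
                .range::<Vec<u8>, _>((Bound::Excluded(key), Bound::Unbounded))
                .next()
                .map(|(k, _)| k.clone()),
        ),
        Request::StoragePrefixKeys(prefix) => Response::Keys(
            state
                .keys()
                .filter(|k| k.starts_with(prefix.as_slice()))
                .cloned()
                .collect(),
        ),
    }
}
```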

There are a few problems with this system:

  • The trie structure can in theory be very big, and you're not supposed to store it entirely in memory. I can't easily measure its size at the head of the Polkadot chain, but if the trie continues to grow, smoldot could eventually run out of memory. While this isn't a problem in practice yet, and while this PR doesn't solve the problem entirely (we still maintain a list of changes during the execution), it is a step towards a solution.

  • If no cache is provided, a StoragePrefixKeys is generated with a prefix of &[] in order to fetch the list of every single key in the trie. If the execution is performed by a light client, it is not possible to know the list of all keys, and we have no choice but to fail the execution. This is the error that happens in RPC call returns "PrefixKeysForbidden" error code #272.

  • In the context of a full node, when we're at the head of the chain we might have to verify blocks whose parents aren't the best block. This means that we have to maintain a copy of this cache (and thus of the entire trie structure) for every block that can potentially have children (meaning: the finalized block and every single descendant). If finality stalls for some reason, there can be several thousand such blocks, and thus several thousand copies of the trie structure, which can be quite disastrous.

  • The cache is not adapted to child trie support (Implement support for child tries #166).

What the PR does

This PR refactors this system by asking the user of the runtime_host (and thus all the layers on top of it) to return information about the trie structure rather than only about the storage values. In other words, it is the responsibility of the API user of the runtime_host to maintain the trie structure.

The precise API changes are as follows:

  • The main_trie_calculation_cache has been removed everywhere.
  • NextKey now provides a key() and prefix() as a list of nibbles rather than a list of bytes. The value that is provided back must also be a list of nibbles rather than a list of bytes (see the nibble sketch after this list).
  • NextKey now has a branch_nodes() function. If it returns true, then the API user must return the next trie node (no matter whether it is a branch node or a storage node). If it returns false, then the API user must return the next trie node that contains a storage value (i.e. the behavior before this PR).
  • The PrefixKeys variant has been removed entirely.
  • There is now a new MerkleValue variant that requires the API user to provide the Merkle value of a trie node that was previously provided to NextKey. The API user can use resume_unknown in order to indicate that it doesn't know this Merkle value and that it must be recalculated no matter what.
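To make the nibble representation concrete, here is a small sketch of the conversion in both directions. It assumes the usual most-significant-half-first ordering; the helpers are purely illustrative, since smoldot's trie code has its own primitives for this:

```rust
/// Converts a storage key (bytes) into the nibble string that `NextKey::key()`
/// and `prefix()` now use. Each byte becomes two nibbles, most significant
/// half first.
fn bytes_to_nibbles(key: &[u8]) -> Vec<u8> {
    key.iter().copied().flat_map(|b| [b >> 4, b & 0x0f]).collect()
}

/// Converts a nibble string back into bytes. This is only possible when the
/// length is even; an odd-length trie node position can never correspond to a
/// full storage key.
fn nibbles_to_bytes(nibbles: &[u8]) -> Option<Vec<u8>> {
    if nibbles.len() % 2 != 0 {
        return None;
    }
    Some(nibbles.chunks(2).map(|pair| (pair[0] << 4) | pair[1]).collect())
}
```

An odd-length position can only ever be a branch node without a storage value, which is part of why the API has to talk in nibbles rather than bytes.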

In the future, another change I'd like to make is: once the runtime call is successful, the list of all changes to the trie structure should be provided (including the branch nodes that have been inserted or deleted), rather than just the changes to the storage entries.

The consensus_service is directly impacted by this change. However, it doesn't store the trie structure in the database (as that change would be too big for one PR) but instead dynamically determines what the next branch node is by performing multiple database lookups.
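To illustrate why this is possible, note that in a radix-16 trie the set of node positions is exactly the set of storage keys (as nibbles) plus the longest common prefix of every pair of lexicographically consecutive keys. The sketch below is not the algorithm used in consensus_service.rs (which performs a few targeted lookups per request rather than materializing the whole node set); it just shows that branch-node positions are derivable from a plain ordered key-value store:

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Storage key (bytes) to nibble string, most significant half first.
fn bytes_to_nibbles(key: &[u8]) -> Vec<u8> {
    key.iter().copied().flat_map(|b| [b >> 4, b & 0x0f]).collect()
}

/// Longest common prefix of two nibble strings.
fn common_prefix(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b).take_while(|(x, y)| x == y).map(|(x, _)| *x).collect()
}

/// All node positions of the trie built over `storage_keys`: the keys
/// themselves, plus the longest common prefix of every pair of consecutive
/// keys (those prefixes are exactly where the branch nodes sit).
fn trie_node_positions(storage_keys: &[Vec<u8>]) -> BTreeSet<Vec<u8>> {
    let mut keys: Vec<Vec<u8>> = storage_keys.iter().map(|k| bytes_to_nibbles(k)).collect();
    keys.sort();
    let mut nodes: BTreeSet<Vec<u8>> = keys.iter().cloned().collect();
    for pair in keys.windows(2) {
        nodes.insert(common_prefix(&pair[0], &pair[1]));
    }
    nodes
}

/// Answer to "next trie node after `after`", including branch nodes, as
/// `NextKey` with `branch_nodes() == true` now requires.
fn next_trie_node(nodes: &BTreeSet<Vec<u8>>, after: &[u8]) -> Option<Vec<u8>> {
    nodes
        .range::<[u8], _>((Bound::Excluded(after), Bound::Unbounded))
        .next()
        .cloned()
}
```

The real code instead derives each next node on the fly from a handful of database lookups, as described above, which is what keeps memory usage bounded.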

I have considered adding a system to the runtime_host so that it internally does what the consensus_service does. For example, there could be a configuration option called branch_nodes_unknown which, when set to true, would guarantee that NextKey::branch_nodes() is always false. However, I think that this would make for an awkward API, especially w.r.t. the future change mentioned above where the diff of the trie structure is also provided on success.

cc @xlc w.r.t. Chopsticks
I'm opening this as a draft PR in order to make you aware of this change and maybe gather an opinion. I think that you should be able to update Chopsticks to use the same approach as consensus_service.rs in this diff, which is to dynamically determine the branch nodes.

This PR is currently a draft and it's failing due to a storage root mismatch (in other words there's a bug somewhere).

cc #272 #126


xlc commented May 30, 2023

Thanks for the heads-up. Will keep an eye on this.


tomaka commented Jun 1, 2023

The state root mismatch is still there.
I would call the code in trie_root_calculator correct, given that the fuzzing test should cover every possible scenario and always passes.

This means the bug is most likely in the consensus_service code, but I can't find what it could be, and writing tests for this is rather difficult.


tomaka commented Jun 2, 2023

The full node now works, but has become insanely slow due to the removal of the cache. Verifying Polkadot block 1 takes 237 seconds on my machine for example. Blocks 2 to 6 take on average 350 seconds each.
This is because each call to Database::block_storage_main_trie_next_key takes around 1ms to 10ms, and we iterate over the entire trie with calls to next_key.

I believe that before this PR this wasn't as much of a problem, because we read all the keys at once with a single database call.

Most of the slowness might be caused by the fact that SQLite has to re-parse our SQL at every single call. Unfortunately, it's not possible to prepare statements in advance because the library we use has unnecessarily strict lifetime restrictions. cc #632
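For illustration, this is the kind of statement reuse that would avoid the re-parsing cost. The sketch uses the rusqlite crate and an invented storage(key, value) table purely as stand-ins; it is not the binding or schema that the full node actually uses:

```rust
use rusqlite::{params, Connection, Result};

/// Hypothetical replacement for `block_storage_main_trie_next_key` that avoids
/// re-parsing the SQL on every call. The table name and schema are invented
/// for the sake of the example.
fn next_key(conn: &Connection, after: &[u8]) -> Result<Option<Vec<u8>>> {
    // `prepare_cached` compiles the statement once and reuses it on subsequent
    // calls with the same SQL string, instead of re-parsing it every time.
    let mut stmt = conn.prepare_cached(
        "SELECT key FROM storage WHERE key > ?1 ORDER BY key ASC LIMIT 1",
    )?;
    let mut rows = stmt.query(params![after])?;
    Ok(rows.next()?.map(|row| row.get(0)).transpose()?)
}
```

Whether something equivalent is possible with the current binding depends on the lifetime issue tracked in #632.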


tomaka commented Jun 2, 2023

I'm still going to go ahead with this PR even if it "breaks" the full node, because #272 is more important to fix.
I know that the changes in this PR only make the full node slower because the caches have been removed, and that once they're back it should, at least in theory, run at a normal speed again.

tomaka marked this pull request as ready for review June 3, 2023 09:31