Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IndexedJsonDocument, a JSONWrapper implementation that stores JSON documents in a prolly tree with probabilistic hashing. #7912

Merged
merged 42 commits into from
Jun 12, 2024

Conversation

nicktobey
Copy link
Contributor

tl;dr: We store a JSON document in a prolly tree, where the leaf nodes of the tree are blob nodes with each contain a fragment of the document, and the intermediate nodes are address map nodes, where the keys describe a JSONPath.

The new logic for reading and writing JSON documents is cleanly separated into the following files:

IndexedJsonDocument - The new JSONWrapper implementation. It holds the root hash of the prolly tree.

JsonChunker - A wrapper around a regular chunker. Used to write new JSON documents or apply edits to existing documents.

JsonCursor - A wrapper around a regular cursor, with added functionality allowing callers to seek to a specific location in the document.

JsonScanner - A custom JSON parser that tracks that current JSONPath.

JsonLocation - A custom representation of a JSON path suitable for use as a prolly tree key.

Each added file has additional documentation with more details about the individual components.

Throughout every iteration of this project, the core idea has always been to represent a JSON document as a mapping from JSONPath locations to the values stored at those locations, then we could store that map in a prolly tree and get all the benefits that we currently get from storing tables in prolly trees: fast diffing and merging, fast point lookups and mutations, etc.

This goal has three major challenges:

  • For deeply nested JSON documents, simply listing every JSONPath requires asymptotically more space than the original document.
  • We need to do this in a way that doesn't compromise performance on simply reading JSON documents from a table, which I understand is the most common use pattern.
  • Ideally, users should not need to migrate their databases, or update their clients in order to read newer dbs, or have to choose between different configurations based on their use case.

This design achieves all three of these requirements:

  • While it requires additional storage, this additional storage cannot exceed the size of the original document, and is in practice much smaller.
  • It has indistinguishable performance for reading JSON documents from storage, while also allowing asymptotically faster diff and merge operations when the size of the changes is much smaller than the size of the document. (There is a cost: initial inserts of JSON documents are currently around 20% slower, but this is a one-time cost that does not impact subsequent reads and could potentially be optimized further.)
  • Documents written by the new JSONChunker are backwards compatible with current Dolt binaries and can be read back by existing versions of Dolt. (Although they will have different hashes than equivalent documents that those versions would write.)

@nicktobey nicktobey requested a review from reltuk May 28, 2024 21:12
Copy link
Contributor

@reltuk reltuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks awesome and a really cool approach. Not too mind bending after starting at it for a while. I took a pass on some suggestions, but I may be missing some things. Please push back if I'm off base anywhere :).

@@ -217,6 +217,9 @@ func newLeafCursorAtKey[K ~[]byte, O Ordering[K]](ctx context.Context, ns NodeSt
// searchForKey returns a SearchFn for |key|.
func searchForKey[K ~[]byte, O Ordering[K]](key K, order O) SearchFn {
return func(nd Node) (idx int) {
if nd.keys.IsEmpty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is nd.Count() on one of these nodes? Why is this special case necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the flattened leaf nodes. They contain 0 keys and 1 value. If we don't have this check, then the binary search will attempt to compare the search key to key[0] in the node, which will be out of bounds.

I added a comment.

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
f820432 ok 5937457
version total_tests
f820432 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
443e348 ok 5937457
version total_tests
443e348 5937457
correctness_percentage
100.0

I don't like this, but the alternative is adding a context parameter to ToInterface, which would then propagate through *hundreds* of files. We want to do that eventually, but this is an acceptable stopgap.
@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
1e72313 ok 5937457
version total_tests
1e72313 5937457
correctness_percentage
100.0

@nicktobey nicktobey requested a review from tbantle22 as a code owner June 12, 2024 02:12
@nicktobey nicktobey force-pushed the nicktobey/json-addressmap branch from 8023dad to 303526d Compare June 12, 2024 02:13
@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
8113baa ok 5937457
version total_tests
8113baa 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
c79d151 ok 5937457
version total_tests
c79d151 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
87ac817 ok 5937457
version total_tests
87ac817 5937457
correctness_percentage
100.0

@tbantle22 tbantle22 removed their request for review June 12, 2024 19:03
@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
e4a6d6e ok 5937457
version total_tests
e4a6d6e 5937457
correctness_percentage
100.0

@nicktobey nicktobey merged commit c660813 into main Jun 12, 2024
21 checks passed
@nicktobey nicktobey deleted the nicktobey/json-addressmap branch June 12, 2024 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants