Add `IndexedJsonDocument`, a `JSONWrapper` implementation that stores JSON documents in a prolly tree with probabilistic hashing. #7912

nicktobey · 2024-05-28T20:56:43Z

tl;dr: We store a JSON document in a prolly tree. The leaf nodes of the tree are blob nodes, each of which contains a fragment of the document, and the intermediate nodes are address map nodes whose keys describe a JSONPath.

The new logic for reading and writing JSON documents is cleanly separated into the following files:

IndexedJsonDocument - The new JSONWrapper implementation. It holds the root hash of the prolly tree.

JsonChunker - A wrapper around a regular chunker. Used to write new JSON documents or apply edits to existing documents.

JsonCursor - A wrapper around a regular cursor, with added functionality allowing callers to seek to a specific location in the document.

JsonScanner - A custom JSON parser that tracks the current JSONPath.

JsonLocation - A custom representation of a JSON path suitable for use as a prolly tree key.

Each added file has additional documentation with more details about the individual components.
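To make the "parser that tracks the current JSONPath" idea concrete, here is a minimal sketch. It is not the `JsonScanner` from this PR: it leans on the standard library's `encoding/json` token stream instead of a custom parser, and `frame` and `scanPaths` are hypothetical names invented for the illustration. It only reports the path of each scalar value, in document order.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// frame describes one open container between the root and the current token.
type frame struct {
	isArray bool
	index   int    // index of the element being read (arrays only)
	key     string // most recently read key (objects only)
	wantKey bool   // true when the next string token in an object is a key
}

// scanPaths (hypothetical) streams tokens from doc and returns the JSONPath
// of every scalar value, in the order the values appear.
func scanPaths(doc string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(doc))
	var stack []frame
	var paths []string

	// current renders the open containers as a JSONPath string.
	current := func() string {
		var sb strings.Builder
		sb.WriteString("$")
		for _, f := range stack {
			if f.isArray {
				fmt.Fprintf(&sb, "[%d]", f.index)
			} else {
				sb.WriteString("." + f.key)
			}
		}
		return sb.String()
	}
	// advance marks one value in the innermost container as consumed.
	advance := func() {
		if n := len(stack); n > 0 {
			if stack[n-1].isArray {
				stack[n-1].index++
			} else {
				stack[n-1].wantKey = true
			}
		}
	}

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return paths, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case json.Delim:
			switch t {
			case '{':
				stack = append(stack, frame{wantKey: true})
			case '[':
				stack = append(stack, frame{isArray: true})
			default: // '}' or ']': the container itself was one value of its parent
				stack = stack[:len(stack)-1]
				advance()
			}
		default:
			if n := len(stack); n > 0 && !stack[n-1].isArray && stack[n-1].wantKey {
				stack[n-1].key = t.(string) // object keys are always strings
				stack[n-1].wantKey = false
				continue
			}
			paths = append(paths, current())
			advance()
		}
	}
}

func main() {
	paths, err := scanPaths(`{"a":[1,{"b":2}],"c":"x"}`)
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

The sketch ignores key escaping and empty containers; it is only meant to show how a path can be maintained alongside a streaming parse.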

Throughout every iteration of this project, the core idea has always been to represent a JSON document as a mapping from JSONPath locations to the values stored at those locations. We can then store that map in a prolly tree and get all the benefits that we currently get from storing tables in prolly trees: fast diffing and merging, fast point lookups and mutations, etc.
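As a rough illustration of "a mapping from JSONPath locations to values", the sketch below flattens a parsed document into an ordinary Go map. The helpers `flatten` and `pathMap` are hypothetical and exist only for this example; the PR itself stores the mapping in a prolly tree keyed by `JsonLocation`, not in an in-memory map.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatten records the JSONPath of every scalar under v into out.
func flatten(path string, v interface{}, out map[string]interface{}) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			flatten(path+"."+k, child, out)
		}
	case []interface{}:
		for i, child := range t {
			flatten(fmt.Sprintf("%s[%d]", path, i), child, out)
		}
	default:
		out[path] = t
	}
}

// pathMap (hypothetical) parses doc and returns its path -> value mapping.
func pathMap(doc string) (map[string]interface{}, error) {
	var v interface{}
	if err := json.Unmarshal([]byte(doc), &v); err != nil {
		return nil, err
	}
	out := make(map[string]interface{})
	flatten("$", v, out)
	return out, nil
}

func main() {
	m, err := pathMap(`{"a":{"b":[10,20]},"c":"x"}`)
	if err != nil {
		panic(err)
	}
	// Sort the paths so the output resembles an ordered key space.
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %v\n", k, m[k])
	}
	// Prints:
	// $.a.b[0] = 10
	// $.a.b[1] = 20
	// $.c = x
}
```

Once the document is viewed this way, a point lookup is a key lookup and a diff is a comparison of two ordered key ranges, which is the property the prolly tree is meant to provide.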

This goal has three major challenges:

For deeply nested JSON documents, simply listing every JSONPath requires asymptotically more space than the original document.
We need to do this in a way that doesn't compromise performance on simply reading JSON documents from a table, which I understand is the most common use pattern.
Ideally, users should not need to migrate their databases, or update their clients in order to read newer dbs, or have to choose between different configurations based on their use case.
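The first challenge is easy to see with a small calculation. The sketch below (the helper names `nestedDoc` and `totalPathBytes` are made up for the example) builds a document of the form `{"a":{"a":...1}}` and compares its size with the total size of the paths `$.a`, `$.a.a`, and so on. The document grows linearly with depth, while the naive path listing grows quadratically.

```go
package main

import (
	"fmt"
	"strings"
)

// nestedDoc builds {"a":{"a":...{"a":1}...}} nested to the given depth.
func nestedDoc(depth int) string {
	return strings.Repeat(`{"a":`, depth) + "1" + strings.Repeat("}", depth)
}

// totalPathBytes sums the lengths of the paths $.a, $.a.a, ... up to depth.
func totalPathBytes(depth int) int {
	total := 0
	for d := 1; d <= depth; d++ {
		total += 1 + 2*d // "$" followed by d copies of ".a"
	}
	return total
}

func main() {
	for _, depth := range []int{10, 100, 1000} {
		fmt.Printf("depth=%d doc=%d bytes, paths=%d bytes\n",
			depth, len(nestedDoc(depth)), totalPathBytes(depth))
	}
}
```

At depth 100 the document is 601 bytes, while listing every path takes 10200 bytes.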

This design addresses all three of these challenges:

While it requires additional storage, this additional storage cannot exceed the size of the original document, and is in practice much smaller.
It has indistinguishable performance for reading JSON documents from storage, while also allowing asymptotically faster diff and merge operations when the size of the changes is much smaller than the size of the document. (There is a cost: initial inserts of JSON documents are currently around 20% slower, but this is a one-time cost that does not impact subsequent reads and could potentially be optimized further.)
Documents written by the new JsonChunker are backwards compatible with current Dolt binaries and can be read back by existing versions of Dolt. (Although they will have different hashes than equivalent documents that those versions would write.)

reltuk

Generally looks awesome and a really cool approach. Not too mind bending after staring at it for a while. I took a pass on some suggestions, but I may be missing some things. Please push back if I'm off base anywhere :).

go/store/prolly/tree/node_cursor.go

reltuk · 2024-05-29T21:13:14Z

@@ -217,6 +217,9 @@ func newLeafCursorAtKey[K ~[]byte, O Ordering[K]](ctx context.Context, ns NodeSt
 // searchForKey returns a SearchFn for |key|.
 func searchForKey[K ~[]byte, O Ordering[K]](key K, order O) SearchFn {
 	return func(nd Node) (idx int) {
+		if nd.keys.IsEmpty() {


What is nd.Count() on one of these nodes? Why is this special case necessary?

These are the flattened leaf nodes. They contain 0 keys and 1 value. If we don't have this check, then the binary search will attempt to compare the search key to the first key in the node, and that access will be out of bounds.

I added a comment.
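A simplified sketch of why the guard matters is below. The `node` type is a stand-in invented for this example and is not the real `tree.Node`; in particular, the sketch assumes that the item count follows the number of values, which is what would make the unguarded search touch a key that does not exist.

```go
package main

import (
	"bytes"
	"fmt"
)

// node is a hypothetical, simplified stand-in for a prolly tree node.
type node struct {
	keys   [][]byte
	values [][]byte
}

// count reports the number of items. Assumption for this sketch: it follows
// the values, so a flattened leaf (0 keys, 1 value) reports 1.
func (nd node) count() int { return len(nd.values) }

// searchForKey returns a search function that finds the first index whose
// key is >= key, using a hand-written binary search.
func searchForKey(key []byte) func(nd node) int {
	return func(nd node) int {
		if len(nd.keys) == 0 {
			// Flattened leaf: there is nothing to compare against, and
			// nd.keys[0] below would be out of bounds without this guard.
			return 0
		}
		i, j := 0, nd.count()
		for i < j {
			h := int(uint(i+j) >> 1)
			if bytes.Compare(key, nd.keys[h]) <= 0 {
				j = h
			} else {
				i = h + 1
			}
		}
		return i
	}
}

func main() {
	leaf := node{values: [][]byte{[]byte(`{"a":1}`)}}
	inner := node{
		keys:   [][]byte{[]byte("a"), []byte("c"), []byte("e")},
		values: [][]byte{[]byte("1"), []byte("2"), []byte("3")},
	}
	fmt.Println(searchForKey([]byte("c"))(leaf))  // 0
	fmt.Println(searchForKey([]byte("d"))(inner)) // 2
}
```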

go/store/prolly/tree/node_builder.go

go/store/prolly/tree/node_splitter.go

go/store/prolly/tree/json_cursor.go

go/store/prolly/tree/json_chunker.go

coffeegoddd · 2024-06-04T00:05:41Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`f820432`	ok	5937457

version	total_tests
`f820432`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-06-04T01:02:40Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`443e348`	ok	5937457

version	total_tests
`443e348`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-06-10T17:01:45Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`1e72313`	ok	5937457

version	total_tests
`1e72313`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-06-12T03:35:30Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`8113baa`	ok	5937457

version	total_tests
`8113baa`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-06-12T07:04:39Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`c79d151`	ok	5937457

version	total_tests
`c79d151`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-06-12T07:11:41Z

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`87ac817`	ok	5937457

version	total_tests
`87ac817`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-06-12T19:12:25Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`e4a6d6e`	ok	5937457

version	total_tests
`e4a6d6e`	5937457

correctness_percentage
100.0

nicktobey added 11 commits May 28, 2024 13:19

Add JsonPathKey

f558e94

Add JsonScanner.

0996a9e

Add JsonCursor.

42f7b88

Add JsonChunker.

bbd5c99

Add IndexedJsonDocument.

da59a82

Use JsonSplitter when writing to Json columns.

bc65c09

Read an IndexedJsonDocument when reading from a table.

f61a93e

Handle reading and writing address maps when the leaf nodes are BLOBs.

58bc0da

Add dependency on github.com/mohae/uvarint

4f3ac0d

Allow NodeBuilder and NodeSplitter to ignore key length, for when we'…

6d7b614

…re representing the leaf node as a blob.

Assert that chunker implements its interface.

3ec7660

nicktobey requested a review from reltuk May 28, 2024 21:12

reltuk reviewed May 30, 2024

nicktobey added 9 commits June 1, 2024 18:10

Add Clone method to sql.JSONWrapper implementations.

75430ed

Improve jsonPathElementsFromMySQLJsonPath to fix bugs and improve err…

ef7ce49

…or messages.

Rework the scanner to add an additional scanner state / insert locati…

b434373

…on inside an object/array, before the first element.

Remove previousValueOffset and firstElement fields from JsonScann…

ade66ee

…er: they're not needed.

Handle additional corner cases in IndexedJsonDocument::Insert

71a98a2

Rename JsonScannerTest

9bfa19a

Add tests for indexed JSON_INSERT implementation.

73506d2

Don't panic when encountering impossible JSON: it's much more likely …

8f4856e

…to indicate a bug in the JSON functions, not data corruption.

Address PR feedback.

f820432

coffeegoddd added the correctness_approved label Jun 4, 2024

nicktobey and others added 3 commits June 3, 2024 17:11

Update go/store/prolly/tree/json_scanner.go

5c245e0

Co-authored-by: Aaron Son

Update go/store/prolly/tree/json_scanner.go

323bee6

Co-authored-by: Aaron Son

Inline some functions that are only called in once place and aren't s…

443e348

…afe to call elsewhere.

nicktobey added 2 commits June 10, 2024 09:26

Add context parameter to existing calls to jsonWrapper.clone

0aa8303

When creating an IndexedJsonDocument, capture the context in a closure.

dda2e04

I don't like this, but the alternative is adding a context parameter to ToInterface, which would then propagate through *hundreds* of files. We want to do that eventually, but this is an acceptable stopgap.

Update indexed document tests to use new GMS json test package

1e72313

nicktobey added 9 commits June 11, 2024 18:46

Add support to binlog serialization for sql.JSONWrapper implementatio…

4fdfe53

…ns other than JsonDocument.

Update tests to have correct result when casting JSON to string.

81f9282

Remove JsonScannerTest. It was a unit test for JsonScanner, but the d…

78ff699

…etails changed so much since then that the test is just wrong.

Handle escaped double-quotes in JSON key names in indexed documents.

2a6c01e

Fix jsonChunker::processBuffer to correctly set the offset in the n…

14b4631

…ew buffer that contains anything that hasn't been put into a chunk yet.

IndexedJsonDocument doesn't currently handle wildcards in paths, so i…

667b992

…f we detect one during a lookup, fall back on the previous behavior.

Correctly instantiate LazyJsonDocument when returning the result of `…

c106bf8

…IndexedJsonDocument::Lookup`

Store context in IndexedJsonDocument so we can use it in calls to Get…

e1d2148

…Bytes()

Add tests for JSON_VALUE, JSON_EXTRACT, and JSON_CONTAINS_PATH for In…

d4a876a

…dexedJsonDocument.

nicktobey requested a review from tbantle22 as a code owner June 12, 2024 02:12

Merge branch 'main' into nicktobey/json-addressmap

303526d

nicktobey force-pushed the nicktobey/json-addressmap branch from 8023dad to 303526d Compare June 12, 2024 02:13

Add copyright header to json_indexed_document_test.go

8113baa

nicktobey and others added 2 commits June 11, 2024 23:32

Add uvarint license.

c79d151

[ga-format-pr] Run go/utils/repofmt/format_repo.sh and go/Godeps/upda…

87ac817

…te.sh

nicktobey added 2 commits June 12, 2024 11:40

Fix jsonChunker::Done so that it finishes as soon at a chunk bounda…

890df78

…ry doesn't move.

Emit a warning if the scanner encounters json indexing metadata writt…

e4a6d6e

…en by a newer version of Dolt that it can't read.

tbantle22 removed their request for review June 12, 2024 19:03

nicktobey merged commit c660813 into main Jun 12, 2024
21 checks passed

nicktobey deleted the nicktobey/json-addressmap branch June 12, 2024 19:17

This was referenced Jun 13, 2024

dolt 1.40.0 Homebrew/homebrew-core#174440

Merged

dolt 1.40.1 Homebrew/homebrew-core#174531

Merged

nicktobey mentioned this pull request Feb 7, 2025

Expand ItemAccess::itemWidth to 32 bits #8831

Merged

BrewTestBot mentioned this pull request Feb 8, 2025

dolt 1.49.1 Homebrew/homebrew-core#206943

Merged
