
Codec revamp #112

Merged: 3 commits merged from the codec-revamp branch into master on Dec 1, 2020

Conversation

warpfork (Collaborator) commented Nov 14, 2020

Alright, this is kind of a lot.

This introduces some new decoders. I'll probably keep appending to this PR for a while until it includes encoders as well.

The aim is, by the end of this revamp, to be able to do something about the linking API situation (#55),
and in particular to start approaching the API drafts in https://gist.github.com/warpfork/c0200cc4d99ee36ba5ce5a612f1d1a22 ... which involves normalizing the codec interfaces rather more than they currently are.

While we're at it... why not fix everything? Scope control is for chumps, amiright?

(I'll regret saying that, of course. I already do. But, the code that bites off a lot of fixes at once is already brewing here, so, here we go...)

So: performance, API consistency, better configurability, improved strictness, and more, are all goals in this changeset. Whee!
(We'll probably get to some of those, and leave some others as strategically deferred TODOs; hopefully the latter at least get closer into sight for future work.)


The first commit in this stack adds a scratch.Reader tool, which is helpful for decoders.

The docs in the diff should cover it pretty well.
It's a reader-wrapper that does a lot of extremely common
buffering and small-read operations that parsers tend to need.
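
To give a flavor of what that means in practice, here's a rough sketch of the kind of surface such a wrapper offers. (Illustrative only -- the method names and internals here are hypothetical; the docs in the diff describe the real thing.)

```go
// Illustrative sketch of a parser-oriented reader wrapper; not the actual
// scratch.Reader API.  The point is the shape: small-read helpers over a
// retained buffer, so hot parsing loops don't allocate.
package scratch

import "io"

type Reader struct {
	stream io.Reader // non-nil in streaming mode
	buf    []byte    // the whole input in bytes mode; a refill buffer otherwise
	pos    int       // cursor into buf
}

// Readn1 returns the next single byte -- the workhorse of any tokenizer.
func (z *Reader) Readn1() (byte, error) {
	if z.pos >= len(z.buf) {
		if err := z.fill(); err != nil {
			return 0, err
		}
	}
	b := z.buf[z.pos]
	z.pos++
	return b, nil
}

// Unreadn1 steps the cursor back one byte, e.g. after peeking past a
// delimiter; in a buffer-backed design this is just arithmetic.
func (z *Reader) Unreadn1() { z.pos-- }

// fill refills buf from the stream; in bytes mode there's nothing to refill.
func (z *Reader) fill() error {
	if z.stream == nil {
		return io.EOF
	}
	var tmp [512]byte
	n, err := z.stream.Read(tmp[:])
	if n > 0 {
		z.buf = append(z.buf[:0], tmp[:n]...) // reuses capacity after warm-up
		z.pos = 0
		return nil
	}
	if err == nil {
		err = io.ErrNoProgress
	}
	return err
}
```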

The tests include aggressive checks that operations take exactly as
many allocations as planned -- and mostly, that's zero.
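
In Go, that kind of check can lean on the standard library's testing.AllocsPerRun. A minimal sketch (using the hypothetical names from the sketch above, not the real API):

```go
package scratch

import "testing"

func TestReadn1AllocatesNothing(t *testing.T) {
	z := &Reader{buf: []byte(`{"hello":"world"}`)} // bytes mode
	allocs := testing.AllocsPerRun(100, func() {
		if _, err := z.Readn1(); err != nil {
			t.Fatal(err)
		}
		z.Unreadn1() // step back so the fixture never runs dry
	})
	if allocs != 0 {
		t.Errorf("expected zero allocations per read, got %v", allocs)
	}
}
```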


The second commit is a giant revamped DAG-JSON decoder and unmarshaller.

This is added in a new "dagjson2" package for the time being,
but aims to replace the current dagjson package entirely,
and will take over that namespace when complete.

So far only the decoder/unmarshaller is included in this first commit,
and the encoder/marshaller is still coming up.

Note that a lot of the state machine code is partially-evolved code from refmt/json (which is also what backs the old dagjson package we're aiming to replace). That code is probably not the main focus of interest here, since what we're improving is how it all fits together... but unfortunately I don't see a way to keep the diffs separated for ease of review.

This revamp is making several major strides:

  • The decoding system is cleanly separated from the tree building.
    (The tree-building was all introduced in #101, "Fresh take on codec
    APIs, and some tokenization utilities".)

  • The tree building reuses the codectools token assembler systems.
    This saves a lot of code, and adds a lot of consistency.
    (By contrast, the older dagjson and dagcbor packages had similar
    outlines, but didn't actually share much code; this was annoying
    to maintain, and meant improvements to one needed to be ported
    to the other manually. No more.)

  • The token type used by this codectools system is more tightly
    associated with the IPLD Data Model. In practice, what this means
    is links are parsed at the same stage as the rest of parsing,
    rather than being added on in an awkward "parse 1.5" stage.
    This results in much less complicated code than the old token
    system from refmt which the older dagjson package leans on.

  • Budgets are more consistently woven through this system.

  • The JSON decoder components are in their own sub-package,
    and should be relatively reusable. Some features like string parsing
    are exported in their own right, in addition to being accessible
    via the full recursive supports-everything decoders.
    (This might not often be compelling, but -- maybe. I myself wanted
    more reusable access to fine-grained decoder and encoder components
    when I was working on the "JST" experiment, so, I'm scratching my
    own itch here if nothing else.)
    End-users should mostly not need to see this, but library
    implementors might appreciate it.

  • The codectools scratch.Reader type is used in all the decoder APIs.
    This results in good performance for either streaming io.Reader or
    already-in-memory bytes slices as data sources, and does it without
    doubling the number of exported functions we need (or pushing the
    need for feature detection into every single exported function).

  • The configuration system for the decoder is actually in this repo,
    and it's sanely and clearly settable while also being optional.
    Previously, if you wanted to configure dagjson, you'd have to reach
    into the refmt json package for those configuration structs,
    which was workable but just very confusing and gave the end-user a
    lot of different places to look before finding what they need.

  • The implementations are very mindful of memory allocation efficiency.
    Almost all of the component structures carefully utilize embedding:
    the ReusableUnmarshaller embeds the Decoder; the Decoder embeds the
    scratch.Reader as well as the Token it yields; etc.
    (There's a sketch of this layout just below this list.)
    This should result in overall being able to produce fully usable
    codecs with a minimal number of allocations -- far fewer than the
    older implementations required.
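
To make the embedding point concrete, the rough shape is sketched below (names are paraphrased and the real structs carry more state than shown):

```go
// Illustrative sketch of the embedding strategy described above; not the
// exact types from the diff.
package codecsketch

// Token is the value the decoder yields at each step; it's embedded so it
// can be reused rather than reallocated per token.
type Token struct {
	// kind tag, scalar payloads, and position/progress fields live here.
}

// Reader stands in for the codectools scratch.Reader.
type Reader struct {
	// buffering state, as sketched earlier.
}

// Decoder embeds its reader and its one reusable Token by value, so all of
// that state lives inline in a single struct.
type Decoder struct {
	Reader
	Token
}

// ReusableUnmarshaller embeds the whole Decoder in turn: constructing one
// of these allocates the entire stack at once, and resetting it between
// uses amortizes that single allocation across many decodes.
type ReusableUnmarshaller struct {
	Decoder
	// ...configuration and NodeAssembler hookup elided...
}
```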

Some benefits have yet to be realized, but are on the map now:

  • The new Token structure also includes space for position and
    progress tracking, which we want to use to produce better errors.
    (This needs more implementation work, still, though.)

  • There are several configuration options for strictness.
    These aren't all backed up by the actual implementation yet
    (I'm porting over old code fast enough to write a demo and make
    sure the whole suite of interfaces works; it'll require further
    work, especially on this strictness front, later), but
    at the very least these are now getting documented,
    and several comment blocks point to where more work is needed.
    (A sketch of roughly how this configuration reads follows this list.)

  • The new multicodec registry is alluded to in comments here, but
    isn't implemented yet. This will be part of the long game big goal
    about API normalization and linking cleanup, though.
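
For a sense of the shape that strictness configuration takes, here's a sketch with invented option names (not the final API):

```go
// Hypothetical sketch of decoder strictness configuration; these field
// names are invented for illustration.  The zero value means "strict",
// which is what keeps the whole configuration optional.
type DecodeOptions struct {
	AllowDuplicateMapKeys bool // strict DAG-JSON rejects repeated keys
	AllowTrailingBytes    bool // strict decoding requires the value to consume the input
	DontParseLinks        bool // by default, {"/": "..."} maps become Data Model links
}
```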


Benchmarks are still forthcoming. My recollection from the previous bout of this in refmt was that microbenchmarking (especially of things like scratch.Reader) wasn't a great use of time, because when we start benchmarking codecs as a whole, and especially, when looking at the pprof reports from that, we'll see things like scratch.Reader showing up plain as day there, and nicely contextualized... so, it's a good use of time to just save our efforts for that point. (This PR as a whole is probably getting to the point where we could actually start doing those benchmarks over the whole decoder now, but, alas, my brain ran out of gas. Will retry soon.)


As usual, the naming of things should be considered to be a bit in flux.
In particular, a bunch of things are still called "codectools", even though that naming was already in question back in #101, and we're looking for alternatives.
We might do something about that soon, but this PR is still running forward with those names. (Maybe by the end of this PR, we'll have figured out enough of the big picture to pick better names, and can then fire lazers on that. We'll see.)

warpfork force-pushed the codec-revamp branch 2 times, most recently from 4f21398 to ea4e83a, on November 14, 2020 at 17:32
warpfork (Collaborator, Author) commented:

This might be a little premature for serious review -- among other things, I just noticed I didn't actually finish fleshing out the parsing for links, so the claim that this is cleaner than the old packages on that detail is pretty speculative (eep) -- but I wanted to at least get it out here and visible as the second half of the direction I was dreaming of going with #101 . So, read and review at this point only if you dare :)

The commit message for the scratch.Reader commit adds some background:

This emerges from some older generation of code in refmt with similar purpose:
https://github.com/polydawn/refmt/blob/master/shared/reader.go
Unlike those antecedents, this one is a single concrete implementation,
rather than using interfaces to allow switching between the two major modes of use.
This is surely uglier code, but I think the result is more optimizable.
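
Concretely, "single concrete implementation" means both modes of use live in one struct, and initialization just sets different fields. Extending the earlier hypothetical reader sketch:

```go
// Continues the hypothetical Reader sketch from the PR description above
// (same package, same invented names).  One concrete type serves both
// modes; methods branch on whether stream is nil.  That branch is cheap
// and keeps call sites monomorphic (and thus inlinable), which routing
// through an interface would defeat.

// InitSlice points the reader at already-in-memory input, zero-copy.
func (z *Reader) InitSlice(b []byte) { *z = Reader{buf: b} }

// InitReader points the reader at a stream, buffered incrementally.
func (z *Reader) InitReader(r io.Reader) { *z = Reader{stream: r} }
```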

The tests include aggressive checks that operations take exactly as
many allocations as planned -- and mostly, that's *zero*.
I dearly wish this wasn't such a dark art.
But I really want these tests, too.

In the next couple of commits, I'll be adding parsers which use this.

warpfork (Collaborator, Author) commented Dec 1, 2020

Gonna use master for work-in-progress on this, and merge even though the work will be continuing. Since it's in its own package, it should be possible to do this non-disruptively.

(It's desirable to do this because we want to do some core interface changes and make a v0.7 with those changes happen soon -- on a shorter timeline than we are going to prioritize the remainder of this codec rework.)

warpfork merged commit 2aa907a into master on Dec 1, 2020
warpfork deleted the codec-revamp branch on December 1, 2020 at 03:35
warpfork added a commit that referenced this pull request Dec 1, 2020
The original idea of this branch was to explore some reusable
components for codecs; maybe to actually get a useful
prettyprinter; and... actually turned a lot into being a bit of
practical discovery about string encoding and escaping.

Some of those goals were partially accomplished, but in general,
this seems better to chalk up as a learning experience.
#89 (comment)
already does a good job of discussing why, and what was learned.

A lot of the reusable codec components stuff has also now
shaken out, just... in other PRs that were written after the learnings
here.  Namely, #101
was able to introduce some tree transformers; and then
#112 demonstrates how those
can compose into a complete codec end to end.
There's still work to go on these, too, but they seem to have already
grabbed the concept of reusable parts I was hoping for here and gotten
farther with it, so.

These diffs are interesting enough I want to keep them referencable
in history.  But I'm merging them with the "-s ours" strategy, so that
the diffs don't actually land any impact on master.  These commits
are for reference only.
aschmahmann mentioned this pull request on May 14, 2021