
Codec revamp #112

Merged: 3 commits merged from the codec-revamp branch into master on Dec 1, 2020

Conversation

warpfork (Collaborator) commented Nov 14, 2020

Alright, this is kind of a lot.

This introduces some new decoders. I'll probably keep appending to this PR for a while until it includes encoders as well.

The aim is, by the end of this revamp, to be able to do something about the linking API situation (#55),
and in particular to start approaching the API drafts in https://gist.github.com/warpfork/c0200cc4d99ee36ba5ce5a612f1d1a22 ... which involves normalizing the codec interfaces rather more than they currently are.

While we're at it... why not fix everything? Scope control is for chumps, amiright?

(I'll regret saying that, of course. I already do. But, the code that bites off a lot of fixes at once is already brewing here, so, here we go...)

So: performance, API consistency, better configurability, improved strictness, and more, are all goals in this changeset. Whee!
(We'll probably get to some of those, and leave some others as strategically deferred TODOs; hopefully the latter at least get closer into sight for future work.)


The first commit in this stack adds a scratch.Reader tool, which is helpful for decoders.

The docs in the diff should cover it pretty well.
It's a reader-wrapper that does a lot of extremely common
buffering and small-read operations that parsers tend to need.
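
To give a flavor of what that means in practice, here's a rough sketch of the kind of surface such a wrapper offers. (Illustrative only -- the method names and internals here are hypothetical; the docs in the diff describe the real thing.)

```go
// Illustrative sketch of a parser-oriented reader wrapper; not the actual
// scratch.Reader API.  The point is the shape: small-read helpers over a
// retained buffer, so hot parsing loops don't allocate.
package scratch

import "io"

type Reader struct {
	stream io.Reader // non-nil in streaming mode
	buf    []byte    // the whole input in bytes mode; a refill buffer otherwise
	pos    int       // cursor into buf
}

// Readn1 returns the next single byte -- the workhorse of any tokenizer.
func (z *Reader) Readn1() (byte, error) {
	if z.pos >= len(z.buf) {
		if err := z.fill(); err != nil {
			return 0, err
		}
	}
	b := z.buf[z.pos]
	z.pos++
	return b, nil
}

// Unreadn1 steps the cursor back one byte, e.g. after peeking past a
// delimiter; in a buffer-backed design this is just arithmetic.
func (z *Reader) Unreadn1() { z.pos-- }

// fill refills buf from the stream; in bytes mode there's nothing to refill.
func (z *Reader) fill() error {
	if z.stream == nil {
		return io.EOF
	}
	var tmp [512]byte
	n, err := z.stream.Read(tmp[:])
	if n > 0 {
		z.buf = append(z.buf[:0], tmp[:n]...) // reuses capacity after warm-up
		z.pos = 0
		return nil
	}
	if err == nil {
		err = io.ErrNoProgress
	}
	return err
}
```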

The tests include aggressive checks that operations take exactly as
many allocations as planned -- and mostly, that's zero.
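
In Go, that kind of check can lean on the standard library's testing.AllocsPerRun. A minimal sketch (using the hypothetical names from the sketch above, not the real API):

```go
package scratch

import "testing"

func TestReadn1AllocatesNothing(t *testing.T) {
	z := &Reader{buf: []byte(`{"hello":"world"}`)} // bytes mode
	allocs := testing.AllocsPerRun(100, func() {
		if _, err := z.Readn1(); err != nil {
			t.Fatal(err)
		}
		z.Unreadn1() // step back so the fixture never runs dry
	})
	if allocs != 0 {
		t.Errorf("expected zero allocations per read, got %v", allocs)
	}
}
```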


The second commit is a giant revamped DAG-JSON decoder and unmarshaller.

This is added in a new "dagjson2" package for the time being,
but aims to replace the current dagjson package entirely,
and will take over that namespace when complete.

So far only the decoder/unmarshaller is included in this first commit,
and the encoder/marshaller is still coming up.

Note that a lot of the state machine code is partially-evolved code from refmt/json (which is also what backs the old dagjson package we're aiming to replace). That code is probably not the main focus of interest here, since what we're improving is how it all fits together... but unfortunately I don't see a way to keep the diffs separated for ease of review.

This revamp is making several major strides:

  • The decoding system is cleanly separated from the tree building.
    (The tree-building was all introduced in #101, "Fresh take on codec
    APIs, and some tokenization utilities".)

  • The tree building reuses the codectools token assembler systems.
    This saves a lot of code, and adds a lot of consistency.
    (By contrast, the older dagjson and dagcbor packages had similar
    outlines, but didn't actually share much code; this was annoying
    to maintain, and meant improvements to one needed to be ported
    to the other manually. No more.)

  • The token type used by this codectools system is more tightly
    associated with the IPLD Data Model. In practice, what this means
    is links are parsed at the same stage as the rest of parsing,
    rather than being added on in an awkward "parse 1.5" stage.
    This results in much less complicated code than the old token
    system from refmt which the older dagjson package leans on.

  • Budgets are more consistently woven through this system.

  • The JSON decoder components are in their own sub-package,
    and should be relatively reusable. Some features like string parsing
    are exported in their own right, in addition to being accessible
    via the full recursive supports-everything decoders.
    (This might not often be compelling, but -- maybe. I myself wanted
    more reusable access to fine-grained decoder and encoder components
    when I was working on the "JST" experiment, so, I'm scratching my
    own itch here if nothing else.)
    End-users should mostly not need to see this, but library
    implementors might appreciate it.

  • The codectools scratch.Reader type is used in all the decoder APIs.
    This results in good performance for either streaming io.Reader or
    already-in-memory bytes slices as data sources, and does it without
    doubling the number of exported functions we need (or pushing the
    need for feature detection into every single exported function).

  • The configuration system for the decoder is actually in this repo,
    and it's sanely and clearly settable while also being optional.
    Previously, if you wanted to configure dagjson, you'd have to reach
    into the refmt json package for those configuration structs,
    which was workable but just very confusing and gave the end-user a
    lot of different places to look before finding what they need.

  • The implementations are very mindful of memory allocation efficiency.
    Almost all of the component structures carefully utilize embedding:
    the ReusableUnmarshaller embeds the Decoder; the Decoder embeds the
    scratch.Reader as well as the Token it yields; etc.
    (There's a sketch of this layout just below this list.)
    This should result in overall being able to produce fully usable
    codecs with a minimal number of allocations -- far fewer than the
    older implementations required.
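
To make the embedding point concrete, the rough shape is sketched below (names are paraphrased and the real structs carry more state than shown):

```go
// Illustrative sketch of the embedding strategy described above; not the
// exact types from the diff.
package codecsketch

// Token is the value the decoder yields at each step; it's embedded so it
// can be reused rather than reallocated per token.
type Token struct {
	// kind tag, scalar payloads, and position/progress fields live here.
}

// Reader stands in for the codectools scratch.Reader.
type Reader struct {
	// buffering state, as sketched earlier.
}

// Decoder embeds its reader and its one reusable Token by value, so all of
// that state lives inline in a single struct.
type Decoder struct {
	Reader
	Token
}

// ReusableUnmarshaller embeds the whole Decoder in turn: constructing one
// of these allocates the entire stack at once, and resetting it between
// uses amortizes that single allocation across many decodes.
type ReusableUnmarshaller struct {
	Decoder
	// ...configuration and NodeAssembler hookup elided...
}
```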

Some benefits have yet to be realized, but are on the map now:

  • The new Token structure also includes space for position and
    progress tracking, which we want to use to produce better errors.
    (This needs more implementation work, still, though.)

  • There are several configuration options for strictness.
    These aren't all backed up by the actual implementation yet
    (I'm porting over old code fast enough to write a demo and make
    sure the whole suite of interfaces works; it'll require further
    work, especially on this strictness front, later), but
    at the very least these are now getting documented,
    and several comment blocks point to where more work is needed.
    (A sketch of roughly how this configuration reads follows this list.)

  • The new multicodec registry is alluded to in comments here, but
    isn't implemented yet. This will be part of the long game big goal
    about API normalization and linking cleanup, though.
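
For a sense of the shape that strictness configuration takes, here's a sketch with invented option names (not the final API):

```go
// Hypothetical sketch of decoder strictness configuration; these field
// names are invented for illustration.  The zero value means "strict",
// which is what keeps the whole configuration optional.
type DecodeOptions struct {
	AllowDuplicateMapKeys bool // strict DAG-JSON rejects repeated keys
	AllowTrailingBytes    bool // strict decoding requires the value to consume the input
	DontParseLinks        bool // by default, {"/": "..."} maps become Data Model links
}
```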


Benchmarks are still forthcoming. My recollection from the previous bout of this in refmt was that microbenchmarking (especially of things like scratch.Reader) wasn't a great use of time, because when we start benchmarking codecs as a whole, and especially, when looking at the pprof reports from that, we'll see things like scratch.Reader showing up plain as day there, and nicely contextualized... so, it's a good use of time to just save our efforts for that point. (This PR as a whole is probably getting to the point where we could actually start doing those benchmarks over the whole decoder now, but, alas, my brain ran out of gas. Will retry soon.)


As usual, the naming of things should be considered to be a bit in flux.
In particular, a bunch of things are still called "codectools", even though that naming was already in question back in #101, and we're looking for alternatives.
We might do something about that soon, but this PR is still running forward with those names. (Maybe by the end of this PR, we'll have figured out enough of the big picture to pick better names, and can then fire lazers on that. We'll see.)

warpfork force-pushed the codec-revamp branch 2 times, most recently from 4f21398 to ea4e83a, on November 14, 2020 at 17:32
warpfork (Collaborator, Author) commented:

This might be a little premature for serious review -- among other things, I just noticed I didn't actually finish fleshing out the parsing for links, so the claim that this is cleaner than the old packages on that detail is pretty speculative (eep) -- but I wanted to at least get it out here and visible as the second half of the direction I was dreaming of going with #101 . So, read and review at this point only if you dare :)

The commit message for the scratch.Reader commit adds some background:

This emerges from some older generation of code in refmt with similar purpose:
https://github.com/polydawn/refmt/blob/master/shared/reader.go
Unlike those antecedents, this one is a single concrete implementation,
rather than using interfaces to allow switching between the two major modes of use.
This is surely uglier code, but I think the result is more optimizable.
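
Concretely, "single concrete implementation" means both modes of use live in one struct, and initialization just sets different fields. Extending the earlier hypothetical reader sketch:

```go
// Continues the hypothetical Reader sketch from the PR description above
// (same package, same invented names).  One concrete type serves both
// modes; methods branch on whether stream is nil.  That branch is cheap
// and keeps call sites monomorphic (and thus inlinable), which routing
// through an interface would defeat.

// InitSlice points the reader at already-in-memory input, zero-copy.
func (z *Reader) InitSlice(b []byte) { *z = Reader{buf: b} }

// InitReader points the reader at a stream, buffered incrementally.
func (z *Reader) InitReader(r io.Reader) { *z = Reader{stream: r} }
```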

The tests include aggressive checks that operations take exactly as
many allocations as planned -- and mostly, that's *zero*.
I dearly wish this wasn't such a dark art.
But I really want these tests, too.

In the next couple of commits, I'll be adding parsers which use this.

warpfork (Collaborator, Author) commented Dec 1, 2020

Gonna use master for work-in-progress on this, and merge even though the work will be continuing. Since it's in its own package, it should be possible to do this non-disruptively.

(It's desirable to do this because we want to do some core interface changes and make a v0.7 with those changes happen soon -- on a shorter timeline than we are going to prioritize the remainder of this codec rework.)

warpfork merged commit 2aa907a into master on Dec 1, 2020
warpfork deleted the codec-revamp branch on December 1, 2020 at 03:35
warpfork added a commit that referenced this pull request Dec 1, 2020
The original idea of this branch was to explore some reusable
components for codecs; maybe to actually get a useful
prettyprinter; and... actually turned a lot into being a bit of
practical discovery about string encoding and escaping.

Some of those goals were partially accomplished, but in general,
this seems better to chalk up as a learning experience.
#89 (comment)
already does a good job of discussing why, and what was learned.

A lot of the reusable codec components stuff has also now
shaken out, just... in other PRs that were written after the learnings
here.  Namely, #101
was able to introduce some tree transformers; and then
#112 demonstrates how those
can compose into a complete codec end to end.
There's still work to go on these, too, but they seem to have already
grabbed the concept of reusable parts I was hoping for here and gotten
farther with it, so.

These diffs are interesting enough I want to keep them referencable
in history.  But I'm merging them with the "-s ours" strategy, so that
the diffs don't actually land any impact on master.  These commits
are for reference only.
aschmahmann mentioned this pull request on May 14, 2021