
Build-a-Codec? #86

Open
warpfork opened this issue Sep 28, 2020 · 1 comment

@warpfork (Collaborator)

Some interesting patterns have appeared in downstream code as well as our own: codecs share some common structure and common segments of code, and then have a fairly small section that specializes or diverges and makes each codec unique.

In large part this is due to the use of the refmt Token/TokenSource/TokenSink types. (Which is interesting, because I also think the way we currently expose some of those interface details is not great; but apparently it does have some virtues and maybe I shouldn't be so hasty in wanting to rip it out or conceal it. Another issue will be made for that discussion, anyway.)
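
For orientation, the refmt types involved look roughly like this. This is a paraphrased sketch based on how they're used in the excerpt below, not the authoritative definitions (those live in github.com/polydawn/refmt, in the tok and shared packages):

    // Paraphrased sketch, not the real source -- see github.com/polydawn/refmt.
    // Token is a variant value: Type says which of the other fields is meaningful.
    type Token struct {
        Type    TokenType
        Length  int // length hint, used with TMapOpen / TArrOpen
        Bool    bool
        Int     int64
        Float64 float64
        Str     string
        Bytes   []byte
        // ... plus a few more fields elided here ...
    }

    // TokenType enumerates the token kinds: TMapOpen, TMapClose, TArrOpen,
    // TArrClose, TNull, TString, TBytes, TBool, TInt, TUint, TFloat64.
    type TokenType byte

    // A TokenSink consumes a stream one token at a time, and a TokenSource
    // produces one; 'done' reports that the stream is complete.
    type TokenSink interface {
        Step(consume *Token) (done bool, err error)
    }
    type TokenSource interface {
        Step(fill *Token) (done bool, err error)
    }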

This is probably best shown by a full example:

  • This is some code written by a project that wanted to mildly customize the JSON output of some of its structures: https://github.com/filecoin-project/statediff/blob/2240ddfdaf7372732948ac411691e87d5c04d7ca/codec/fcjson/marshal.go#L122-L127
  • For comparison: This is what the code for our dag-json serializer currently looks like:
    // (Excerpt from inside the marshal function; the earlier cases of this
    // switch -- map, list, null, etc. -- are elided here.)
    switch n.ReprKind() {
    case ipld.ReprKind_Bool:
        v, err := n.AsBool()
        if err != nil {
            return err
        }
        tk.Type = tok.TBool
        tk.Bool = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Int:
        v, err := n.AsInt()
        if err != nil {
            return err
        }
        tk.Type = tok.TInt
        tk.Int = int64(v)
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Float:
        v, err := n.AsFloat()
        if err != nil {
            return err
        }
        tk.Type = tok.TFloat64
        tk.Float64 = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_String:
        v, err := n.AsString()
        if err != nil {
            return err
        }
        tk.Type = tok.TString
        tk.Str = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Bytes:
        v, err := n.AsBytes()
        if err != nil {
            return err
        }
        tk.Type = tok.TBytes
        tk.Bytes = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Link:
        v, err := n.AsLink()
        if err != nil {
            return err
        }
        switch lnk := v.(type) {
        case cidlink.Link:
            // Precisely four tokens to emit:
            tk.Type = tok.TMapOpen
            tk.Length = 1
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            tk.Type = tok.TString
            tk.Str = "/"
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            tk.Str = lnk.Cid.String()
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            tk.Type = tok.TMapClose
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            return nil
        default:
            return fmt.Errorf("schemafree link emission only supported by this codec for CID type links")
        }
    default:
        panic("unreachable")
    }
    } // closes the enclosing marshal function

The similarity is... almost 100%, as you can see. There's just a tiiiny divergence in one of the cases in one of the functions, which introduces some custom logic (in this example, it peeks at the concrete types and treats some of them a little specially).

Looking into the details a bit more...

  • Normally I'd say a project "shouldn't" need to do something like this: IPLD Schemas already offer a lot of ways to tune the isomorphism between logical data and a serialization-ready data model view of the same.
  • But in this example there's already a serialization -- one described with IPLD Schemas -- and what this developer wanted was another, distinct serialization that does not have the same token stream, which happens to serve the purpose of debug/human-readability. (It's the "not the same token stream" detail which makes switching to another known multicodec unsuitable for attaining the goal here. I think this is probably rare, but I'll let it fly unchallenged for the sake of this discussion.)
  • I'm perfectly happy with this outcome: the developer writes a custom codec, and uses it for their human-readable output presentation, and that's great.
  • This custom codec is not, and cannot be, a multicodec. It's reaching around for data that's not in the data model! If the function can't be specified in terms of the data model, and needs some other "special sauce" to be defined, then it's not a multicodec -- there'd never be a context-free way to morph the serial data back into data model, and thus it does not meet our criteria for multicodecs.

There are a couple of interesting things about this:

  1. I think it's interesting that we came up with a scenario where walking a thing as tokens was useful (and it happened to pretty much produce a codec... just not a multicodec).

  2. I think it's also interesting that we came up with a scenario where it would've been helpful to plug together some handler functions for various token kinds (together with a bunch of default handlers for every token kind we didn't have a special behavior in mind for) and synthesize a whole codec function out of it.

So.

Maybe... maybe having an API for traversals that's based on streams of tokens... is actually a useful idea, and something we should keep supporting. (Probably in a much-refactored form compared to the present, but nonetheless.)

And maybe... having some kind of build-a-codec gadget, which handles the token case-switching for you and composes callbacks you give it (while doing all the other fiddly bits, like memory budgeting, for you) could actually be useful. (We wouldn't necessarily use it for the core codecs, for performance reasons, but it could be plenty helpful for building custom prettyprinters or suchlike.)
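
To make that concrete, here's a hypothetical sketch of what such a gadget could look like. None of these names exist today (Handler, Overrides, BuildMarshaller, and defaultMarshal are all made up for illustration); it's just one shape the idea could take:

    import (
        "fmt"

        "github.com/ipld/go-ipld-prime"
        "github.com/polydawn/refmt/shared"
        "github.com/polydawn/refmt/tok"
    )

    // Hypothetical sketch -- none of these names exist in go-ipld-prime today.
    // A Handler emits tokens for one node whose kind matched its slot.
    type Handler func(n ipld.Node, sink shared.TokenSink) error

    // Overrides maps the kinds the user wants to specialize; every kind
    // left out falls back to the stock behavior (the big switch above).
    type Overrides map[ipld.ReprKind]Handler

    // BuildMarshaller composes the overrides with default handlers into a
    // single marshal function, keeping the case-switching (and, eventually,
    // fiddly bits like memory budgeting) in one shared place.
    func BuildMarshaller(overrides Overrides) func(ipld.Node, shared.TokenSink) error {
        var marshal func(ipld.Node, shared.TokenSink) error
        marshal = func(n ipld.Node, sink shared.TokenSink) error {
            if h := overrides[n.ReprKind()]; h != nil {
                return h(n, sink)
            }
            // Fall back to the stock behavior, recursing via 'marshal' so
            // that overrides also apply to map and list children.
            return defaultMarshal(n, sink, marshal)
        }
        return marshal
    }

    // defaultMarshal stands in for the stock switch shown earlier in this
    // issue; elided here for brevity.
    func defaultMarshal(n ipld.Node, sink shared.TokenSink, recurse func(ipld.Node, shared.TokenSink) error) error {
        // ... the big switch on n.ReprKind() from the dag-json excerpt ...
        panic("elided for brevity")
    }

A statediff-style customization then becomes a single callback, e.g. (again hypothetically) rendering bytes as a hex string instead of the default representation:

    marshal := BuildMarshaller(Overrides{
        ipld.ReprKind_Bytes: func(n ipld.Node, sink shared.TokenSink) error {
            v, err := n.AsBytes()
            if err != nil {
                return err
            }
            var tk tok.Token
            tk.Type = tok.TString
            tk.Str = fmt.Sprintf("%x", v)
            _, err = sink.Step(&tk)
            return err
        },
    })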

Or maybe this is all a bit too much. :) It's probably best to sit on this idea until another example use case comes along. Anyway, the notes are here now.

@willscott (Member)

I think the evolution of codecs like this one is likely "away from what is supported", such that it'll be hard to make a framework general enough to be performant, useful, and flexible enough to work for all cases.
Having a string or other fast path for type enumeration seems useful (versus doing type assertions, which seem like they're going to get expensive).
I suspect the other half of this, which will be intertwined but also breaks down the "stream of tokens" interface level, is going to be custom semantics about when to load/follow links while performing traversals.
