
Build-a-Codec? #86

Open
warpfork opened this issue Sep 28, 2020 · 1 comment

@warpfork (Collaborator)

Some interesting patterns have appeared in downstream code as well as our own: codecs share some common structure and common segments of code, and then have a fairly small section that specializes or diverges and makes each codec unique.

In large part this is due to the use of the refmt Token/TokenSource/TokenSink types. (Which is interesting, because I also think the way we currently expose some of those interface details is not great; but apparently it does have some virtues and maybe I shouldn't be so hasty in wanting to rip it out or conceal it. Another issue will be made for that discussion, anyway.)
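
For orientation, the refmt types involved look roughly like this. This is a paraphrased sketch based on how they're used in the excerpt below, not the authoritative definitions (those live in github.com/polydawn/refmt, in the tok and shared packages):

    // Paraphrased sketch, not the real source -- see github.com/polydawn/refmt.
    // Token is a variant value: Type says which of the other fields is meaningful.
    type Token struct {
        Type    TokenType
        Length  int // length hint, used with TMapOpen / TArrOpen
        Bool    bool
        Int     int64
        Float64 float64
        Str     string
        Bytes   []byte
        // ... plus a few more fields elided here ...
    }

    // TokenType enumerates the token kinds: TMapOpen, TMapClose, TArrOpen,
    // TArrClose, TNull, TString, TBytes, TBool, TInt, TUint, TFloat64.
    type TokenType byte

    // A TokenSink consumes a stream one token at a time, and a TokenSource
    // produces one; 'done' reports that the stream is complete.
    type TokenSink interface {
        Step(consume *Token) (done bool, err error)
    }
    type TokenSource interface {
        Step(fill *Token) (done bool, err error)
    }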

This is probably best shown by a full example:

  • This is some code written by a project that wanted to mildly customize the JSON output of some of its structures: https://github.com/filecoin-project/statediff/blob/2240ddfdaf7372732948ac411691e87d5c04d7ca/codec/fcjson/marshal.go#L122-L127
  • For comparison: This is what the code for our dag-json serializer currently looks like:
    // (Excerpt from inside the marshal function; the earlier cases of this
    // switch -- map, list, null, etc. -- are elided here.)
    switch n.ReprKind() {
    case ipld.ReprKind_Bool:
        v, err := n.AsBool()
        if err != nil {
            return err
        }
        tk.Type = tok.TBool
        tk.Bool = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Int:
        v, err := n.AsInt()
        if err != nil {
            return err
        }
        tk.Type = tok.TInt
        tk.Int = int64(v)
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Float:
        v, err := n.AsFloat()
        if err != nil {
            return err
        }
        tk.Type = tok.TFloat64
        tk.Float64 = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_String:
        v, err := n.AsString()
        if err != nil {
            return err
        }
        tk.Type = tok.TString
        tk.Str = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Bytes:
        v, err := n.AsBytes()
        if err != nil {
            return err
        }
        tk.Type = tok.TBytes
        tk.Bytes = v
        _, err = sink.Step(&tk)
        return err
    case ipld.ReprKind_Link:
        v, err := n.AsLink()
        if err != nil {
            return err
        }
        switch lnk := v.(type) {
        case cidlink.Link:
            // Precisely four tokens to emit:
            tk.Type = tok.TMapOpen
            tk.Length = 1
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            tk.Type = tok.TString
            tk.Str = "/"
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            tk.Str = lnk.Cid.String()
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            tk.Type = tok.TMapClose
            if _, err = sink.Step(&tk); err != nil {
                return err
            }
            return nil
        default:
            return fmt.Errorf("schemafree link emission only supported by this codec for CID type links")
        }
    default:
        panic("unreachable")
    }
    } // closes the enclosing marshal function

The similarity is... almost 100%, as you can see. There's just a tiiiny divergence in one of the cases in one of the functions, which introduces some custom logic (in this example, it peeks at the concrete types and treats some of them a little specially).

Looking into the details a bit more...

  • Normally I'd say a project "shouldn't" need to do something like this: IPLD Schemas already offer a lot of ways to tune the isomorphism between logical data and a serialization-ready data model view of the same.
  • But in this example there's already a serialization -- one described with IPLD Schemas -- and what this developer wanted was another, distinct serialization that does not have the same token stream, which happens to serve the purpose of debug/human-readability. (It's the "not the same token stream" detail which makes switching to another known multicodec unsuitable for attaining the goal here. I think this is probably rare, but I'll let it fly unchallenged for the sake of this discussion.)
  • I'm perfectly happy with this outcome: the developer writes a custom codec, and uses it for their human-readable output presentation, and that's great.
  • This custom codec is not, and cannot be, a multicodec. It's reaching around for data that's not in the data model! If the function can't be specified in terms of the data model, and needs some other "special sauce" to be defined, then it's not a multicodec -- there'd never be a context-free way to morph the serial data back into data model, and thus it does not meet our criteria for multicodecs.

There are a couple of interesting things about this:

  1. I think it's interesting that we came up with a scenario where walking a thing as tokens was useful (and it happened to pretty much produce a codec... just not a multicodec).

  2. I think it's also interesting that we came up with a scenario where it would've been helpful to plug together some handler functions for various token kinds (together with a bunch of default handlers for every token kind we didn't have a special behavior in mind for) and synthesize a whole codec function out of it.

So.

Maybe... maybe having an API for traversals that's based on streams of tokens... is actually a useful idea, and something we should keep supporting. (Probably in a much-refactored form compared to the present, but nonetheless.)

And maybe... having some kind of build-a-codec gadget, which handles the token case-switching for you and composes callbacks you give it (while doing all the other fiddly bits, like memory budgeting, for you) could actually be useful. (We wouldn't necessarily use it for the core codecs, for performance reasons, but it could be plenty helpful for building custom prettyprinters or suchlike.)
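
To make that concrete, here's a hypothetical sketch of what such a gadget could look like. None of these names exist today (Handler, Overrides, BuildMarshaller, and defaultMarshal are all made up for illustration); it's just one shape the idea could take:

    import (
        "fmt"

        "github.com/ipld/go-ipld-prime"
        "github.com/polydawn/refmt/shared"
        "github.com/polydawn/refmt/tok"
    )

    // Hypothetical sketch -- none of these names exist in go-ipld-prime today.
    // A Handler emits tokens for one node whose kind matched its slot.
    type Handler func(n ipld.Node, sink shared.TokenSink) error

    // Overrides maps the kinds the user wants to specialize; every kind
    // left out falls back to the stock behavior (the big switch above).
    type Overrides map[ipld.ReprKind]Handler

    // BuildMarshaller composes the overrides with default handlers into a
    // single marshal function, keeping the case-switching (and, eventually,
    // fiddly bits like memory budgeting) in one shared place.
    func BuildMarshaller(overrides Overrides) func(ipld.Node, shared.TokenSink) error {
        var marshal func(ipld.Node, shared.TokenSink) error
        marshal = func(n ipld.Node, sink shared.TokenSink) error {
            if h := overrides[n.ReprKind()]; h != nil {
                return h(n, sink)
            }
            // Fall back to the stock behavior, recursing via 'marshal' so
            // that overrides also apply to map and list children.
            return defaultMarshal(n, sink, marshal)
        }
        return marshal
    }

    // defaultMarshal stands in for the stock switch shown earlier in this
    // issue; elided here for brevity.
    func defaultMarshal(n ipld.Node, sink shared.TokenSink, recurse func(ipld.Node, shared.TokenSink) error) error {
        // ... the big switch on n.ReprKind() from the dag-json excerpt ...
        panic("elided for brevity")
    }

A statediff-style customization then becomes a single callback, e.g. (again hypothetically) rendering bytes as a hex string instead of the default representation:

    marshal := BuildMarshaller(Overrides{
        ipld.ReprKind_Bytes: func(n ipld.Node, sink shared.TokenSink) error {
            v, err := n.AsBytes()
            if err != nil {
                return err
            }
            var tk tok.Token
            tk.Type = tok.TString
            tk.Str = fmt.Sprintf("%x", v)
            _, err = sink.Step(&tk)
            return err
        },
    })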

Or maybe this is all a bit too much. :) It's probably best to sit on this idea until another example use case comes along. Anyway, the notes are here now.

@willscott (Member)

I think the evolution of codecs like this one is likely "away from what is supported", such that it'll be hard to make a framework general enough to be performant, useful, and flexible enough to work for all cases.
Having a string or other fast path for type enumeration seems useful (versus doing type assertions, which seem like they're going to get expensive).
I suspect the other half of this, which will be intertwined but also breaks down the "stream of tokens" interface level, is going to be custom semantics about when to load/follow links while performing traversals.
