Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce LinkSystem #143

Merged
merged 10 commits into from
Mar 12, 2021
Merged

introduce LinkSystem #143

merged 10 commits into from
Mar 12, 2021

Conversation

warpfork
Copy link
Collaborator

@warpfork warpfork commented Feb 5, 2021

This is the long-awaited address to #55 .

The code is largely based upon the draft earlier published in gist form: https://gist.github.com/warpfork/c0200cc4d99ee36ba5ce5a612f1d1a22

Key feature: there is now an ipld.LinkSystem type, and it gives us a place to compose all the different things that go into storing and loading links (getting encoders and decoders; getting hashers; and getting the actual I/O streams)... and then it gives us simple Store and Load methods that do all the plumbing and present a nice simple interface to end-users.

This means the ipld.Link interface does much less work. Now it's just a data holder. (Previously it had a Load method which did a lot of work, but also was quite convoluted to use because it tied so many things together.)

We can also now easily keep adding more helper methods to LinkSystem, without complicating the Link interface or anything else. For example, we already now have a Fill method and a Load method, which gives the user the option to either populate data into a NodeAssembler, or, just do the whole the whole operation and return a Node. The latter (Load) is essentially syntactic sugar for a sequence of nb := NewBuilder(); linksystem.Fill(nb); n := nb.Build();, while the former gives the controller more control over allocations. Previously, adding helper methods like this would've involved changing the (very central) Link interface, which would be a huge pain because of how many places it appears; now, it's easy to add helper methods like this.

The ipld.Storer and ipld.Loader function interfaces are now ipld.BlockWriteOpener and ipld.BlockReadOpener, respectively. They do roughly the same job as before, but figuring out where to use them should be simpler: now you just put both of them in a LinkSystem, and then use that.


multihashes: Along the way, I did some some additional work to distance ourselves further from the go-multihash package and the way it forms unconditional dependencies to several other repositories for functionality that end-users might not always want. This was prompted because the LinkSystem design detail of a HasherChooser function prompts us to want to use standard hash.Hash APIs. It also gets us the ability to use hashes in a streaming way (stdlib hash.Hash is streaming; go-multihash is, oddly, not). go-multihash is not removed from our dependency tree (that's going to be a fair bit more slog), so those dependencies are still with us transitively -- but finishing that separation in the future is now much closer to tractable.

This means we have new default multihash registry, rather than leaning on what's hardcoded in go-multihash. Upsides: can register new things; can choose your own dependencies. Downsides: this might produce a bit of a bumpy migration, because this registry has fewer things in it by default.


multicodecs: The multicodec registry map also moved. Semantics are pretty much the same, though. It's now in the codec package instead of the linking/cid package. This'll be a breaking change for any packages implementing codecs and registering themselves in this map (but a very easy one to fix). It means packages implementing codecs can do so without being forced into a transitive dependency on go-cid (and all of its many further-transitive dependencies). (Some codec packages will depend on go-cid anyway -- but that should be their choice, and now it will be.)


storage: I added a storage package! This contains some (very) basic gadgets that provide functions matching the ipld.BlockWriteOpener and ipld.BlockReadOpener interfaces.

The only thing in here right now is a simple in-memory storage heap. We'll still probably expect serious applications to get a more serious storage system, and we'll probably expect those more serious storage systems to live in other repos. But it's nice to have something "batteries included" for demos and for tests.

(Previously there were some functions scattered around the tests and examples that did roughly the same things, but since they were just defined at their use site, it was hard to attach documentation to them or use them as a reference. Now it's easier.)


Most examples are updated. I'm sure we could still use more of them, though.

@warpfork
Copy link
Collaborator Author

warpfork commented Feb 5, 2021

(This isn't quite ready to merge yet, but it's getting close. When it does merge, I think it's going to be grounds for tagging a v0.9.0 -- this is a sufficiently important interface change we'll want to start propagating it without delay. Also, I just plain want these simplifications before I start any new work downstream of this library that uses linking!)

Copy link
Member

@willscott willscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, big fan of this setup 👍

// Functions in the traversal package will set this automatically.
ParentNode Node

// REVIEW: ParentNode in LinkContext -- so far, this has only ever been hypothetically useful. Keep or drop?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree that it's unclear the special case here is useful. either the path should be enough or we would need arbitrary ancestor / dag access from the context.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've ended up squashing most of my work-in-progress commits together, so it's now invisible, but it might be interesting to note: I originally did comment this out entirely... but by the time I finished migrating the traversal package, it turned out so easy to keep it that I did.

So, this doesn't really argue against dropping it. But it turned out not much work (at the moment, anyway) to keep it, either.

linking.go Outdated
// like multihashType, multicodecType, and cidVersion, for example.
type LinkBuilder interface {
Build(context.Context, LinkContext, Node, Storer) (Link, error)
func (ls *LinkSystem) Load(lnkCtx LinkContext, lnk Link, np NodePrototype) (Node, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe define these with the definition of LinkSystem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one design thing this setup leaves open that is a bit worrying is that as it gets used for applied subsets of IPLD dags, we'll end up with implementations that don't fully fill out LinkContext, and will be fine for their initial use case, but will then end up in designs where tools aren't generally applicable to IPLD data because of that specificity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super worried? The documentation does make an explicit statement that a zero-value LinkContext is to be considered acceptable. Also, that values in a LinkContext should only ever affect data availability, not actual content.

I can't enforce this programmatically, of course; that would require some kind of magic. But: that's sort of inevitable. Even if we took out LinkContext entirely, we'd still need a context.Context here... and that still has the arbitrary attachment feature. (Which is actually desired. I fully expect someone to use this to tunnel auth credentials, at some point.)

I don't really expect people to have a strong incentive to do unlawful/inadvisable things here, either.

(Although... speaking of "only data availability effects"... that does remind me that we need to get a lot more buckled down on well-typed errors throughout this whole library.)

Comment on lines +13 to +34
StorageWriteOpener BlockWriteOpener
StorageReadOpener BlockReadOpener
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review wanted on naming here. (Both the functions, and the field names.)

@hannahhoward had previously commented on them being a bit of a mouthful, but rather liking that they mention "blocks" now. I think I've improved the names a bit since the draft she was looking at, but they're still a mouthful.

Probably I should s/Storage/Block/ on the field names, so they're all just consistent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unrelated, and perhaps a bit of a nit: I don't think func fields should be named like interfaces. It's far more natural to call lsys.ChooseEncoder(...) or lsys.EncoderChoose(...) than lsys.EncoderChooser(...). The result of that call is an encoder, not an encoder chooser :) A bit like if strings.Repeat were called strings.Repeater.

Interfaces are different, because you call methods on them, and it's good to differentiate the interface name from the method name. You just have one name here, so in my opinion it should be what best fits the method-like function calls.

@warpfork
Copy link
Collaborator Author

warpfork commented Feb 8, 2021

... Okay, I think this actually is ready for review after all.

There's one more inconsistency I'd like to resolve, but will maybe punt to a subsequent PR: some additional hash functions from non-stdlib are still included in the new multicodec table (specifically, sha3), which kind of runs contrary to the goal of moving towards "only what you need" in the dependency tree. They're there because... I had used them in examples. I'd like to update those examples, but I think I'd also like to update them to sha2-384 while I'm at it, which... pends multiformats/multicodec#205 .

@warpfork
Copy link
Collaborator Author

warpfork commented Feb 8, 2021

I didn't start using https://github.com/multiformats/go-multicodec yet. But maybe it's also time for that. At least it should be used in examples, from now on, I think.

Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be interesting to send a draft PR to update the ipld-docs Go examples to use this new API, to see how it affects our very first basic use case.

It would also be useful as a way to quickly visualize how the new API compares to the old one for basic use cases. You did add new examples here, but since there weren't previous versions of that same code, I can't do a side by side comparison.

//
// In a real program, you'll probably make functions to load and store from disk,
// or some network storage, or... whatever you want, really :)
var store = storage.Memory{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I think each separate example will only render as its example func body, so the references to "store" might not show up well. You might want to use local store variables instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, for the sake of keeping the example simple, I'd use &storage.Memory{} and avoid (&store).xxx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I think both examples need to be a single one; otherwise, running only the Load one will fail. This will be a problem for many scenarios like pkg.go.dev and godoc.org, where you can click "Run" on a single runnable example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, boy. Yeah. I ported this example somewhat uncritically. But you're right, it was wrong both now and before; that standalone property is really important. I'll separate these.

...although I'm not really sure how. Is there a good idiom for using the actions of one example as setup for another? If I extract a function, it disappears from the rendered example in godoc again. But the Example* function contract doesn't allow enough parameters to use it directly, either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I wasn't clear - I mean having a single Example function, and putting all the code in there. You can't have two separate example funcs that share state. It would be a bit like having two Test funcs that share global state, but worse, since running a single Example func is their main use case for the web docs.

type LinkSystem struct {
EncoderChooser func(LinkPrototype) (Encoder, error)
DecoderChooser func(Link) (Decoder, error)
HasherChooser func(LinkPrototype) (hash.Hash, error)
Copy link
Contributor

@mvdan mvdan Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EncoderChooser and HasherChooser are almost always used together. Have you thought about a way to make that easier? Maybe join the two "choosers" into a single func, or maybe add a helper method to simplify calling both at once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's what LinkSystem.Store(...) is 😅

The idea is that users can set these, but will almost never use them directly.

I should finish adding a documentation block to the LinkSystem declaration which describes this.

I debate maybe splitting this into a LinkSystem type that has the synthesized high level methods, and a LinkSystemConfig type that has these fields and is used to build it, but is thereafter unseen. Or just unexporting these fields and going with a constructor function. Seemed kind of heavyweight though? And then we'd end up wanting a bunch of (*LinkSystem).With{Variation}(...) LinkSystem constructor variants, too, I suspect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, really. If Store is meant to be what most people are going to use, then SGTM.

//
// store := storage.Memory{}
// lsys.StorageReadOpener = (&store).OpenRead
// lsys.StorageWriteOpener = (&store).OpenWrite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit weird to have to supply a storage reader and a storage writer as separate entities. When supplying both, it's going to be very common that they'll both come from the same implementation, like storage.Memory here.

So it would seem more natural to me to define an interface, and be able to supply both read and write ends in one go. For example:

store := storage.Memory{}
lsys = lsys.WithStore(store)

Copy link
Collaborator Author

@warpfork warpfork Feb 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to try this way. It's close to what we've got already; and are several reasons I'd like to use functional interfaces and be able to set these separately:

Reason 1: It's true that when supplying both, they often come from the same implementation... but I really like not having to supply both. Making something read-only by not providing it an ability to write sits well with me. It's a "capabilities" style system, and I like those: the lines are very clear.

Reason 2: Functions as interfaces makes it easier to decorate one or the other (or both), because when they're just functions, it can be done inline. By contrast, a whole type that has methods attached to satisfy an interface has to be a top-level declaration. It's also harder to decorate just one of the several functions on an interface; doing so tends to cause one to end up with a whole suite of decorator wrapper types. I think it's been not uncommon so far for us to want to decorate one of these methods but not care much about the other.

Reason 3: An interface, because of how it has the ability to grow, makes it more easier for something to evolve to become stateful. I think that would be a bad direction of growth for these data in and out primitives. These should be really simple and really functional.

These reasons 2 and 3 may sound a little hypothetical, but I think it's what we can observe actually happening in the previous generation of interfaces for this domain: go-ipfs-blockstore grew in both these directions over time, and while the result hasn't been wrong per se, it also hasn't been very satisfying and it hasn't stayed simple; I seem to hear grumbling from people who look at that interface for long. I don't want to set ourselves up for drift in that same direction again.

Reason 4: This pattern just plain has worked well for me in the past, and I think functional interfaces are underutilized in golang :D

I'd revisit this if we did a design review in depth on the blockstore concepts, and found with confidence that we wanted a certain sequence of methods to always be available in a bundle together; or, if we have some ideas where we'd actually want to be able to do feature detection on the storage system.

At the moment, my thinking is that I don't think we'll want the ability to do further feature detection on these values in any of the places we use a LinkSystem. For example: will the traversal package ever want to check if the storage system supports GC (which is one of the other interfaces that have appeared around blockstorage)? I think it shouldn't. Things that just do loading and storing should be pretty logically unaware of anything but the read and write operations.

The best example I can think of for wanting feature detection is probably the https://pkg.go.dev/github.com/ipfs/go-ipfs-blockstore#Viewer interface, which wants to check if certain performance shortcuts are available. But, we've also seen that to be doable by checking the io.Reader returned by these functions for other features. So we probably don't need to be preparing to do things like that in two separate places.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a middle position on this -- let's try functional interfaces for now (this is also the smallest change from the current Loader and Storer system); but we stay extremely ready to change this as folks work on moving go-ipfs over to these libraries? We should get good and hard info from that process, and if we discover the need for more feature detection on storage systems during that work, then we do it. We could defer tagging a new version until we've got enough experience from that work to decide more confidently.

Comment on lines +13 to +34
StorageWriteOpener BlockWriteOpener
StorageReadOpener BlockReadOpener
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unrelated, and perhaps a bit of a nit: I don't think func fields should be named like interfaces. It's far more natural to call lsys.ChooseEncoder(...) or lsys.EncoderChoose(...) than lsys.EncoderChooser(...). The result of that call is an encoder, not an encoder chooser :) A bit like if strings.Repeat were called strings.Repeater.

Interfaces are different, because you call methods on them, and it's good to differentiate the interface name from the method name. You just have one name here, so in my opinion it should be what best fits the method-like function calls.


// Now: time to apply the LinkSystem, and do the actual store operation!
lnk, err := lsys.Store(
ipld.LinkContext{}, // The zero value is fine. Configure it it you want cancellability or other features.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how I feel about the simple cases having to type this out :) I get that it's the case with context.Context too, but in most cases you do end up having a ctx parameter or variable you can easily reuse. Not really the case with a custom ipld.LinkContext.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd like to be able to get rid of this. I think we'd need more methods for it, though? If you have naming suggestions, I'd be happy to add them.

We can also add more methods later.

// We'll need to decide what in-memory implementation of ipld.Node we want to use.
// Here, we'll use the "basicnode" implementation. This is a good getting-started choice.
// But you could also use other implementations, or even a code-generated type with special features!
np := basicnode.Prototype.Any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we make nil equal basicnode.Prototype.Any?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, unfortunately :( Cycle problems would ensue :(

np, // The NodePrototype says what kind of Node we want as a result.
)
if err != nil {
panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you declared Must methods, so you could also use them here to keep the example shorter

@rvagg
Copy link
Member

rvagg commented Feb 9, 2021

what are the implications for external codecs, the two dag-pb ones in particular?

@warpfork
Copy link
Collaborator Author

warpfork commented Feb 9, 2021

what are the implications for external codecs, the two dag-pb ones in particular?

Smol patch to the init function when they register themselves. There's just a map now, rather than a registration function.

Although I wonder if I should switch it back to a registration function. It doesn't make a ton of difference, but it would be somewhat more evolvable in the future.

I think that's about it. None of the semantics around codecs really changed too much; they were already sanely encapsulated, it's just the Link stuff above it that was previously twisty.

linking.go Outdated
if err != nil {
return err
}
tee := io.TeeReader(reader, hasher)
Copy link
Collaborator Author

@warpfork warpfork Feb 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hannahhoward you might wanna be aware: I didn't carry forward the byteAccessor feature detector. That might be a "yet" thing -- I'd like to try getting along without it -- but do you know if we'll be in a good place to detect a performance regression in the area that made us introduce this the first time?

I'm weakly betting it might less relevant now because the hasher doesn't force excessive copies after this rework, but, it's only a bet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to request we put this back in before merge.

I realize the theoretical benefits in working with io.Reader data. However, in practice, 99% of practical use cases currently involve reading plain bytes in a bytes.Buffer while loading DagPB data.

The reason for this is two-fold:

  • sadly the vast majority of data we work with in Filecoin and soon IPFS using go-ipld-prime is DagPB. Put simply, we're working with UnixFS v1 files (the cbor data in filecoin mostly uses cbor-gen)
  • the vast majority of interfaces that feed into the io.Reader here return byte arrays. This includes go-datastore and go-ipfs-blockstore (well it returns blocks, but they have byte arrays underneath), and network messages in go-graphsync which have byte arrays at the block level. We should in the future look at the abstractions around reading data off disk -- it makes more sense that one data they would lazy load on read. But it is not that day.

This optimization is one part of a two part optimization. The other part is here -- https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go#L25. Taken together, these optimizations do not avoid extra byte array copies -- they avoid copies entirely. When working with protobuf data, which is again, unfortunately, 99% of the practical use case of this library so far, it's pretty important we don't incur an extra copy hit for using an io.Reader interface.

We detected the memory issues transferring large protobuf files in Filecoin. It's perhaps on us we didn't build effective reproducible benchmarks of this specific phenomena, but also, these complex scenarios are often. Nonetheless, it puts a lot more work to put on downstream teams detecting a likely performance regression in the name of maintaining purity of implementation on a hunch that this will fix it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's back in. Reintroduced with 08f13e5.

@warpfork
Copy link
Collaborator Author

After stewing on this for a while: a couple more changes I want to squeeze in here:

  • I'm thinking I probably should move this back to having registration functions instead of just the bare map for multicodecs and multihash registration. This is a weakly held opinion, but I think having the ability to interject functionality here later might be good.

  • I'm going to move the multihash registry to a new package, rather than lumping it in with the multicodec stuff.

    • and I think I should probably add some sub-package of that which make it possible to adapt relatively-well-known things like sha3 into the multihash registry, so one can just import that package and have the desired effects happen without a fuss. Although I don't really know how to make this jive with go mod's granularity of understanding dependencies :(

@@ -15,27 +14,26 @@ import (
)

func TestRoundtripCidlink(t *testing.T) {
lb := cidlink.LinkBuilder{cid.Prefix{
lp := cidlink.LinkPrototype{cid.Prefix{
Version: 1,
Codec: 0x71,
MhType: 0x17,
MhLength: 4,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this not be inferred from the type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ not ( :) it's an adapter type, not an interface cast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i was wanting the apparently deprecated '-1' to indicate default length because a lot of these hashes have a single length that makes sense, and i don't want to be remembering / fumble that sha1 should be 20 bytes while sha224 should be 28 bytes every time i make one of these.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sorry, I thought the comment was on the diff line, since that's what github highlighted most brightly.

Yeah, agree. I wish the go-cid and/or go-multihash libraries were more friendly about this very common user story.

I think a -1 should flow through and do whatever go-multihash does, still. And I have no idea why that's deprecated, fwiw. (A lot of things in go-multihash seem deprecated without much comment on why or what to do instead. I think some review and renovation of that is overdue.)

It's slightly on the other side of where I'm cordoning my renovation today, though.

willscott added a commit to ipfs/go-ipld-git that referenced this pull request Feb 24, 2021
waiting for working link system (ipld/go-ipld-prime#143) to test sha1
(currently link system doesn't do the hashing)
Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two requested changes, one blocking and one I'd like to have a conversation about. Suffice to say, overall this is great, and registering strong objections to two small parts should not be confused with general disapproval of the overall approach, which I really, really like.

I would like the byte accessor method to be put back before this is merged. I would like not to be in the habit of removing features in refactors unless there is a very strong motivating factor and some demonstration it won't affect downstream consumers negatively.

I am not a fan of the HasherChooser concept and have put my suggestions below. I'm open to a conversation about why this is neccesary but I definitely do not want to create a second hash table outside of go-multihash which may not match with what's there.

linking.go Outdated
if err != nil {
return err
}
tee := io.TeeReader(reader, hasher)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to request we put this back in before merge.

I realize the theoretical benefits in working with io.Reader data. However, in practice, 99% of practical use cases currently involve reading plain bytes in a bytes.Buffer while loading DagPB data.

The reason for this is two-fold:

  • sadly the vast majority of data we work with in Filecoin and soon IPFS using go-ipld-prime is DagPB. Put simply, we're working with UnixFS v1 files (the cbor data in filecoin mostly uses cbor-gen)
  • the vast majority of interfaces that feed into the io.Reader here return byte arrays. This includes go-datastore and go-ipfs-blockstore (well it returns blocks, but they have byte arrays underneath), and network messages in go-graphsync which have byte arrays at the block level. We should in the future look at the abstractions around reading data off disk -- it makes more sense that one data they would lazy load on read. But it is not that day.

This optimization is one part of a two part optimization. The other part is here -- https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go#L25. Taken together, these optimizations do not avoid extra byte array copies -- they avoid copies entirely. When working with protobuf data, which is again, unfortunately, 99% of the practical use case of this library so far, it's pretty important we don't incur an extra copy hit for using an io.Reader interface.

We detected the memory issues transferring large protobuf files in Filecoin. It's perhaps on us we didn't build effective reproducible benchmarks of this specific phenomena, but also, these complex scenarios are often. Nonetheless, it puts a lot more work to put on downstream teams detecting a likely performance regression in the name of maintaining purity of implementation on a hunch that this will fix it.

// If more than one package registers for the same multicodec indicator, and
// you somehow end up with both in your import tree, and yet care about which wins:
// then just don't use this registry anymore: make a LinkSystem that does what you need.)
var MultihashRegistry = make(map[uint64]func() hash.Hash)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to understand more the rationale for HasherChooser and this registry, along with the code in LinkPrototype.BuildLink

It feels like a reinvention of go-multihash. How do I know every codec already registered with go-multihash is registered here? What if they are missing and I don't expect that?

Why does BuildLink take a hash sum instead of the bytes of the whole node? Instead of writing to a hash function returned by HasherChooser, why not just write to a byte buffer and then give the bytes to buildLink which can then just call cid.Prefix.Sum in the cidLink implementation?

I feel like this pushes us further into the "reinvent low level libraries" direction. While I'm open enough to the possibility of a go-cid rewrite to allow for a Link abstraction (though I'm unconvinced still about the trade-off in refactor effort, even with the link interface abstraction), I'm definitely opposed to trying to rewrite go-multihash, which has only a few dependencies. In any case, if you have Link.BuildLink take all the bytes rather than a hash, you remove all these hash library depedencies any time you want to import the codec table for example.

Copy link
Collaborator Author

@warpfork warpfork Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. go-multihash is frankly awful.

I covered some of this in the initial PR message, but with greater 🔥 :

go-multihash makes it very difficult to handle multihash identifiers in a forward compatible way (impossible, in fact), which is ridiculous. It has direct unconditional transitive dependencies which many projects don't need, which adds friction to adopting IPLD as a whole, and needs to be fixed. It doesn't work streamingly, which is just ridiculous. It doesn't provide a way to use a bog-standard hash.Hash without pushing patches (and a surprising number of them, because the core abstractions are terrible) directly into that library (I take part of this one back; it's just more steps than I expected, so I initially didn't see the path.), which is also ridiculous. It actively rejects multihash indicator numbers that it doesn't recognize, which makes its the opposite of future proof, and in this domain that's somewhat ridiculous. It doesn't even connect with go-cid particularly well (it predates a refactor which uses strings for immutability and map-key reasons), which is (ready?) ridiculous. There's very little that's good about the current state of go-multihash.

I did not completely remove go-multihash from the dependency tree in this diffset yet, but it's close, and I probably should, either now or soon. One of the recommendations I got from Jeromy when looking at the BuildLink method is that we could just cut out the middleman go-mulithash code, assemble the bytes ourselves, and use one of the go-cid parse methods. It's a somewhat staggering idea, but would probably honestly be more efficient than some of the dancing go-multihash does right now.

I don't doubt that this migration could be bumpy, because it does mean less things are registered by default. But I'm highly confident that getting a registry in this area somewhere is the way forward, and has to happen sometime (JS has already gone the direction of reworking these primitives so dependencies are optional, and we should too); and I think this is probably going to be one of the easier times to do it.

I'll add some more packaging to this to make the migration easier -- I'm thinking of making a go-ipld-prime/multihash/all package which, when imported, makes sure everything currently registered by go-multihash is registered here too. Any application that's a data broker that needs to support all the hashes can import that and be done. Hopefully that and a good changelog message will be enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'll confess, I didn't start out with this much aggression towards go-multicodec. It developed during the writing of this PR, when I got hands-on with it. I did not expect it to be so hard to take an interface draft using golang's standard-since-the-first-release-of-golang hash.Hash and connect it with doing hashing using a library with the word hash in the name. And it. was. awful. Hair-pulling, screaming-at-the-screen awful. Simple things should be simple. This should have been, and was not.)

It's possible that we should recuse up to renovating the go-multihash repo. However, what I came to in the course of working on this PR seems to address most of the painpoints in a way that's low impact to downstreams, other than requiring a new import statement. And there's a lot of reasons I'm happy with it:

  • I don't think the upsides of dynamic registration are realizable with anything less.
  • I actually got this PR done. I don't think changing go-multicodec is on the same timescale.
    • And I don't really want to couple this change to that one, because I want us to be able to start writing new code against LinkSystem now.
  • If we do push these semantics back towards go-multihash later: the effect to downstreams of go-ipld-prime can be as mild as changing their import statement from go-ipld-prime/multihash/all to go-multihash/register/all.
    • Or less: the go-ipld-prime/multihash/all package can just itself be reduced to an import that passes it on, leaving downstream users will no breaking changes at all.

So: are there smells of workaround here: yes. But I think they're minimal and manageable; and the workarounds that would've been generated in this PR continuing to use go-multihash directly would actually be larger and less manageable.

This significantly reworks how linking is handled.

All of the significant operations involved in storing and loading
data are extracted into their own separate features, and the LinkSystem
just composes them.  The big advantage of this is we can now add as
many helper methods to the LinkSystem construct as we want -- whereas
previously, adding methods to the Link interface was a difficult
thing to do, because that interface shows up in a lot of places.

Link is now *just* treated as a data holder -- it doesn't need logic
attached to it directly.  This is much cleaner.

The way we interact with the CID libraries is also different.
We're doing multihash registries ourselves, and breaking our direct
use of the go-multihash library.  The big upside is we're now using
the familiar and standard hash.Hash interface from the golang stdlib.
(And as a bonus, that actually works streamingly; go-mulithash didn't.)
However, this also implies a really big change for downstream users:
we're no longer baking as many hashes into the new multihash registry
by default.
And, make a package which can be imported to register "all" of the
multihashes.  (Or at least all of them that you would've expected
from go-multihash.)

There are also packages that are split roughly per the transitive
dependency it brings in, so you can pick and choose.

This cascaded into more work than I might've expected.
Turns out a handful of the things we have multihash identifiers for
actually *do not* implement the standard hash.Hash contract at all.
For these, I've made small shims.

Test fixtures across the library switch to using sha2-512.
Previously I had written a bunch of them to use sha3 variants,
but since that is not in the standard library, I'm going to move away
from that so as not to re-bloat the transitive dependency tree
just for the tests and examples.
// This should never be done to make behavior alterations
// (hash functions are well standardized and so is the multihash indicator table),
// but may be relevant if one is really itching to try out different hash implementations for performance reasons.
var Registry = make(map[uint64]func() hash.Hash)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reality is there are a bunch of 3rd party hashes that end up in the current multihash registry in ipfs world and won't conform to hash.Hash

How do we expect to support that backwards compatibility?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all the terrible glue code I just wrote. ┐_(ツ)_┌

See the new gadgets in multihash/register/sha3 for handling Shake, for example.

(Like actually, review appreciated. I just utterly invented a BlockSize which is nonsensical. My theory is that it's harmless.)

I'll admit, I had kind of a discovery process today. I didn't realize how many things in golang don't conform to hash.Hash directly.

At the same time: I think still more things conform to hash.Hash than conform to anything else, so I think sticking to that as the centroid is probably the best option available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again - isn't the reality that we need to take existing 3rd party codecs currently being registered with mulithash, and be able to keep them working after this transition? not "write a custom shim for each case" but be backwards compatible for them? If that grows to a week+ of work, it's less likely to make it into this refactor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, "3rd party" in the sense of beyond-the-third-party-things-already-directly-depended-on-by-go-mulitcodec. Mmm. Okay, so, I think I had stated earlier that this was impossible, and I think I walk that back. On further look, it seems possible; I just didn't see how at first, because it requires multiple steps, not just the so-called register function in that library.

But I still suspect, looking over the flavor of some of the issues in that repo, and how issues and PRs seem to focus on getting new things added to the library, that the number of third-party registrations made in the wild is asymptotically close to zero.

So I think at the moment: we can functionally disregard the possibility. And if there are any applications out there which hit a speedbump here, they hit it when converting to go-ipld-prime... which is probably already in their roadmap as a speedbump, so there shouldn't be a category-change in speedbumpism, and so as a result I think everyone involved in such a hypothetical will probably survive.

Longer run: yeah, conversation is needed about how to re-reconcile this with go-multihash. I think a bigger venue than this PR is going to be needed for that, though.

@Stebalien
Copy link

@warpfork I agree with your points but your conclusion that "go-multihash is awful" is overblown:

  1. We need a separate core package.
  2. We shouldn't require that the hash function be registered to parse a multihash. This is basically just a matter of getting rid of the ValidCode check in Cast.

Given those two changes, everything should "just work", right?

@warpfork
Copy link
Collaborator Author

Okay, "go-multihash is awful" may be a little too spicy of a take. It's... improvable.

I didn't initially intend to go this deep on the multihash topic in this PR at all; it was quite a surprise to me that it became entangled. It was my sheer surprise on that which made me initially rather cranky.

I agree with those two counts of key changes, but would word them slightly differently and add a third (or forth depending on how you want to count them):

  1. Importing the multihash registry needs to not bring on transitive dependencies. (Same statement, just goal-oriented rather than implementation path oriented.)
  2. Yep. These types really need to support just quietly being data carriers. 100% agree.
  3. (maybe this is 2a) The registration system should really be single-step. IIUC, to register something in go-multihash right now, one should both put new numbers in one map, name mappings in another map, and then use the registration function. The first and third steps are load-bearing and must happen in that order; the second (afaict) isn't load-bearing. So, I'd do slightly more than drop the ValidCode check in Cast if revisiting this. The whole library should decide what it's actually offering, and then do so consistently.
  4. The hashing interface should be streaming. When I hear the word "hash", I think "ah, a function that can compress an unlimited amount of data into a hashsum". Streaming is in the definition. go-mulithash removes the ability to work streamingly. It's just too odd. Helper methods to turn things into a single function call instead of two when you've got one big chunk? Sure. But not exposing the streaming API at all? Oi.
  5. (maybe this is 4a) The API go-multihash exposes is often less performant and forces more allocs than hash.Hash. (It skipped the "accept a byte slice to let the caller do reuse" thing, etc.) Minor? Sure. But just self-inflicted pain that's extremely avoidable by just going with the standard things.
  6. A conversation should probably be had about how go-cid and go-multihash relate. They're very related in theory; but right now, neither of them offers smooth ways to fit together with the other.

(Shoot, I'm doing monty-python counting here. The number keeps going up.)

In general, I'd really like to move towards hash.Hash, because it solves several problems at once: the streaming problem, some of the unnecessarily-forced alloc problems, and it greatly increases the odds of easy plugging in of new functions. There's just no reason not to do this, that I can see.

We should probably probably move this discussion to a different forum to grow, though. I suspect any changes to go-multihash are going to have to spend as much or more time thinking about change management as is spent thinking about the actual end-goal, meaning it's going to be a nontrivial amount of discussion.

willscott added a commit to ipfs/go-cid that referenced this pull request Mar 3, 2021
Note: this means the PR cannot be merged until ipld/go-ipld-prime#143 is merged and a version of ipld-prime exists
@warpfork
Copy link
Collaborator Author

warpfork commented Mar 4, 2021

General notice: Some other code beyond this repo is now referring to these commits, so I'm going to quit rebasing/amending these commits, and consider them hash-stable and fine to reference. There might be some more diffs to come here, but I'll stack them atop the current commits.

… to use it.

I'm still not fully convinced this is always a win for performance,
but as discussed in
#143 (comment) ,
there was past evidence, and reproducing that takes work -- so, we're
going to try to be relatively conservative here and keep this logical
branch behavior in place for now.

(The reason this may be worth re-examining is that the change to
hashing interface probably gets rid of one big source of copies.
That doens't answer the holistic performance question alone, though;
it just reopens it.)
@warpfork
Copy link
Collaborator Author

  • go-multihash is updated and all the hasher stuff is moved into there. (Much diff removed. Great success.)
  • detection of Bytes() []byte is back.
  • reintroduced functions wrapping multicodec registration. (Doesn't affect most consumers, but will affect other repos that are registering codecs. Sorry for the churn if you already updated these to use the map form. This was dumb of me.)

I think these were the major remaining concerns, and this is now on track to merge shortly.

warpfork and others added 2 commits March 12, 2021 16:44
Resolves conflicts in go.sum.
(They were conflicts in a textual sense only, but semantically trivial;
application of `go mod tidy` was sufficient.)
@warpfork
Copy link
Collaborator Author

A note about how multihash registration is working in practice at this point: not much has changed. The transitive dependency weight for hashers is still full for any program importing linking/cid.

  • In this repo, we really only use the new go-mulitcodec/core package, which does not bring in all transitive dependencies unconditionally. But...
  • go-cid still dependes on go-multicodec (not go-multicodec/core), so our dependency on that does still bring in all transitive dependencies.

On the plus side, that means no one has to start bothering to import go-multihash/register/all yet if they want exotic hashes as of this PR.

(Some things got better.  Several others in the area still have not.)
…ess matter in LinkSystem.

(It can be hard to intuit this just by reading the code, because some
of the key relevance is actually in *other* functions that might not
be in this repo at all!)
@warpfork
Copy link
Collaborator Author

With even the changelog updated now: woosh. Time to merge!

@warpfork warpfork merged commit f6f7124 into master Mar 12, 2021
@warpfork warpfork deleted the linksystem branch March 12, 2021 16:50
@warpfork
Copy link
Collaborator Author

This is now included within release tag v0.9.0.

@aschmahmann aschmahmann mentioned this pull request May 14, 2021
71 tasks
rvagg pushed a commit to ipfs/go-ipld-git that referenced this pull request Jul 29, 2021
waiting for working link system (ipld/go-ipld-prime#143) to test sha1
(currently link system doesn't do the hashing)
rvagg pushed a commit to ipfs/go-ipld-git that referenced this pull request Jul 29, 2021
waiting for working link system (ipld/go-ipld-prime#143) to test sha1
(currently link system doesn't do the hashing)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants