-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce LinkSystem #143
introduce LinkSystem #143
Conversation
(This isn't quite ready to merge yet, but it's getting close. When it does merge, I think it's going to be grounds for tagging a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, big fan of this setup 👍
// Functions in the traversal package will set this automatically. | ||
ParentNode Node | ||
|
||
// REVIEW: ParentNode in LinkContext -- so far, this has only ever been hypothetically useful. Keep or drop? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree that it's unclear the special case here is useful. either the path should be enough or we would need arbitrary ancestor / dag access from the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've ended up squashing most of my work-in-progress commits together, so it's now invisible, but it might be interesting to note: I originally did comment this out entirely... but by the time I finished migrating the traversal
package, it turned out so easy to keep it that I did.
So, this doesn't really argue against dropping it. But it turned out not much work (at the moment, anyway) to keep it, either.
linking.go
Outdated
// like multihashType, multicodecType, and cidVersion, for example. | ||
type LinkBuilder interface { | ||
Build(context.Context, LinkContext, Node, Storer) (Link, error) | ||
func (ls *LinkSystem) Load(lnkCtx LinkContext, lnk Link, np NodePrototype) (Node, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe define these with the definition of LinkSystem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one design thing this setup leaves open that is a bit worrying is that as it gets used for applied subsets of IPLD dags, we'll end up with implementations that don't fully fill out LinkContext, and will be fine for their initial use case, but will then end up in designs where tools aren't generally applicable to IPLD data because of that specificity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super worried? The documentation does make an explicit statement that a zero-value LinkContext is to be considered acceptable. Also, that values in a LinkContext should only ever affect data availability, not actual content.
I can't enforce this programmatically, of course; that would require some kind of magic. But: that's sort of inevitable. Even if we took out LinkContext entirely, we'd still need a context.Context here... and that still has the arbitrary attachment feature. (Which is actually desired. I fully expect someone to use this to tunnel auth credentials, at some point.)
I don't really expect people to have a strong incentive to do unlawful/inadvisable things here, either.
(Although... speaking of "only data availability effects"... that does remind me that we need to get a lot more buckled down on well-typed errors throughout this whole library.)
StorageWriteOpener BlockWriteOpener | ||
StorageReadOpener BlockReadOpener |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review wanted on naming here. (Both the functions, and the field names.)
@hannahhoward had previously commented on them being a bit of a mouthful, but rather liking that they mention "blocks" now. I think I've improved the names a bit since the draft she was looking at, but they're still a mouthful.
Probably I should s/Storage/Block/
on the field names, so they're all just consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat unrelated, and perhaps a bit of a nit: I don't think func fields should be named like interfaces. It's far more natural to call lsys.ChooseEncoder(...)
or lsys.EncoderChoose(...)
than lsys.EncoderChooser(...)
. The result of that call is an encoder, not an encoder chooser :) A bit like if strings.Repeat
were called strings.Repeater
.
Interfaces are different, because you call methods on them, and it's good to differentiate the interface name from the method name. You just have one name here, so in my opinion it should be what best fits the method-like function calls.
... Okay, I think this actually is ready for review after all. There's one more inconsistency I'd like to resolve, but will maybe punt to a subsequent PR: some additional hash functions from non-stdlib are still included in the new multicodec table (specifically, sha3), which kind of runs contrary to the goal of moving towards "only what you need" in the dependency tree. They're there because... I had used them in examples. I'd like to update those examples, but I think I'd also like to update them to sha2-384 while I'm at it, which... pends multiformats/multicodec#205 . |
I didn't start using https://github.com/multiformats/go-multicodec yet. But maybe it's also time for that. At least it should be used in examples, from now on, I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be interesting to send a draft PR to update the ipld-docs Go examples to use this new API, to see how it affects our very first basic use case.
It would also be useful as a way to quickly visualize how the new API compares to the old one for basic use cases. You did add new examples here, but since there weren't previous versions of that same code, I can't do a side by side comparison.
// | ||
// In a real program, you'll probably make functions to load and store from disk, | ||
// or some network storage, or... whatever you want, really :) | ||
var store = storage.Memory{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I think each separate example will only render as its example func body, so the references to "store" might not show up well. You might want to use local store variables instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, for the sake of keeping the example simple, I'd use &storage.Memory{}
and avoid (&store).xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I think both examples need to be a single one; otherwise, running only the Load one will fail. This will be a problem for many scenarios like pkg.go.dev and godoc.org, where you can click "Run" on a single runnable example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, boy. Yeah. I ported this example somewhat uncritically. But you're right, it was wrong both now and before; that standalone property is really important. I'll separate these.
...although I'm not really sure how. Is there a good idiom for using the actions of one example as setup for another? If I extract a function, it disappears from the rendered example in godoc again. But the Example*
function contract doesn't allow enough parameters to use it directly, either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I wasn't clear - I mean having a single Example function, and putting all the code in there. You can't have two separate example funcs that share state. It would be a bit like having two Test funcs that share global state, but worse, since running a single Example func is their main use case for the web docs.
type LinkSystem struct { | ||
EncoderChooser func(LinkPrototype) (Encoder, error) | ||
DecoderChooser func(Link) (Decoder, error) | ||
HasherChooser func(LinkPrototype) (hash.Hash, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EncoderChooser and HasherChooser are almost always used together. Have you thought about a way to make that easier? Maybe join the two "choosers" into a single func, or maybe add a helper method to simplify calling both at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's what LinkSystem.Store(...)
is 😅
The idea is that users can set these, but will almost never use them directly.
I should finish adding a documentation block to the LinkSystem declaration which describes this.
I debate maybe splitting this into a LinkSystem type that has the synthesized high level methods, and a LinkSystemConfig type that has these fields and is used to build it, but is thereafter unseen. Or just unexporting these fields and going with a constructor function. Seemed kind of heavyweight though? And then we'd end up wanting a bunch of (*LinkSystem).With{Variation}(...) LinkSystem
constructor variants, too, I suspect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you, really. If Store is meant to be what most people are going to use, then SGTM.
// | ||
// store := storage.Memory{} | ||
// lsys.StorageReadOpener = (&store).OpenRead | ||
// lsys.StorageWriteOpener = (&store).OpenWrite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit weird to have to supply a storage reader and a storage writer as separate entities. When supplying both, it's going to be very common that they'll both come from the same implementation, like storage.Memory here.
So it would seem more natural to me to define an interface, and be able to supply both read and write ends in one go. For example:
store := storage.Memory{}
lsys = lsys.WithStore(store)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to try this way. It's close to what we've got already; and are several reasons I'd like to use functional interfaces and be able to set these separately:
Reason 1: It's true that when supplying both, they often come from the same implementation... but I really like not having to supply both. Making something read-only by not providing it an ability to write sits well with me. It's a "capabilities" style system, and I like those: the lines are very clear.
Reason 2: Functions as interfaces makes it easier to decorate one or the other (or both), because when they're just functions, it can be done inline. By contrast, a whole type that has methods attached to satisfy an interface has to be a top-level declaration. It's also harder to decorate just one of the several functions on an interface; doing so tends to cause one to end up with a whole suite of decorator wrapper types. I think it's been not uncommon so far for us to want to decorate one of these methods but not care much about the other.
Reason 3: An interface, because of how it has the ability to grow, makes it more easier for something to evolve to become stateful. I think that would be a bad direction of growth for these data in and out primitives. These should be really simple and really functional.
These reasons 2 and 3 may sound a little hypothetical, but I think it's what we can observe actually happening in the previous generation of interfaces for this domain: go-ipfs-blockstore grew in both these directions over time, and while the result hasn't been wrong per se, it also hasn't been very satisfying and it hasn't stayed simple; I seem to hear grumbling from people who look at that interface for long. I don't want to set ourselves up for drift in that same direction again.
Reason 4: This pattern just plain has worked well for me in the past, and I think functional interfaces are underutilized in golang :D
I'd revisit this if we did a design review in depth on the blockstore concepts, and found with confidence that we wanted a certain sequence of methods to always be available in a bundle together; or, if we have some ideas where we'd actually want to be able to do feature detection on the storage system.
At the moment, my thinking is that I don't think we'll want the ability to do further feature detection on these values in any of the places we use a LinkSystem. For example: will the traversal package ever want to check if the storage system supports GC (which is one of the other interfaces that have appeared around blockstorage)? I think it shouldn't. Things that just do loading and storing should be pretty logically unaware of anything but the read and write operations.
The best example I can think of for wanting feature detection is probably the https://pkg.go.dev/github.com/ipfs/go-ipfs-blockstore#Viewer interface, which wants to check if certain performance shortcuts are available. But, we've also seen that to be doable by checking the io.Reader
returned by these functions for other features. So we probably don't need to be preparing to do things like that in two separate places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a middle position on this -- let's try functional interfaces for now (this is also the smallest change from the current Loader
and Storer
system); but we stay extremely ready to change this as folks work on moving go-ipfs over to these libraries? We should get good and hard info from that process, and if we discover the need for more feature detection on storage systems during that work, then we do it. We could defer tagging a new version until we've got enough experience from that work to decide more confidently.
StorageWriteOpener BlockWriteOpener | ||
StorageReadOpener BlockReadOpener |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat unrelated, and perhaps a bit of a nit: I don't think func fields should be named like interfaces. It's far more natural to call lsys.ChooseEncoder(...)
or lsys.EncoderChoose(...)
than lsys.EncoderChooser(...)
. The result of that call is an encoder, not an encoder chooser :) A bit like if strings.Repeat
were called strings.Repeater
.
Interfaces are different, because you call methods on them, and it's good to differentiate the interface name from the method name. You just have one name here, so in my opinion it should be what best fits the method-like function calls.
|
||
// Now: time to apply the LinkSystem, and do the actual store operation! | ||
lnk, err := lsys.Store( | ||
ipld.LinkContext{}, // The zero value is fine. Configure it it you want cancellability or other features. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how I feel about the simple cases having to type this out :) I get that it's the case with context.Context
too, but in most cases you do end up having a ctx
parameter or variable you can easily reuse. Not really the case with a custom ipld.LinkContext
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd like to be able to get rid of this. I think we'd need more methods for it, though? If you have naming suggestions, I'd be happy to add them.
We can also add more methods later.
// We'll need to decide what in-memory implementation of ipld.Node we want to use. | ||
// Here, we'll use the "basicnode" implementation. This is a good getting-started choice. | ||
// But you could also use other implementations, or even a code-generated type with special features! | ||
np := basicnode.Prototype.Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we make nil
equal basicnode.Prototype.Any
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, unfortunately :( Cycle problems would ensue :(
np, // The NodePrototype says what kind of Node we want as a result. | ||
) | ||
if err != nil { | ||
panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you declared Must methods, so you could also use them here to keep the example shorter
what are the implications for external codecs, the two dag-pb ones in particular? |
Smol patch to the init function when they register themselves. There's just a map now, rather than a registration function. Although I wonder if I should switch it back to a registration function. It doesn't make a ton of difference, but it would be somewhat more evolvable in the future. I think that's about it. None of the semantics around codecs really changed too much; they were already sanely encapsulated, it's just the Link stuff above it that was previously twisty. |
linking.go
Outdated
if err != nil { | ||
return err | ||
} | ||
tee := io.TeeReader(reader, hasher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hannahhoward you might wanna be aware: I didn't carry forward the byteAccessor feature detector. That might be a "yet" thing -- I'd like to try getting along without it -- but do you know if we'll be in a good place to detect a performance regression in the area that made us introduce this the first time?
I'm weakly betting it might less relevant now because the hasher doesn't force excessive copies after this rework, but, it's only a bet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to request we put this back in before merge.
I realize the theoretical benefits in working with io.Reader data. However, in practice, 99% of practical use cases currently involve reading plain bytes in a bytes.Buffer while loading DagPB data.
The reason for this is two-fold:
- sadly the vast majority of data we work with in Filecoin and soon IPFS using go-ipld-prime is DagPB. Put simply, we're working with UnixFS v1 files (the cbor data in filecoin mostly uses cbor-gen)
- the vast majority of interfaces that feed into the io.Reader here return byte arrays. This includes go-datastore and go-ipfs-blockstore (well it returns blocks, but they have byte arrays underneath), and network messages in go-graphsync which have byte arrays at the block level. We should in the future look at the abstractions around reading data off disk -- it makes more sense that one data they would lazy load on read. But it is not that day.
This optimization is one part of a two part optimization. The other part is here -- https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go#L25. Taken together, these optimizations do not avoid extra byte array copies -- they avoid copies entirely. When working with protobuf data, which is again, unfortunately, 99% of the practical use case of this library so far, it's pretty important we don't incur an extra copy hit for using an io.Reader interface.
We detected the memory issues transferring large protobuf files in Filecoin. It's perhaps on us we didn't build effective reproducible benchmarks of this specific phenomena, but also, these complex scenarios are often. Nonetheless, it puts a lot more work to put on downstream teams detecting a likely performance regression in the name of maintaining purity of implementation on a hunch that this will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's back in. Reintroduced with 08f13e5.
After stewing on this for a while: a couple more changes I want to squeeze in here:
|
@@ -15,27 +14,26 @@ import ( | |||
) | |||
|
|||
func TestRoundtripCidlink(t *testing.T) { | |||
lb := cidlink.LinkBuilder{cid.Prefix{ | |||
lp := cidlink.LinkPrototype{cid.Prefix{ | |||
Version: 1, | |||
Codec: 0x71, | |||
MhType: 0x17, | |||
MhLength: 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this not be inferred from the type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
not (
:) it's an adapter type, not an interface cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i was wanting the apparently deprecated '-1' to indicate default length because a lot of these hashes have a single length that makes sense, and i don't want to be remembering / fumble that sha1 should be 20 bytes while sha224 should be 28 bytes every time i make one of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry, I thought the comment was on the diff line, since that's what github highlighted most brightly.
Yeah, agree. I wish the go-cid and/or go-multihash libraries were more friendly about this very common user story.
I think a -1 should flow through and do whatever go-multihash does, still. And I have no idea why that's deprecated, fwiw. (A lot of things in go-multihash seem deprecated without much comment on why or what to do instead. I think some review and renovation of that is overdue.)
It's slightly on the other side of where I'm cordoning my renovation today, though.
waiting for working link system (ipld/go-ipld-prime#143) to test sha1 (currently link system doesn't do the hashing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two requested changes, one blocking and one I'd like to have a conversation about. Suffice to say, overall this is great, and registering strong objections to two small parts should not be confused with general disapproval of the overall approach, which I really, really like.
I would like the byte accessor method to be put back before this is merged. I would like not to be in the habit of removing features in refactors unless there is a very strong motivating factor and some demonstration it won't affect downstream consumers negatively.
I am not a fan of the HasherChooser concept and have put my suggestions below. I'm open to a conversation about why this is neccesary but I definitely do not want to create a second hash table outside of go-multihash which may not match with what's there.
linking.go
Outdated
if err != nil { | ||
return err | ||
} | ||
tee := io.TeeReader(reader, hasher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to request we put this back in before merge.
I realize the theoretical benefits in working with io.Reader data. However, in practice, 99% of practical use cases currently involve reading plain bytes in a bytes.Buffer while loading DagPB data.
The reason for this is two-fold:
- sadly the vast majority of data we work with in Filecoin and soon IPFS using go-ipld-prime is DagPB. Put simply, we're working with UnixFS v1 files (the cbor data in filecoin mostly uses cbor-gen)
- the vast majority of interfaces that feed into the io.Reader here return byte arrays. This includes go-datastore and go-ipfs-blockstore (well it returns blocks, but they have byte arrays underneath), and network messages in go-graphsync which have byte arrays at the block level. We should in the future look at the abstractions around reading data off disk -- it makes more sense that one data they would lazy load on read. But it is not that day.
This optimization is one part of a two part optimization. The other part is here -- https://github.com/ipld/go-ipld-prime-proto/blob/master/coding.go#L25. Taken together, these optimizations do not avoid extra byte array copies -- they avoid copies entirely. When working with protobuf data, which is again, unfortunately, 99% of the practical use case of this library so far, it's pretty important we don't incur an extra copy hit for using an io.Reader interface.
We detected the memory issues transferring large protobuf files in Filecoin. It's perhaps on us we didn't build effective reproducible benchmarks of this specific phenomena, but also, these complex scenarios are often. Nonetheless, it puts a lot more work to put on downstream teams detecting a likely performance regression in the name of maintaining purity of implementation on a hunch that this will fix it.
codec/multicodecs.go
Outdated
// If more than one package registers for the same multicodec indicator, and | ||
// you somehow end up with both in your import tree, and yet care about which wins: | ||
// then just don't use this registry anymore: make a LinkSystem that does what you need.) | ||
var MultihashRegistry = make(map[uint64]func() hash.Hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to understand more the rationale for HasherChooser and this registry, along with the code in LinkPrototype.BuildLink
It feels like a reinvention of go-multihash. How do I know every codec already registered with go-multihash is registered here? What if they are missing and I don't expect that?
Why does BuildLink take a hash sum instead of the bytes of the whole node? Instead of writing to a hash function returned by HasherChooser, why not just write to a byte buffer and then give the bytes to buildLink which can then just call cid.Prefix.Sum in the cidLink implementation?
I feel like this pushes us further into the "reinvent low level libraries" direction. While I'm open enough to the possibility of a go-cid rewrite to allow for a Link abstraction (though I'm unconvinced still about the trade-off in refactor effort, even with the link interface abstraction), I'm definitely opposed to trying to rewrite go-multihash, which has only a few dependencies. In any case, if you have Link.BuildLink take all the bytes rather than a hash, you remove all these hash library depedencies any time you want to import the codec table for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. go-multihash is frankly awful.
I covered some of this in the initial PR message, but with greater 🔥 :
go-multihash makes it very difficult to handle multihash identifiers in a forward compatible way (impossible, in fact), which is ridiculous. It has direct unconditional transitive dependencies which many projects don't need, which adds friction to adopting IPLD as a whole, and needs to be fixed. It doesn't work streamingly, which is just ridiculous. It doesn't provide a way to use a bog-standard hash.Hash
without pushing patches (and a surprising number of them, because the core abstractions are terrible) directly into that library (I take part of this one back; it's just more steps than I expected, so I initially didn't see the path.), which is also ridiculous. It actively rejects multihash indicator numbers that it doesn't recognize, which makes its the opposite of future proof, and in this domain that's somewhat ridiculous. It doesn't even connect with go-cid particularly well (it predates a refactor which uses strings for immutability and map-key reasons), which is (ready?) ridiculous. There's very little that's good about the current state of go-multihash.
I did not completely remove go-multihash from the dependency tree in this diffset yet, but it's close, and I probably should, either now or soon. One of the recommendations I got from Jeromy when looking at the BuildLink method is that we could just cut out the middleman go-mulithash code, assemble the bytes ourselves, and use one of the go-cid parse methods. It's a somewhat staggering idea, but would probably honestly be more efficient than some of the dancing go-multihash does right now.
I don't doubt that this migration could be bumpy, because it does mean less things are registered by default. But I'm highly confident that getting a registry in this area somewhere is the way forward, and has to happen sometime (JS has already gone the direction of reworking these primitives so dependencies are optional, and we should too); and I think this is probably going to be one of the easier times to do it.
I'll add some more packaging to this to make the migration easier -- I'm thinking of making a go-ipld-prime/multihash/all
package which, when imported, makes sure everything currently registered by go-multihash is registered here too. Any application that's a data broker that needs to support all the hashes can import that and be done. Hopefully that and a good changelog message will be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'll confess, I didn't start out with this much aggression towards go-multicodec. It developed during the writing of this PR, when I got hands-on with it. I did not expect it to be so hard to take an interface draft using golang's standard-since-the-first-release-of-golang hash.Hash
and connect it with doing hashing using a library with the word hash in the name. And it. was. awful. Hair-pulling, screaming-at-the-screen awful. Simple things should be simple. This should have been, and was not.)
It's possible that we should recuse up to renovating the go-multihash repo. However, what I came to in the course of working on this PR seems to address most of the painpoints in a way that's low impact to downstreams, other than requiring a new import statement. And there's a lot of reasons I'm happy with it:
- I don't think the upsides of dynamic registration are realizable with anything less.
- I actually got this PR done. I don't think changing go-multicodec is on the same timescale.
- And I don't really want to couple this change to that one, because I want us to be able to start writing new code against LinkSystem now.
- If we do push these semantics back towards go-multihash later: the effect to downstreams of go-ipld-prime can be as mild as changing their import statement from
go-ipld-prime/multihash/all
togo-multihash/register/all
.- Or less: the
go-ipld-prime/multihash/all
package can just itself be reduced to an import that passes it on, leaving downstream users will no breaking changes at all.
- Or less: the
So: are there smells of workaround here: yes. But I think they're minimal and manageable; and the workarounds that would've been generated in this PR continuing to use go-multihash directly would actually be larger and less manageable.
This significantly reworks how linking is handled. All of the significant operations involved in storing and loading data are extracted into their own separate features, and the LinkSystem just composes them. The big advantage of this is we can now add as many helper methods to the LinkSystem construct as we want -- whereas previously, adding methods to the Link interface was a difficult thing to do, because that interface shows up in a lot of places. Link is now *just* treated as a data holder -- it doesn't need logic attached to it directly. This is much cleaner. The way we interact with the CID libraries is also different. We're doing multihash registries ourselves, and breaking our direct use of the go-multihash library. The big upside is we're now using the familiar and standard hash.Hash interface from the golang stdlib. (And as a bonus, that actually works streamingly; go-mulithash didn't.) However, this also implies a really big change for downstream users: we're no longer baking as many hashes into the new multihash registry by default.
And, make a package which can be imported to register "all" of the multihashes. (Or at least all of them that you would've expected from go-multihash.) There are also packages that are split roughly per the transitive dependency it brings in, so you can pick and choose. This cascaded into more work than I might've expected. Turns out a handful of the things we have multihash identifiers for actually *do not* implement the standard hash.Hash contract at all. For these, I've made small shims. Test fixtures across the library switch to using sha2-512. Previously I had written a bunch of them to use sha3 variants, but since that is not in the standard library, I'm going to move away from that so as not to re-bloat the transitive dependency tree just for the tests and examples.
multihash/multihash.go
Outdated
// This should never be done to make behavior alterations | ||
// (hash functions are well standardized and so is the multihash indicator table), | ||
// but may be relevant if one is really itching to try out different hash implementations for performance reasons. | ||
var Registry = make(map[uint64]func() hash.Hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reality is there are a bunch of 3rd party hashes that end up in the current multihash registry in ipfs world and won't conform to hash.Hash
How do we expect to support that backwards compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With all the terrible glue code I just wrote. ┐_(ツ)_┌
See the new gadgets in multihash/register/sha3
for handling Shake, for example.
(Like actually, review appreciated. I just utterly invented a BlockSize which is nonsensical. My theory is that it's harmless.)
I'll admit, I had kind of a discovery process today. I didn't realize how many things in golang don't conform to hash.Hash
directly.
At the same time: I think still more things conform to hash.Hash
than conform to anything else, so I think sticking to that as the centroid is probably the best option available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again - isn't the reality that we need to take existing 3rd party codecs currently being registered with mulithash, and be able to keep them working after this transition? not "write a custom shim for each case" but be backwards compatible for them? If that grows to a week+ of work, it's less likely to make it into this refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, "3rd party" in the sense of beyond-the-third-party-things-already-directly-depended-on-by-go-mulitcodec. Mmm. Okay, so, I think I had stated earlier that this was impossible, and I think I walk that back. On further look, it seems possible; I just didn't see how at first, because it requires multiple steps, not just the so-called register function in that library.
But I still suspect, looking over the flavor of some of the issues in that repo, and how issues and PRs seem to focus on getting new things added to the library, that the number of third-party registrations made in the wild is asymptotically close to zero.
So I think at the moment: we can functionally disregard the possibility. And if there are any applications out there which hit a speedbump here, they hit it when converting to go-ipld-prime... which is probably already in their roadmap as a speedbump, so there shouldn't be a category-change in speedbumpism, and so as a result I think everyone involved in such a hypothetical will probably survive.
Longer run: yeah, conversation is needed about how to re-reconcile this with go-multihash. I think a bigger venue than this PR is going to be needed for that, though.
@warpfork I agree with your points but your conclusion that "go-multihash is awful" is overblown:
Given those two changes, everything should "just work", right? |
Okay, "go-multihash is awful" may be a little too spicy of a take. It's... improvable. I didn't initially intend to go this deep on the multihash topic in this PR at all; it was quite a surprise to me that it became entangled. It was my sheer surprise on that which made me initially rather cranky. I agree with those two counts of key changes, but would word them slightly differently and add a third (or forth depending on how you want to count them):
(Shoot, I'm doing monty-python counting here. The number keeps going up.) In general, I'd really like to move towards We should probably probably move this discussion to a different forum to grow, though. I suspect any changes to go-multihash are going to have to spend as much or more time thinking about change management as is spent thinking about the actual end-goal, meaning it's going to be a nontrivial amount of discussion. |
Note: this means the PR cannot be merged until ipld/go-ipld-prime#143 is merged and a version of ipld-prime exists
General notice: Some other code beyond this repo is now referring to these commits, so I'm going to quit rebasing/amending these commits, and consider them hash-stable and fine to reference. There might be some more diffs to come here, but I'll stack them atop the current commits. |
… to use it. I'm still not fully convinced this is always a win for performance, but as discussed in #143 (comment) , there was past evidence, and reproducing that takes work -- so, we're going to try to be relatively conservative here and keep this logical branch behavior in place for now. (The reason this may be worth re-examining is that the change to hashing interface probably gets rid of one big source of copies. That doens't answer the holistic performance question alone, though; it just reopens it.)
I think these were the major remaining concerns, and this is now on track to merge shortly. |
Resolves conflicts in go.sum. (They were conflicts in a textual sense only, but semantically trivial; application of `go mod tidy` was sufficient.)
A note about how multihash registration is working in practice at this point: not much has changed. The transitive dependency weight for hashers is still full for any program importing
On the plus side, that means no one has to start bothering to import |
(Some things got better. Several others in the area still have not.)
…ess matter in LinkSystem. (It can be hard to intuit this just by reading the code, because some of the key relevance is actually in *other* functions that might not be in this repo at all!)
With even the changelog updated now: woosh. Time to merge! |
This is now included within release tag |
waiting for working link system (ipld/go-ipld-prime#143) to test sha1 (currently link system doesn't do the hashing)
waiting for working link system (ipld/go-ipld-prime#143) to test sha1 (currently link system doesn't do the hashing)
This is the long-awaited address to #55 .
The code is largely based upon the draft earlier published in gist form: https://gist.github.com/warpfork/c0200cc4d99ee36ba5ce5a612f1d1a22
Key feature: there is now an
ipld.LinkSystem
type, and it gives us a place to compose all the different things that go into storing and loading links (getting encoders and decoders; getting hashers; and getting the actual I/O streams)... and then it gives us simpleStore
andLoad
methods that do all the plumbing and present a nice simple interface to end-users.This means the
ipld.Link
interface does much less work. Now it's just a data holder. (Previously it had aLoad
method which did a lot of work, but also was quite convoluted to use because it tied so many things together.)We can also now easily keep adding more helper methods to LinkSystem, without complicating the Link interface or anything else. For example, we already now have a
Fill
method and aLoad
method, which gives the user the option to either populate data into aNodeAssembler
, or, just do the whole the whole operation and return aNode
. The latter (Load
) is essentially syntactic sugar for a sequence ofnb := NewBuilder(); linksystem.Fill(nb); n := nb.Build();
, while the former gives the controller more control over allocations. Previously, adding helper methods like this would've involved changing the (very central)Link
interface, which would be a huge pain because of how many places it appears; now, it's easy to add helper methods like this.The
ipld.Storer
andipld.Loader
function interfaces are nowipld.BlockWriteOpener
andipld.BlockReadOpener
, respectively. They do roughly the same job as before, but figuring out where to use them should be simpler: now you just put both of them in a LinkSystem, and then use that.multihashes: Along the way, I did some some additional work to distance ourselves further from the go-multihash package and the way it forms unconditional dependencies to several other repositories for functionality that end-users might not always want. This was prompted because the LinkSystem design detail of a HasherChooser function prompts us to want to use standard
hash.Hash
APIs. It also gets us the ability to use hashes in a streaming way (stdlibhash.Hash
is streaming; go-multihash is, oddly, not). go-multihash is not removed from our dependency tree (that's going to be a fair bit more slog), so those dependencies are still with us transitively -- but finishing that separation in the future is now much closer to tractable.This means we have new default multihash registry, rather than leaning on what's hardcoded in go-multihash. Upsides: can register new things; can choose your own dependencies. Downsides: this might produce a bit of a bumpy migration, because this registry has fewer things in it by default.
multicodecs: The multicodec registry map also moved. Semantics are pretty much the same, though. It's now in the
codec
package instead of thelinking/cid
package. This'll be a breaking change for any packages implementing codecs and registering themselves in this map (but a very easy one to fix). It means packages implementing codecs can do so without being forced into a transitive dependency on go-cid (and all of its many further-transitive dependencies). (Some codec packages will depend on go-cid anyway -- but that should be their choice, and now it will be.)storage: I added a
storage
package! This contains some (very) basic gadgets that provide functions matching theipld.BlockWriteOpener
andipld.BlockReadOpener
interfaces.The only thing in here right now is a simple in-memory storage heap. We'll still probably expect serious applications to get a more serious storage system, and we'll probably expect those more serious storage systems to live in other repos. But it's nice to have something "batteries included" for demos and for tests.
(Previously there were some functions scattered around the tests and examples that did roughly the same things, but since they were just defined at their use site, it was hard to attach documentation to them or use them as a reference. Now it's easier.)
Most examples are updated. I'm sure we could still use more of them, though.