-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sort map entries marshalling dag-json #203
Conversation
codec/dagjson/marshal.go
Outdated
@@ -31,21 +32,32 @@ func Marshal(n ipld.Node, sink shared.TokenSink, allowLinks bool) error { | |||
if _, err := sink.Step(&tk); err != nil { | |||
return err | |||
} | |||
// Emit map contents (and recurse). | |||
// Collect map entries, then sort by key | |||
type entry struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is declaring a struct in the middle of this naughty, or OK? can I get away without a struct somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's naughty :) certainly better than declaring it globally, unless we need methods on it in the future.
codec/dagjson/marshal.go
Outdated
if err != nil { | ||
return err | ||
} | ||
entries = append(entries, entry{keyStr, v}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the implications of holding on to v
while we collect and sort? Yeah, a perf hit, but what about memory and safety?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface doesn't say whether it's safe to use the value of a Next call after the Next call has happened. Many APIs like badgerdb or Go's own range
will reuse the same memory for the following iterations, meaning that it's on you to make a copy if you want to keep it around for longer.
I think it should be Eric's call to say what the interface semantics should be here. Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Okay. Going to think-out-loud for a first pass here.)
Hrm. I don't think that's actually come up as a trade before, because I think we've always been able to answer with pointers to existing and already escaped-to-heap memory.
The immutability guarantee held by all most of the node implementations so far also made it easy to say it's no problem to hold onto things.
Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.
So this is certainly true, we just happen to not have hit a case where the distinction has been possible yet.
I can imagine in the future we might have ADL implementations which can optimize things by reusing internal buffers. That would hit this. I'm not immediately certain that's going to be a predominating concern in that situation, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think assuming it's safe to hold onto the values is the safer approach, anyway, as going the opposite route might well break some users.
In the future, we could always add opt-in iterator interfaces that take more aggressive shortcuts, like "may share memory", or "allows for more performant codec sorting", etc. But the basic iteration interface is... simple :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Agree. Let's go that route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting here that we probably shouldn't optimise too heavily for the theoretical ADL case here because this is a block codec and if you're trying to serialize a massive HAMT into DAG-{JSON,CBOR} then you're in for a bad time regardless of what optimisations might apply here. I think it's safe-ish to assume we're dealing with maps of a maximum of ~1M, maybe a little more, in the 99.9% case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sorting in codecs -- yeah. It's not particularly important to optimize for someone feeding an ADL into a marshaller. (Even if that's technically allowed, it's... not a usual thing to want to actually do.)
The reason we ended up considering it is because it would have effect on the iterator design.
But all the same, the result here is: yeah, this is fine as written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a sane default behavior, even if it's a bit slow.
We could define an optional interface too, like:
type SortedMapNode interface {
SortedMapIterator() MapIterator
}
Then, any ipld.Node implementation having that method would signal to us that its keys are already provided in a sorted manner, meaning we wouldn't need to buffer them all in memory and sort.
That should be a separate issue, though. Then we could look at implementing it appropriately for codegen and/or bindnode.
codec/dagjson/marshal.go
Outdated
@@ -31,21 +32,32 @@ func Marshal(n ipld.Node, sink shared.TokenSink, allowLinks bool) error { | |||
if _, err := sink.Step(&tk); err != nil { | |||
return err | |||
} | |||
// Emit map contents (and recurse). | |||
// Collect map entries, then sort by key | |||
type entry struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's naughty :) certainly better than declaring it globally, unless we need methods on it in the future.
codec/dagjson/marshal.go
Outdated
if err != nil { | ||
return err | ||
} | ||
entries = append(entries, entry{keyStr, v}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface doesn't say whether it's safe to use the value of a Next call after the Next call has happened. Many APIs like badgerdb or Go's own range
will reuse the same memory for the following iterations, meaning that it's on you to make a copy if you want to keep it around for longer.
I think it should be Eric's call to say what the interface semantics should be here. Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.
Click for some rambling about how different codecs need different sorting and the options available .. off-topic for this particular change.type SortedMapNode interface {
SortedMapIterator() MapIterator
} Sadly this won't cut it as we have different sorting rules for different codecs. DAG-JSON does standard bytewise sorting while DAG-CBOR does length-first bytewise-second sorting. Eric's ideal codec would not do sorting, and that's certainly an option for the future. Something like this might work so you'd get something like a standard type SortedMapNode interface {
SortedMapIterator(less func(i, j ipld.Node) bool)) MapIterator
} But .. messy and complicated, so not just now. |
Codecs could also have an opt-in to disable the sorting, but the default should be as per the spec, IMO. Good point about different codecs. Maybe one interface per codec is an option, too. I feel like we should discuss those opt-in optimizations in a separate thread. |
Okay, so yeah, holistically, I'm gonna concede, fine, let's make the codec sort. Thing one, seems to import to address now: I think we need to put the whole thing in a condition branch, even on this first pass. DAG-JSON should gain this sort. JSON shouldn't. (I guess we don't have tests which have made this obvious, but iirc, the json package depends on the dagjson package, so will be affected unless we make it not so.) We don't have to worry about trying to standardize a codec config interface for this today, I don't think. (That would be nice-to-have, but probably requires more thought.) Just enough of a bool passing so the JSON codec can not exhibit this trait. Thing two, that we're keeping an eye on for later: about the fastpath options: I think agree with the general idea of an outline for a feature detector for fastpath:
But I would like to figure out how we can get this to support memoization, if something is created with a sort, and later asked to yield an iterator for the same sort -- that should be as close to free as possible. I made some speculations on this, but these are actually slightly broken, because they assume golang will let one compare function pointers, which is... not entirely the case. However, I'll share this train of thought anyway, and we can maybe figure out practical routes to it later: We should probably expect to also then form a list of well-known sort funcs, a la:
and then another option for builders:
and then the reason why having all that can pay off: Map node implementations that implement both of these feature detection interfaces can store the |
All the ideas around future opt-in optimizations sound really interesting and I do think we should explore them further. But I also think this PR is probably not the place :) Perhaps my bad for starting that train of thought, but I only meant it as "we're not dropping streaming encodes on the floor forever". +1 to Eric's comment that this should be turned off for the "plain json" encoding, in any case. |
Might have overdone it here? Need feedback. Instead of adding more boolean parameters, I've cleaned up both Marshal and Unmarshal to take MarshalOptions and UnmarshalOptions which have all of the various ways the modes can vary. This also means that instead of piggybacking |
Beaut. Probably the way it always should've been. Stylistic option: you can take what you've done here and go one tiny step further: one can attach a method to a struct, and use that rather than an argument to pass the config. I kinda like doing this for config, because it frees up the namespace to have a function of the same name at package scope, which can have default behavior. So, roughly:
(I started moving in this direction in the I think I like this pattern. That said, it's stylistic. It also might not matter much to be decisive at the moment; we still seem to have most consumers of the library going only through the zero-config multicodec access pattern, which means we might still have relative freedom to iterate on this config stuff for a while. Sidenote on naming, in case it's been confusing: stuff is called "marshal"/"unmarshal" when... it's older, really. I've started dropping that nomenclature and using encode/decode exclusively. ("Marshal"/"unmarshal" is a convention in golang. But I'd rather have terminology that's consistent with our language-agnostic IPLD docs which revolve around "codec"... so, "encode"/"decode" is better from that viewpoint.) I haven't gone on a rampage to remove the old nomenclature. But if you happen to be renovating stuff and it seems low-value to keep it around... feel free to drop functions by those names, IMO. (...presuming that they already have an alternative in the "decode"/"encode" naming convention, of course. But I'm pretty sure they all do.) |
Please deprecate instead :) I tripped over Encoder vs Encode just yesterday, still.
Fully agreed there. It's what the new encoding/json experiment has gone for, as well:
(Ignore the EncodeOptions parameter, as json wants to separate the semantic "marshal options" from the syntactic "encode options") |
Yes, I guess I should clarify to say: if you'd be in the position of making a breaking change anyway, then maybe consider dropping. Otherwise, it is often now worth maintaining some facades (since we're generally at the bottom of some deep dependency trees, frequently subject to diamond problems, etc).
Oh, nice. That sounds like the same distinction I was thinking to try to make with those words these days: "marshal" to regard object<->serial/token mapping (or, in IPLD's case, Data Model rather than jumping straight to token); "encode" to regard serial/token<->encodedbytes. |
ok, gross and weird Golang config/options handling patterns aside, how's this PR looking for merge as it stands? The way it's looking to me right now is that |
Let me count the ways that you're going to hate this @warpfork ... but this is what everyone else has agreed the spec should do, it's what go-ipfs is doing with its existing encoding/json usage so existing test fixtures all make this assumption, and this opens the way to get some cross-impl test fixtures in place.
As for my impl, yeah .. that might take some finessing.
Builds on #202 so whitespace removal is in here already.