Implement Consistent Varint Encoding #340
Comments
Implementation proposal, please. |
@acruikshank took a look, needs more info. @dignifiedquire will elaborate. |
@acruikshank could you provide some questions on what is unclear please? |
@dignifiedquire I was slightly mis-remembering when I said this was unclear (since part of the task is to clarify). The only real question is the scope of this change. There are many places we encode values. Are we only concerned with ints that will ultimately live on-chain here? That would include message parameters and anything that stays in actor memory. If that's it, I think this is ready to start (researching). |
We are concerned with all values that are ending up on the chain and also all the ones on disk, as it will make life easier in the future for us. |
@dignifiedquire (and anyone else knowledgeable about this area):
Edit: for clarity |
I started digging into this today and figured I'd surface some initial thoughts for feedback/input. This is new territory for me so please poke holes and point out oversights in what follows.
There's more but it doesn't rise to the level of discussion at this point and depends on input on the above.... |
Also:
|
Two ideas on this:
good catch, we should choose a single option and make sure to add it to the spec and enforce it.
What exactly do you mean with custom types? |
OK. The fact that these "standards" are like <= 40 lines of code, and therefore easy for anyone to implement, inclines me to pick the best one, as opposed to the one most people have heard of. But it bears a bit more investigation.
Potential optimizations noted. My instinct is to go the other, simpler direction though: define exactly two types (one varint, one varuint) with a reasonable max size, say 256 bits, and just use that everywhere until we have a good reason to introduce a bunch of types. I'm asking for feedback from everyone: are there reasons to prefer defining different-size varints (say, varint32, varint64, varint128, varint256) rather than just defining one size (say, var(u)int256) and being done with it for now? I'm having trouble marshaling an argument in favor of many types -- it seems slightly safer in the sense that fields and parameters are more guaranteed to be right-sized, but it's hard for me to know how much safer that makes us or if there are other reasons to prefer lots of types... Thoughts?
The idea would be to introduce |
There's a lot going on here, so I'll just comment on the explicit question above. Abstractly, I favor minimizing the number of variable-sized formats. Just one for signed and one for unsigned sounds good to me. The alternative feels like premature optimization with a complexity cost (and especially the overhead of forcing design decisions up-front) which undermines some benefits of adopting a flexible and future-proof format in the first place. |
I'm ambivalent on having multiple types. For me it's a trade-off between a slightly more complicated spec vs. the slight absurdity of using a 256 bit (capable) number to store something that won't get out of the thousands. I am wondering if we might decide to store something like a public key as a number, in which case we'd need something bigger than 256. |
I would really encourage people to think of this kind of work in the other direction: from serialization up. The serialization is the most important thing. It needs to meet our use cases, needs to be guaranteed deterministic, and cannot easily be changed once shipped. It becomes part of the spec. The API is just API. It's language-specific, and can change whenever. As you say, the API can just be
Again, it sort of depends on the level you are speaking at:
Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.
I don't have a strong opinion on encoding issues, except that I prefer to just have one that is as general and well-tested as possible.
Yeah this is a big topic.
I'm going to be thinking about the data model more as part of the vm work. For now I think it makes sense to continue the status quo, but maybe fork the serialization code so that we can improve it faster (of course making it public so that upstream can take the work if they want).
I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.
Why can't we launch w/o ooc? |
That's an interesting idea. Does LEB128 (basically, encode 7 bits of number per 8-bit byte) qualify to you as compact? Or do you mean a fixed-size encoding like IEEE 754 floats (eg, maybe 80-bit extended precision or 128-bit)? I wonder...
Sorry I was talking about the size of the address space. 4GB is tiny. There must be a mechanism that enables us to address >4GB of data?
We can, but if we do it seems like we're accepting Go semantics for our types. Seems like it has far-reaching implications, beyond even stuff like "you can't index a slice with an int64". For example, if we want to index a map with our varint type, how does that work? If we use the varint type directly then it has to support == properly. Or we could index with varint.String() by convention, or maybe varint.Bytes() -- or anything else that supports equality. Or we could support it as a first-class thing with our own map type and define key and equality ourselves. Maybe we want it keyed by hash? Think of the different ways these choices serialize. Whatever we choose, we're locking ourselves into it. Seems unlikely to me that what Go does is exactly what we want for the long term. Similarly, map iteration order is random. I guess we could say "don't iterate maps" and launch with that and figure it out later. Or we could define our own type that prohibits iteration or defines an order we think makes sense. I'm kind of going down the line for each of these types and wondering if we are making an explicit choice to accept Go semantics, and if we should be. Maybe it's just a matter of putting an API in place to constrain how they work, I dunno. Edit: in the above I'm conflating two things: how data types are serialized, and their semantics (how they work). The former is the biggest concern, and I guess it doesn't require custom types, it just requires fine-grained control over the serialization format. The latter is still a concern, but a lesser one I guess. |
Re:
Yeah I was referring to the API level. At the serialization level I think we'll have exactly one or at most two serialization formats no matter how many integer types we have. Either exactly one serialization format that works for both signed and unsigned integers, or exactly one for signed and exactly one for unsigned. I wouldn't encode anything about sizes into the serialization format, that would be a higher-level API thing (basically, barf once you've read too many bits of data or overflow).
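The "barf once you've read too many bits" check described above might look like this (a sketch, assuming unsigned LEB128; the function name and `maxBytes` parameter are illustrative):

```go
package main

import "fmt"

// decodeULEB128 reads an unsigned LEB128 value from buf, rejecting
// encodings longer than maxBytes -- the size limit lives at this
// higher level, not in the serialization format itself.
func decodeULEB128(buf []byte, maxBytes int) (uint64, int, error) {
	var v uint64
	var shift uint
	for i, b := range buf {
		if i >= maxBytes {
			return 0, 0, fmt.Errorf("varint longer than %d bytes", maxBytes)
		}
		v |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			return v, i + 1, nil // final byte: no continuation bit
		}
		shift += 7
	}
	return 0, 0, fmt.Errorf("truncated varint")
}

func main() {
	v, n, err := decodeULEB128([]byte{0xe5, 0x8e, 0x26}, 10)
	fmt.Println(v, n, err) // 624485 3 <nil>
	_, _, err = decodeULEB128([]byte{0xe5, 0x8e, 0x26}, 2)
	fmt.Println(err) // rejected: longer than 2 bytes
}
```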
Totally agree and this is what's worrying me at the moment. We need a line of VM work in this area and it sounds like you are on top of it.
OK. I'm not going to worry too much about the implications of the new varint type on composite data types, I assume we're going to have work to do in that area as the data model emerges and we'll get to it then. |
No because for values with large magnitude but low precision, there are lots of wasted bits. I mean something more like this: But changing:
(Warning: I haven't thought this through, there might be major plot holes. Will think about it some over the weekend.)
I think we should declare that floats are verboten in this system.
I feel weird limiting this at the serialization layer if we can avoid it.
We who? If we can isolate the bit-twiddling in a library, it seems like it's just an issue for efficiency of decoding and encoding.
I think that the storage system will be separate from address space when we introduce the vm. Think of just like any classic storage API -- the address space inside the VM will have no bearing on the size of data storable.
.. this is a very deep question that I care a lot about. How about we split it out into its own bug or work stream or something? |
Ah, that makes sense. If this is the case we should stop talking about memory as a linear byte space and start talking about it as a structured data DAG or merkle-space or something like that, meaning data structures are merkleized and references are hashes.
Sure. Digging into this touched on a bunch of issues that we don't have to solve now, they just got me thinking. |
Popping back up to the actual problem. Here's what I've gathered. Goals:
Assumptions/requirements for encoding scheme:
Assumptions/requirements for API:
The plan:
Other things that I considered/encountered:
BTW @aboodman thanks for the discussion; this thinking in particular was very helpful:
|
I think I see where you are coming from now. My thinking was just that big integer arithmetic is sufficiently well understood that the answers we get from, e.g., As a comparison, we don't implement sha2 ourselves. The behavior is well-defined and we use a library that implements it per the spec. Serialization is a different matter. Our spec will demand a very specific serialization and we will likely have to implement that ourselves no matter what runtime library we use. Just as we implement the serialization of sha2 hashes ourselves today. The reason I bring this up is because as a user, it would be nice to be able to use
I think this is the intent.
Or maybe you use off-the-shelf libraries for some of the types and interfaces, which have the same behavior? |
I talked to aaron directly. In all worlds we have to precisely specify serialization formats. Beyond that there are a couple of questions about how prescriptive we are in the spec:
It seems like the answers should be "types in the spec are abstract types" and "our implementation does not have to track the types interfaces in the spec". Seems like this has to be the case to give implementors (ourselves) freedom to choose the best interface for the job given their language and constraints. There will be some places where we'll have to dial up our prescriptiveness on implementation (eg, "rounding must work this way"), but we shouldn't feel the need to implement what's in the spec. Correct? |
The point of the spec here is to enable interop, so if your implementation gets the same result using |
I think they are more abstract types and should map to a clear definition of how they behave under certain operations. If there is ambiguity we should spell it out in the spec (if it influences the end result). |
Filecoin is using a version of go-ipld-cbor from the refmt branch of ipfs/go-ipld-cbor. Here I'm merging that branch into master in our local fork. We need to make changes to it and it is more obvious for everyone if master reflects the actual state of the code. Changing this code is a pre-req for filecoin-project/venus#340
Do this so we can control the big.Int encoding from within the filecoin project. Filecoin will register a custom atlas, other consumers can register the now-exported atlas defined here. This is work towards filecoin-project/venus#340
OK, finally getting back around to this. @whyrusleeping points out in #527 (comment) why infinite precision is a bad idea. I want to get consensus on what to do before we move forward. The easiest thing to do is pick a minimum unit of FIL and store a varint count of those units. This is basically a fixed-point representation with the point at 10**-18 and is what ethereum does. So for example if the minimum unit is 10**-18, which I've heard several times, then we'd store 1 FIL as 1000000000000000000 encoded as a varint. This is simple but it's also really expensive, storage-wise. Storing the quantity 1 FIL takes 9 bytes. There are a few alternatives, for example
|
No, this isn't good. For the reasons in my other issue and just in general.
This sounds perfectly reasonable to me. We can even restrict the exponent field to be a single byte varint. It also helps to correct our thinking on the units here. Instead of thinking of the minimum unit of filecoin being represented here by the number 10**-18, it could instead be 1 (as it is in ethereum). This should make the base-change logic for combining numbers much simpler. |
Ensure filecoin and hamt are using the same new version of ipld-cbor, which contains the global dictionary they need to share. Work towards #340.
This is work towards #340 in preparation to switch from big.Int's byte encoding to our leb128 encoding. This change does not modify what you give to and (mostly) see in commands for token amounts, which is units of filecoin. This change is already enormous, so that can be a follow-on for another day. I don't plan to get to it right now.
Work towards #340. All that remains to convert are primitive types.
I'm going to close this issue out. The only thing left to do here that is a priority is the following which I put in the Ready queue: Also, I dumped state into a designdoc for the benefit of posterity: https://docs.google.com/document/d/12xNPzVCPSC2bTv7myxNRoMGx79AqFHQb89LfmRcFpGc/edit# |
Description
Currently the code base uses a mix of `big.Int` and `{u}int{32|64}` to represent numbers, which get translated as such into encoded blocks. The spec, though, says to use LEB128, with limitations on size in some places when decoding.
Your task, should you accept it, is to review the code base and
Acceptance criteria
Risks + pitfalls
Where to begin