Implement Consistent Varint Encoding #340

Closed
dignifiedquire opened this issue Apr 25, 2018 · 28 comments
@dignifiedquire
Contributor

dignifiedquire commented Apr 25, 2018

Description

Currently the code base uses a mix of big.Int and {u}int{32|64} to represent numbers, and these representations are carried over as-is into the encoded blocks.

The spec, however, says to use LEB128, with size limits in some places when decoding.

Your task, should you accept it, is to review the code base and

  1. make a suggestion on how to best integrate this into the codebase
  2. after feedback, implement.

Acceptance criteria

  • the ideal solution is both efficient and makes number types ergonomic to use in the codebase

Risks + pitfalls

Where to begin

@mishmosh
Contributor

Implementation proposal, please.

@aboodman aboodman self-assigned this May 22, 2018
@mishmosh
Contributor

@acruikshank took a look, needs more info. @dignifiedquire will elaborate.

@dignifiedquire
Contributor Author

@acruikshank could you provide some questions on what is unclear please?

@acruikshank
Contributor

acruikshank commented May 23, 2018

@dignifiedquire I was slightly mis-remembering when I said this was unclear (since part of the task is to clarify). The only real question is the scope of this change. There are many places we encode values. Are we only concerned with ints that will ultimately live on-chain here? That would include message parameters and anything that stays in actor memory. If that's it, I think this is ready to start (researching).

@dignifiedquire
Contributor Author

We are concerned with all values that end up on the chain, and also all the ones on disk, as that will make life easier for us in the future.

@phritz
Contributor

phritz commented May 30, 2018

@dignifiedquire (and anyone else knowledgeable about this area):

  • Why did we choose LEB128?
  • Do we have a reason to be opposed to using something like prefix varint which is exactly like LEB128 but just shifts the length bits to the front for efficiency (eg https://news.ycombinator.com/item?id=11263378)?
  • We are optimizing for (1) storage space, esp efficient encoding for small values and (2) what else? I'm not asking what would be nice, I'm asking what is most important.
  • Why not just have two varint types, signed and unsigned as opposed to having size limits filecoin-project/specs@7b8d7d7#diff-958e7270f96f5407d7d980f500805b1bR56? Is it to restrict the size of inputs to prevent DOS/resource consumption or are there other reasons?

Edit: for clarity

@phritz phritz self-assigned this May 30, 2018
@phritz
Contributor

phritz commented May 31, 2018

I started digging into this today and figured I'd surface some initial thoughts for feedback/input. This is new territory for me so please poke holes and point out oversights in what follows.

  • See also questions in previous comment.
  • It looks like what's required here is defining one or more varint types with an interface similar to big.Int (Add, Mul, Exp, LessThan, etc). These types will need to be implemented on both sides of the vm barrier. On the node/Go side it seems like we can embed/delegate to big.Int for most of the implementation. The Bytes/Token/etc implementations we have only need small tweaks to switch to a new type similar to big.Int. On the vm side we can re-use this implementation for now; when we move to a real vm we'll have to implement the whole interface on that side of the boundary. The main point of departure from big.Int is in serialization: VarInt.Bytes() []byte should serialize as an LEB128 or similar varint (a rough sketch follows this list).
  • The utility of having types of varintN for different values of N is unclear to me. The utility that I can see in having multiple values is limiting the size of inputs to prevent DOS / resource consumption. Maybe someone can tell me how useful it is to do that in a fine-grained way (this thing is max 8 bits, this other thing is max 32 bits, etc) -- at first blush it seems like we get most of the utility in a very simple way by picking a reasonable max value N (say 256 or 512 bits) and having all varints be var(u)intN's. Am I wrong on this? Looks like Ethereum does this for N=256 -- the other types are just syntax on top of varint256's. The only place in the codebase that it seems like would have to change if we use 256 instead of 512 is the mining lottery, but we have the option to fall back to the original mod strategy or do something else (eg shave a few bytes off the hash) if we only have 256 bits to work with, so this would be no big deal.
  • Re encoding: why LEB128? It seems like a good encoding on several axes but there are potentially better encodings out there. For example prefix varint which is basically LEB128 but with the continuation bits packed at the front (see previous comment); this results in significant performance gains or so the internets tell me.
  • Re encoding: the encoding needs to be deterministic and unique. LEB128 is not unique out of the box (you can left-pad the encoded value with 0's or 1's if negative) but it looks like we can just disallow zero padding.
  • Actor code uses Go slices and maps. A slice is indexed by int64 in miner.go, but I think slice indices are only 32 bits on some platforms, so that gets me a little worried. If indexing a slice by a varint we could convert to 32/64 bits; this would just be a call like varint.ToInt64() or similar. Maps we're indexing by string; I think this works fine for varints -- if we want to use a varint index, we can key by its .String() value. But this gets me wondering whether sooner rather than later we should be using our own slice/array and map types here instead of the Go types. It would enable us to enforce how the types are used and also start giving shape to how these things should actually be implemented. @aboodman you've built this type system before, whatchoo think?
  • Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?
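
For the big.Int-style wrapper in the first bullet, here's a rough, non-authoritative sketch of what I'm imagining; names like types.VarInt and leb128Encode are placeholders, not anything in the codebase today:

```go
// Hypothetical sketch: a VarInt that delegates arithmetic to math/big and
// confines the encoding rules to Bytes().
package types

import "math/big"

type VarInt struct {
	val big.Int
}

func NewVarInt(i int64) *VarInt {
	v := &VarInt{}
	v.val.SetInt64(i)
	return v
}

// Arithmetic and comparison just delegate to math/big.
func (v *VarInt) Add(a, b *VarInt) *VarInt {
	v.val.Add(&a.val, &b.val)
	return v
}

func (v *VarInt) LessThan(o *VarInt) bool {
	return v.val.Cmp(&o.val) < 0
}

// Bytes returns the canonical LEB128 form rather than big.Int's raw
// big-endian bytes.
func (v *VarInt) Bytes() []byte {
	return leb128Encode(&v.val) // placeholder for the real encoder
}

func leb128Encode(x *big.Int) []byte {
	panic("not implemented in this sketch")
}
```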

There's more but it doesn't rise to the level of discussion at this point and depends on input on the above....

@phritz
Contributor

phritz commented May 31, 2018

Also:

  • More generally than just slice and map: we talk about vm memory modeled as a linear array of bytes. That's true at a low level but from an actor's point of view memory is structured data: literally a struct and whatever composition of go types it contains. Seems like we can't launch without custom types, I wonder whether all of this structured data shouldn't be custom types sooner rather than later...

@dignifiedquire
Contributor Author

Why did we choose LEB128?
Do we have a reason to be opposed to using something like prefix varint which is exactly like LEB128 but just shifts the length bits to the front for efficiency (eg https://news.ycombinator.com/item?id=11263378)?

LEB128 is the only well known, somewhat standardized varint encoding format that I could find, all other derivatives are mostly one off projects with little standard support outside of their specific ecosystem.

We are optimizing for (1) storage space, esp efficient encoding for small values and (2) what else? I'm not asking what would be nice, I'm asking what is most important.

  1. future proofing, making sure we don't suddenly run out of number space
  2. minimizing storage size on chain
  3. performance
  4. developer ux

Why not just have two varint types, signed and unsigned as opposed to having size limits filecoin-project/specs@7b8d7d7#diff-958e7270f96f5407d7d980f500805b1bR56? Is it to restrict the size of inputs to prevent DOS/resource consumption or are there other reasons?

Two ideas to this

  • be able to optimize runtime performance by just using 32/64 bit numbers instead of actual varints on these, for now while still knowing we can upgrade without breaking the chain format
  • make validation easier/stricter and as such reduce the ability for dos/resource consumption attacks during parse time

Re encoding: the encoding needs to be deterministic and unique. LEB128 is not unique out of the box (you can left-pad the encoded value with 0's or 1's if negative) but it looks like we can just disallow zero padding.

good catch, we should choose a single option and make sure to add it to the spec and enforce it.

I wonder whether all of this structured data shouldn't be custom types sooner rather than later...

What exactly do you mean by custom types?

@phritz
Contributor

phritz commented May 31, 2018

LEB128 is the only well known, somewhat standardized varint encoding format that I could find, all other derivatives are mostly one off projects with little standard support outside of their specific ecosystem

OK. The fact that these "standards" are like <= 40 lines of code, and therefore easy for anyone to implement, inclines me to pick the best one, as opposed to the one most people have heard of. But it bears a bit more investigation.

Two ideas to this

be able to optimize runtime performance by just using 32/64 bit numbers instead of actual varints on these, for now while still knowing we can upgrade without breaking the chain format
make validation easier/stricter and as such reduce the ability for dos/resource consumption attacks during parse time

Potential optimizations noted. My instinct is to go the other, simpler direction though: define exactly two types (one varint, one varuint) with a reasonable max size, say 256 bits, and just use that everywhere until we have a good reason to introduce a bunch of types. I'm asking for feedback from everyone: are there reasons to prefer defining different sized varints (say, varint32, varint64, varint128, varint256) rather than just defining one size (say, var(u)int256) and being done with it for now? I'm having trouble marshaling an argument in favor of many types -- it seems slightly safer in the sense that fields and parameters are more guaranteed to be right-sized, but it's hard for me to know how much safer that makes us or if there are other reasons to prefer lots of types... Thoughts?

I wonder whether all of this structured data shouldn't be custom types sooner rather than later...
What exactly do you mean by custom types?

The idea would be to introduce types.Map and types.Array and use those on chain and in actors. That way we can be sure that maps and arrays are being used (eg, indexed) in safe/proper ways; we start formalizing the structured data api which is what actors see, what goes into the chain, and what we're going to have to implement on the vm side; and it's clear what to merkelize (at the level of the structured data types). I guess if we start going down this route we need to decide how deep to go: do we mix these types with go types or go all the way in (eg, types.Struct instead of go structs)?

@porcuquine
Contributor

There's a lot going on here, so I'll just comment on the explicit question above.

Abstractly, I favor minimizing the number of variable-sized formats. Just one for signed and one for unsigned sounds good to me. The alternative feels like premature optimization with a complexity cost (and especially the overhead of forcing design decisions up-front) which undermines some benefits of adopting a flexible and future-proof format in the first place.

@acruikshank
Contributor

I'm ambivalent on having multiple types. For me it's a trade-off between a slightly more complicated spec vs. the slight absurdity of using a 256 bit (capable) number to store something that won't get out of the thousands. I am wondering if we might decide to store something like a public key as a number, in which case we'd need something bigger than 256.

@aboodman
Contributor

aboodman commented May 31, 2018

It looks like what's required here is defining one or more varint types with an interface similar to big.Int (Add, Mul, Exp, LessThan, etc)

I would really encourage people to think of this kind of work in the other direction: from serialization up. The serialization is the most important thing. It needs to meet our use cases, needs to be guaranteed deterministic, and cannot easily be changed once shipped. It becomes part of the spec. The API is just API. It's language-specific, and can change whenever. As you say, the API can just be big.Int for now, or even int in most cases for now.

The utility of having types of varintN for different values of N is unclear to me.

Again, it sort of depends on the level you are speaking at:

  • In the serialization, the only value I can imagine is efficiency of decoding/encoding. This doesn't seem important enough to me to introduce complexity in the serialization. Because serialization is so important to get right in these kinds of systems, I would prefer to focus on a few types that we think about very carefully, and relegate the rest to higher layers.

  • At the message level, it is nice to be able to be specific about the parameter and return types, so that consumers can have guarantees about the kind of data they are working with, and so that they aren't on the hook (from both correctness and cost pov) for validating their inputs. However, this is different from serialization. This is a type system - it can (and should) be expressed entirely separately from the serialization.

picking a reasonable max value N

Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.

Re encoding: why LEB128? It seems like a good encoding on several axes but there are potentially better encodings out there. For example prefix varint which is basically LEB128 but with the continuation bits packed at the front (see previous comment); this results in significant performance gains or so the internets tell me.

I don't have a strong opinion on encoding issues, except that I prefer to just have one that is as general and well-tested as possible.

Actor code uses Go slices and maps. A slice is indexed by int64 in miner.go, but I think slice indices are only 32 bits on some platforms, so that gets me a little worried. If indexing a slice by a varint we could convert to 32/64 bits; this would just be a call like varint.ToInt64() or similar. Maps we're indexing by string; I think this works fine for varints -- if we want to use a varint index, we can key by its .String() value. But this gets me wondering whether sooner rather than later we should be using our own slice/array and map types here instead of the Go types. It would enable us to enforce how the types are used and also start giving shape to how these things should actually be implemented. @aboodman you've built this type system before, whatchoo think?

Yeah this is a big topic.

  1. People like to use the language-native types (map, slice, whatever), especially in Go as API. I think that is fine for most cases, at least for now. But whatever API we use to access the data model, we need to be absolutely sure of our serialization and validation between that and the on-chain representation. We've already run into several bugs where the serialization was non-deterministic (e.g., Round-trip nil byte slices correctly. polydawn/refmt#28), and this makes me very nervous. I think that we need to control this code directly.

  2. Because people generally prefer to use the Go types, my preference would be to focus on that for now and really make sure we have the translations correct there. If we introduce another API there will be tension over which to use and we'll have to ensure both are correct, and we'll just double the amount of work we have to do.

  3. Large-scale data. We are much different than, e.g., Ethereum because we want to store lots of structured data inside actor memory. Like, the entire storage market. We cannot just deserialize the entire market into a Go map and call it a day. So this is going to eventually have to get more complicated. We are going to need collections of some sort that we can read and write incrementally.

  4. Structure/types/interop. Just like how we have a current desire to restrict the size of integers in parameters to messages, we will, over time, probably want to add additional structure to message parameters. Instead of just a buffer, I want to say that you should pass me a struct that represents a multisig signing policy. Putting it on developers to reinvent deterministic serialization for complex data, and putting decoding and validation inside actors doesn't seem like it's going to work long-term. It also makes it harder for actors to interop. If I want to rely on two different actors, but they have different ways to represent list or struct, that makes it pretty hard to compose things. I believe that over time there will be pressure to increase the richness of the vocabulary you can use in message parameters for this reason.

I'm going to be thinking about the data model more as part of the vm work. For now I think it makes sense to continue the status quo, but maybe fork the serialization code so that we can improve it faster (of course making it public so that upstream can take the work if they want).

Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?

I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.

Seems like we can't launch without custom types

Why can't we launch w/o ooc?

@phritz
Contributor

phritz commented May 31, 2018

Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.

That's an interesting idea. Does LEB128 (basically encode 7 bits of number per 8 bit byte) qualify to you as compact? Or do you mean a fixed-size encoding like IEEE 754 floats (eg, maybe 80-bit extended precision or 128 bit)?

I wonder...

  • Up to what size do we want exactly represented integers? Maybe a better way to put it: what is the lowest number at which having a non-exact integer representation would be a deal-breaker? 2**64-1?
  • How much do we care about bit-twiddling?

Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?
I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.

Sorry I was talking about the size of the address space. 4GB is tiny. There must be a mechanism that enables us to address >4GB of data?

Seems like we can't launch without custom types
Why can't we launch w/o ooc?

We can, but if we do, it seems like we're accepting go semantics for our types. Seems like it has far-reaching implications, beyond even stuff like "you can't index a slice with an int64". For example if we want to index a map with our varint type, how does that work? If we use the varint type directly then it has to support == properly. Or we could index with varint.String() by convention or maybe varint.Bytes() -- or anything else that supports equality. Or we could support it as a first-class thing with our own map type and define key and equality ourselves. Maybe we want it keyed by hash? Think of the different ways these choices serialize. Whatever we choose we're locking ourselves into it. Seems unlikely to me that what go does is exactly what we want for the long term. Similarly, map iteration order is random. I guess we could say "don't iterate maps" and launch with that and figure it out later. Or we could define our own type that prohibits iteration or defines an order we think makes sense. I'm kind of going down the line for each of these types and wondering whether we are making an explicit choice to accept go semantics, and if we should be.

Maybe it's just a matter of putting an API in place to constrain how they work, I dunno.

Edit: in the above I'm conflating two things: how data types are serialized, and their semantics (how they work). The former is the biggest concern, and I guess it doesn't require custom types, just fine-grained control over the serialization format. The latter is still a concern, but a lesser one I guess.

@phritz
Contributor

phritz commented May 31, 2018

Re:

In the serialization, the only value I can imagine is efficiency of decoding/encoding....
...
At the message level, it is nice to be able to be specific about the parameter and return types....

Yeah I was referring to the API level. At the serialization level I think we'll have exactly one or at most two serialization formats no matter how many integer types we have. Either exactly one serialization format that works for both signed and unsigned integers, or exactly one for signed and exactly one for unsigned. I wouldn't encode anything about sizes into the serialization format; that would be a higher-level API thing (basically, barf once you've read too many bits of data or overflow).
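
To illustrate that last point, a minimal sketch of enforcing a size limit above a single shared decoder; the stdlib's unsigned base-128 varint reader stands in here for whatever decoder we actually adopt, and decodeUintMax is a made-up name:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeUintMax reads one unsigned varint and rejects values wider than
// maxBits, keeping size restrictions out of the wire format itself.
func decodeUintMax(buf []byte, maxBits uint) (uint64, error) {
	v, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, fmt.Errorf("invalid or truncated varint")
	}
	if maxBits < 64 && v >= 1<<maxBits {
		return 0, fmt.Errorf("varint %d exceeds %d bits", v, maxBits)
	}
	return v, nil
}

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, 300)
	_, err := decodeUintMax(buf[:n], 8) // 300 does not fit in 8 bits
	fmt.Println(err)
}
```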

People like to use the language-native types (map, slice, whatever), especially in Go as API. I think that is fine for most cases, at least for now. But whatever API we use to access the data model, we need to be absolutely sure of our serialization and validation between that and the on-chain representation.

Totally agree and this is what's worrying me at the moment. We need a line of VM work in this area and it sounds like you are on top of it.

Because people generally prefer to use the Go types, my preference would be to focus on that for now

OK. I'm not going to worry too much about the implications of the new varint type on composite data types, I assume we're going to have work to do in that area as the data model emerges and we'll get to it then.

@aboodman
Contributor

aboodman commented May 31, 2018

Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.

That's an interesting idea. Does LEB128 (basically encode 7 bits of number per 8 bit byte) qualify to you as compact?

No because for values with large magnitude but low precision, there are lots of wasted bits. I mean something more like this:

https://github.com/filecoin-project/specs/pull/78/files#diff-958e7270f96f5407d7d980f500805b1bR62

But changing:

  • Instead of varint64, let the size of the components be unrestricted (I suppose using leb128 or whatever)
  • I'd also want to investigate putting a denominator in there so that it can represent rationals, not just decimal values, just for completeness. But you'd have to define how to canonicalize.

(Warning: I haven't thought this through, there might be major plotholes. Will think about it some over the weekend.)

Or do you mean a fixed-size encoding like IEEE 754 floats (eg, maybe 80-bit extended precision or 128 bit)?

I think we should declare that floats are verboten in this system.

I wonder...

  • Up to what size do we want exactly represented integers? Maybe a better way to put it: what is the lowest number at which having a non-exact integer representation would be a deal-breaker? 2**64-1?

I feel weird limiting this at the serialization layer if we can avoid it.

  • How much do we care about bit-twiddling?

We who? If we can isolate the bit-twiddling in a library, it seems like it's just an issue for efficiency of decoding and encoding.

Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?

I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.

Sorry I was talking about the size of the address space. 4GB is tiny. There must be a mechanism that enables us to address >4GB of data?

I think that the storage system will be separate from address space when we introduce the vm. Think of it just like any classic storage API -- the address space inside the VM will have no bearing on the size of data storable.

Seems like we can't launch without custom types

Why can't we launch w/o ooc?

We can, but if we do, it seems like we're accepting go semantics for our types.

.. this is a very deep question that I care a lot about. How about we split it out into its own bug or work stream or something?

@phritz
Contributor

phritz commented May 31, 2018

I think that the storage system will be separate from address space when we introduce the vm. Think of it just like any classic storage API -- the address space inside the VM will have no bearing on the size of data storable.

Ah, that makes sense. If this is the case we should stop talking about memory as a linear byte space and start talking about it as a structured data dag or merkelspace or something like that, meaning data structures are merkelized and references are hashes.

.. this is a very deep question that I care a lot about. How about we split it out into its own bug or work stream or something?

Sure. Digging into this touched on a bunch of issues that we don't have to solve now, they just got me thinking.

@phritz
Contributor

phritz commented Jun 1, 2018

Popping back up to the actual problem. Here's what I've gathered.

Goals:

  • Primary: future proof our numbers: enable those that want to grow arbitrarily large to do so
  • Close second: save bytes on chain

Assumptions/requirements for encoding scheme:

  • Exactly represent arbitrarily large unsigned and (maybe) signed integers
  • Deterministic, unique encoding
  • Optimize for small numbers, as many will be close to zero and nearly all will be less than 2**64 in practice.
  • Select encoding for small size and performance in that order
  • Non-goals:
    • actually optimize the en/decoding

Assumptions/requirements for API:

  • Make it easy to delegate implementation eg to int64 for efficiency
  • Don't get too caught up in the API right now, we can change it
  • Supply a reasonable minimum of types (ie, don't feel the need to serve every need right now)

The plan:

  • Because we want to support arbitrarily-sized varints, the obvious choice is LEB128 (https://en.wikipedia.org/wiki/LEB128): basically, encode 7 bits in each byte with one bit signaling that another byte follows (hence the B128/base-128 name). LEB128 makes no assumptions about max size, though it also works for max-sized varints and as a building block for other encoding schemes like scientific notation. It's easy to understand and implement. To ensure a unique encoding we'll disallow zero prefixing and be sure to include this in acceptance tests (rough sketch after this list).
  • Note: I can't find any place in the code where we presently need a signed varint. The only reason to implement the signed varint that I can think of is to support https://github.com/filecoin-project/specs/pull/78/files#diff-958e7270f96f5407d7d980f500805b1bR62 -- I'm inclined to do it for that reason...(?)
  • For the API we'll mimic big.Int and do the small corrective surgery to convert our token/bytes types over.
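
As a concrete reference, a minimal sketch of the unsigned half of this plan, assuming the "reject padded encodings" rule above; the real thing would work over big.Int, but the byte layout is identical, and the package and names are illustrative only:

```go
package leb128

import "errors"

var ErrNonCanonical = errors.New("leb128: non-canonical (padded) encoding")

// EncodeUint writes x as little-endian base-128: 7 value bits per byte,
// with the high bit set on every byte except the last.
func EncodeUint(x uint64) []byte {
	var out []byte
	for {
		b := byte(x & 0x7f)
		x >>= 7
		if x != 0 {
			b |= 0x80 // continuation: more groups follow
		}
		out = append(out, b)
		if x == 0 {
			return out
		}
	}
}

// DecodeUint returns the value and the number of bytes consumed, rejecting
// encodings padded with redundant zero groups (e.g. 0x80 0x00 for zero).
func DecodeUint(buf []byte) (uint64, int, error) {
	var x uint64
	var shift uint
	for i, b := range buf {
		if i >= 10 {
			return 0, 0, errors.New("leb128: value overflows uint64")
		}
		x |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			if b == 0 && i > 0 {
				return 0, 0, ErrNonCanonical
			}
			return x, i + 1, nil
		}
		shift += 7
	}
	return 0, 0, errors.New("leb128: truncated input")
}
```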

Other things that I considered/encountered:

BTW @aboodman thanks for the discussion; this thinking in particular was very helpful:

I would really encourage people to think of this kind of work in the other direction: from serialization up. The serialization is the most important thing.

@aboodman
Contributor

aboodman commented Jun 1, 2018

Why mimic big.Int instead of just using it directly?

Because these types are found in the spec and big.Int is a Go thing? I have assumed -- perhaps incorrectly -- that the spec needs to define the semantics of all the data types used in algorithms covered by the spec. For example the payment channel actor implementation:
https://github.com/filecoin-project/specs/blob/master/drafts/retrieval-market.md
It seems like we're going to need to give precise algorithms for the implementations of its methods, which means defining exactly what operations you can do with data types and how they should work.

I think I see where you are coming from now.

My thinking was just that big integer arithmetic is sufficiently well understood that the answers we get from, e.g., big.Int.Add are going to be the same answers that our spec will demand. Not because we're cribbing from Go but because there's only one possible correct answer for adding two integers (maybe I'm wrong and there's subtlety here).

As a comparison, we don't implement sha2 ourselves. The behavior is well-defined and we use a library that implements it per the spec.

Serialization is a different matter. Our spec will demand a very specific serialization and we will likely have to implement that ourselves no matter what runtime library we use. Just as we implement the serialization of sha2 hashes ourselves today.

The reason I bring this up is because as a user, it would be nice to be able to use int64 or big.Int in the cases that I can, because I already know those APIs well.

I admit I don't fully understand how the spec is supposed to facilitate separate implementations. At one extreme a strategy is to precisely define all the consensus/chain-affecting types in use and how they work. This strategy means it's pretty straightforward to do a separate implementation: filecoin defines a bunch of types and interfaces and gives you the algorithms in terms of those types (in Go seems fine, or something close to it).

I think this is the intent.

You implement the types and interfaces as described.

Or maybe you use off-the-shelf libraries for some of the types and interfaces, which have the same behavior?

@phritz
Contributor

phritz commented Jun 1, 2018

I talked to aaron directly. In all worlds we have to precisely specify serialization formats. Beyond that there are a couple of questions about how prescriptive we are in the spec:

  1. Are the types we use in the spec (eg, bigints, maps, arrays, etc) supposed to be understood as abstract types or concrete types for which we give a ~full interface and description of semantics?
  2. How closely, if at all, does our implementation have to track the type interfaces in the spec? For example, if the spec shows an array used in an algorithm, is it important that in that algorithm the thing we use looks and acts like an array? Or could it, for example, be a sequence accessed by a cursor?

It seems like the answers should be "types in the spec are abstract types" and "our implementation does not have to track the type interfaces in the spec". Seems like this has to be the case to give implementors (ourselves) freedom to choose the best interface for the job given their language and constraints. There will be some places where we'll have to dial up our prescriptiveness on implementation (eg, "rounding must work this way"), but we shouldn't feel the need to implement things exactly as they appear in the spec.

Correct?

@dignifiedquire
Contributor Author

How closely, if at all, does our implementation have to track the type interfaces in the spec? For example, if the spec shows an array used in an algorithm, is it important that in that algorithm the thing we use looks and acts like an array? Or could it, for example, be a sequence accessed by a cursor?

The point of the spec here is to enable interop, so if your implementation gets the same result using MyFancyCursorThatIterates100000timesFasterThanYourArray, your implementation is welcome to use it. Remember, we hash our results, so as long as the hash is the same, and subsequently the data you put on the wire, things are good.

@dignifiedquire
Contributor Author

Are the types we use in the spec (eg, bigints, maps, arrays, etc) supposed to be understood as abstract types or concrete types for which we give a ~full interface and description of semantics?

I think they are more abstract types and should map to a clear definition of how they behave under certain operations. If there is ambiguity we should spell it out in the spec (if it influences the end result).

@mishmosh mishmosh modified the milestones: Sprint 12, Sprint 13 👻 Jun 5, 2018
phritz added a commit to filecoin-project/go-ipld-cbor that referenced this issue Jun 7, 2018
Filecoin is using a version of go-ipld-cbor from the refmt branch of ipfs/go-ipld-cbor. Here I'm merging that branch into master in our local fork. We need to make changes to it and it is more obvious for everyone if master reflects the actual state of the code.

Changing this code is a pre-req for filecoin-project/venus#340
phritz added a commit to filecoin-project/go-ipld-cbor that referenced this issue Jun 8, 2018
Do this so we can control the big.Int encoding from within the filecoin project. Filecoin will register a custom atlas, other consumers can register the now-exported atlas defined here.

This is work towards filecoin-project/venus#340
@mishmosh mishmosh modified the milestones: Sprint 13 👻, Sprint 14 Jun 13, 2018
@phritz
Contributor

phritz commented Jun 14, 2018

OK, finally get back around to this. @whyrusleeping points out in #527 (comment) why infinite precision is a bad idea. I want to get consensus on what to do before we move forward.

The easiest thing to do is pick a minimum unit of FIL and store a varint count of those units. This is basically a fixed-point representation with the point at 10**-18, and is what ethereum does. So for example if the minimum unit is 10**-18, which I've heard several times, then we'd store 1 FIL as 1000000000000000000 encoded as a varint. This is simple but it's also really expensive, storage-wise: storing the quantity 1 FIL takes 9 bytes.
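
Where the 9 bytes come from: 10**18 needs 60 bits, and base-128 carries 7 payload bits per byte, so ceil(60/7) = 9. The stdlib varint (same 7-bits-per-byte layout) shows the same count:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	oneFIL := uint64(1000000000000000000) // 1 FIL in 10**-18 units
	buf := make([]byte, binary.MaxVarintLen64)
	fmt.Println(binary.PutUvarint(buf, oneFIL)) // 9 bytes
}
```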

There are a few alternatives, for example

  • we could have the minimum unit be 10**-9, which saves us 4 bytes for every number we store. 10**-9 means that if FIL goes to $1,000,000 we can still represent quantities of a tenth of a cent. If FIL goes to $1,000,000 then it stores the entire present net worth of the world. The problem is that there are implications for how we want to represent prices on the storage market (https://github.com/filecoin-project/aq/issues/53). I would like to imagine that we can express price in the storage market without being too constrained by having a minimum denomination of 10**-9, but there are requirements that I think I'm missing.
  • we could still have the representation be in units of 10**-18 so operations are fixed cost, but store in scientific notation at a known, bounded cost of divisions and multiplications, thus saving storage bytes when the full 10**-18 precision isn't needed. There is no way for the user to affect cost beyond a few dozen multiplications or divisions, but I guess it's still technically arbitrary precision on chain, so I'm not going to go this route unless I hear other opinions.

@whyrusleeping
Member

we could have the minimum unit be 10**-9

No, this isn't good. For the reasons in my other issue and just in general.

we could still have the representation be in units of 10**-18 so operations are fixed cost but store in scientific notation at a known

This sounds perfectly reasonable to me. We can even restrict the exponent field to be a single byte varint.

It also helps to correct our thinking on the units here. Instead of thinking of the minimum unit of filecoin as being represented by the number 10**-18, it could instead be represented by 1 (as it is in ethereum). This should make the base-change logic for combining numbers much simpler.
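
A rough sketch of the coefficient/exponent idea under those assumptions (minimum unit of 1, exponent capped at one byte); the names are made up and the canonical form would need to be pinned down in the spec:

```go
package main

import "fmt"

// normalize splits an amount (in the minimum unit) into a coefficient with no
// trailing decimal zeros plus a base-10 exponent, so 1 FIL (10**18) becomes
// (1, 18). The coefficient would then be varint-encoded and the exponent
// stored as a single byte.
func normalize(v uint64) (coeff uint64, exp uint8) {
	if v == 0 {
		return 0, 0
	}
	for v%10 == 0 {
		v /= 10
		exp++
	}
	return v, exp
}

func main() {
	c, e := normalize(1000000000000000000)
	fmt.Println(c, e) // 1 18 -- roughly two bytes on the wire instead of nine
}
```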

phritz added a commit that referenced this issue Jun 18, 2018
Ensure filecoin and hamt are using the same new version of ipld-cbor, which contains the global dictionary they need to share. Work towards #340.
phritz added a commit that referenced this issue Jun 19, 2018
This is work towards #340 in preparation to switch from big.Int's byte encoding to our leb128 encoding.

This change does not modify what you give to and (mostly) see in commands for token amounts, which are units of filecoin. This change is already enormous, so that can be a follow-on for another day. I don't plan to get to it right now.
phritz added a commit that referenced this issue Jun 21, 2018
phritz added a commit that referenced this issue Jun 21, 2018
Work towards #340. All that remains to convert are primitive types.
@phritz
Contributor

phritz commented Jun 21, 2018

I'm going to close this issue out. The only thing left to do here that is a priority is the following which I put in the Ready queue:

#587

Also, I dumped state into a designdoc for the benefit of posterity:

https://docs.google.com/document/d/12xNPzVCPSC2bTv7myxNRoMGx79AqFHQb89LfmRcFpGc/edit#

Other, lower priority items include:
#559
#585
#584

@phritz phritz closed this as completed Jun 21, 2018
@mishmosh mishmosh modified the milestones: Sprint 14, Sprint 15 Jun 26, 2018