Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default 5 digit prefix for flatfs is no longer sufficient for CidV1 #3463

Closed
kevina opened this issue Dec 3, 2016 · 55 comments
Closed

Default 5 digit prefix for flatfs is no longer sufficient for CidV1 #3463

kevina opened this issue Dec 3, 2016 · 55 comments
Assignees
Labels
need/community-input Needs input from the wider community
Milestone

Comments

@kevina
Copy link
Contributor

kevina commented Dec 3, 2016

When we switch to Cidv1 the assumptions we built into the 5 (base 32) digit prefix:

	// 5 bytes of prefix gives us 25 bits of freedom, 16 of which are taken by
	// by the Qm prefix. Leaving us with 9 bits, or 512 way sharding

are no longer correct, in fact when raw leaves are used most of the blocks will be clustered in the AFZBE directory. See #3462. There are several ways to fix this:

  1. One way to fix this is to increase prefix length, but then we we be back to the problem we had before where we will have lots of directories with just a few files in them (when CidV0 is used).
  2. Another alternative is proposed in Consider using the last N digits instead of the first N for sharding. go-ds-flatfs#5. That is use the last N digits (which should likely be a 2 digit suffix, which will give us 1024 way sharding).
  3. Finally, and the most work, it to stick with using a prefix but implement a dynamic prefix width.

I think (2) would be the best alternative until (3) can be implemented.

@kevina kevina added the need/community-input Needs input from the wider community label Dec 3, 2016
@Kubuxu Kubuxu self-assigned this Dec 3, 2016
@kevina kevina changed the title Default 5 digit prefix for flatfs is no longer suffent for CidV1 Default 5 digit prefix for flatfs is no longer sufficient for CidV1 Dec 3, 2016
@kevina kevina added the status/in-progress In progress label Dec 5, 2016
@kevina
Copy link
Contributor Author

kevina commented Dec 5, 2016

@whyrusleeping I can likely implement the last two digits quickly, the dynamic solution will likely take more work and I don't think it should be for 0.4.5.

@kevina kevina added this to the ipfs 0.4.5 milestone Dec 5, 2016
@kevina
Copy link
Contributor Author

kevina commented Dec 5, 2016

Also as a followup from the sprint meeting: The flatfs datastore doesn't even implement range queries, so I don't think this will be an issue at all. Once we figure out how to properly implement something dynamic than we can revisit the issue of range queries.

@Kubuxu Kubuxu removed their assignment Dec 5, 2016
@Kubuxu
Copy link
Member

Kubuxu commented Dec 5, 2016

I agree with that dynamic solution might be too complex to tackle for 0.4.5.

@kevina kevina self-assigned this Dec 5, 2016
@jbenet
Copy link
Member

jbenet commented Dec 6, 2016

the problem we had before where we will have lots of directories with just a few files in them

How big of a problem is this? has anyone quantified this with a graph on read latency overhead? particularly for a node under lots of activity?

I dont think we should do (2). you're shifting the problem to an unconventional thing, without telling users that this has shifted. i may be fine if we add a file called .ipfs/blocks/_README that says something like:

This is a repository of IPLD objects. Each IPLD object is in a single file,
named `<hash-of-object>.ipld`. All the object files are placed in a tree
of directories, based on a function on the object hashes. This is a form 
of sharding similar to the objects directory in git repositories. Previously, 
we used prefixes. Now, we use suffixes. The exact function is:

    func FilePath(objectHash string) {
      return string.Split(string(objectHash[objectHash.length - 6 : ]), "/")
    }

For example, an object that hashes to:

    <hash of object>

Will be placed at

    a/b/c/objhashabc.ipld
  • the example should be real
  • the function should be correct (i sketched it)

I think this file o/ would make me feel ok about changing the function that figures out the location of the content.

@jbenet
Copy link
Member

jbenet commented Dec 6, 2016

  • We should add such a file to signal anyway.
    • add file that describes what's going on.
  • I think (1), then (3). but fine with (2), then (3).

@kevina
Copy link
Contributor Author

kevina commented Dec 6, 2016

@jbenet, the read latency overhead is really dependent on the filesystem. In addition some file systems (such as ext4 when dir_index is enabled) start to have a problem when the directory gets too large. I will get some numbers for you on why (1) would be a problem now (and wasn't before) in a bit.

@jbenet
Copy link
Member

jbenet commented Dec 6, 2016

its fine-- just go with (2) and add that file.

@kevina
Copy link
Contributor Author

kevina commented Dec 6, 2016

@jbenet, okay will do. Thanks!

@kevina
Copy link
Contributor Author

kevina commented Dec 6, 2016

@jbenet @whyrusleeping should we just modify go-ds-flatfs to use suffix or create a new version and call it maybe go-ds-flatfs-suffix? (Modifying it too do both will likely create a lot of a special case code I would like to avoid).

@whyrusleeping
Copy link
Member

@kevina make the sharding selector a function so we can specify on the creation of the flatfs datastore. This will make migrations so much easier for me.

@kevina
Copy link
Contributor Author

kevina commented Dec 6, 2016

@whyrusleeping All right I will see if I can make that work.

I am also willing to handle most of the migration.

@kevina
Copy link
Contributor Author

kevina commented Dec 7, 2016

See ipfs/go-ds-flatfs#6 for a start. The README will be generated by go-ipfs not flatfs as it is go-ipfs specific.

@kevina
Copy link
Contributor Author

kevina commented Dec 8, 2016

Because we use Base32 encoding the very last byte/digit is not quite as random as I would hope and also depends on the length of the binary representation:

Length of Key in Bits Bits in last byte
8 3
16 1
24 4
32 2
40 5
... ...

When converting the repo created for #3462 that has a mixture of CidV0 and CidV1 raw leaves a 2 digit suffix I got me 256 levels of shading (instead of the expected 1024) and I didn't check but the distribution is unlikely to be even. With additional Cid types this number could go up to 1024 directories depending on the length of the key.

Increasing the suffix to 3 digits will increase the number of directories to 8192, but that could go up to 32768 directories.

Another possibility (that will be easy for me to implement based on what I got now), is disregard the very last digit and take the next to last 2 digits, then we get an evenly distributed 1024 way sharding. For example if the key is AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY the directory used will be CA, (instead of AY or CAY).

@whyrusleeping what do you think.

Sorry for this oversight.

@kevina
Copy link
Contributor Author

kevina commented Dec 8, 2016

Okay I ran some simulations using the keys from the repo created in #3462. I confirmed that using a fixed prefix just won't work when using both CidV0 and CidV1 keys, to make it long enough to handle CidV1 you end with lots of directories with only a few files in them when CidV0 is used. The following options could work:

How Data Set Num Dirs Max Dirs
Suffix length 2 CidV0 128 1024
Suffix length 2 CidV0 and Raw Leaves 256 1024
Next to Last 2 CidV0 1024 1024
Next to Last 2 CidV0 and Raw Leaves 1024 1024
Suffix length 3 CidV0 4096 32768
Suffix length 3 CidV0 and Raw Leaves 8092 32768

The distribution in all cases is reasonable even (I was wrong in my initial guess) in all cases.

Here is the list of keys used and various results: https://gateway.ipfs.io/ipfs/QmNbugVGQDcMpDYezAELasskUWyPms9JAjFBRYuJcZtqFC

@kevina
Copy link
Contributor Author

kevina commented Dec 8, 2016

I would either go with the suffix length of 2 or use the next to last 2 digits. A suffix length of 3 will create two many directories for an average sized repo, especially if they are keys that have a length that is a multiple of 5 binary bytes (in which case there could be up to 32768 directories).

We could also make the suffix length a configurable option with the default value being 2. I am not sure what additional complications this will bring though.

@whyrusleeping
Copy link
Member

Ah, so the last character in the base32 encoded keys don't contain an entire 5 bits of data. Given thats the case, lets go ahead and use a suffix length of three.

@kevina
Copy link
Contributor Author

kevina commented Dec 8, 2016

@whyrusleeping are you sure? That can lead to a large numbers of directives with only a few a few files per directory for an average sized repo. If you don't like my next-to-last idea, I would stick with 2 or make it configurable.

@whyrusleeping
Copy link
Member

yep

@Kubuxu
Copy link
Member

Kubuxu commented Dec 14, 2016

Other idea is, how about hashes of the CID, with some fast or extremely fast hashing algorithm.
Fast:

  • Murmur3
  • Blake2b/s

Extremely fast, http://www.cse.yorku.ca/~oz/hash.html:

  • djb2
  • sdbm

@kevina
Copy link
Contributor Author

kevina commented Dec 14, 2016

I am not sure I like the idea of hashing again @Kubuxu as it would make it difficult for users to manually find where a key is. I would still prefer the next to last 2 digits as that will give a perfect 1024 way fanout no matter what the key is and can still make it possible to manually find the key. I can live with the last 3 digits though.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 14, 2016

Using last digits (or N from last M) will already be quite hard so I don't think that manual user usability is the first priority.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 16, 2016

I am not against last X or next to last X, but if next to last X gives us better distribution I think we should go for it.

@daviddias
Copy link
Member

daviddias commented Dec 16, 2016

Just for clarification, when the word digit is used, what we mean is char, correct?

Hashing again to increase the distribution would bite us in terms of inspection and would probably force us to add a multihash in from of each block to say which second hash is used, therefore changing the blocks.

Dynamic prefix width would be that we would have to have a way to signal that, or to do several reads.

AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY
I can easily determine that the last 3 digits are CAY and the next-to-last 2 digits are CA I don't see what is hard about tha

Why not just pick the last 3 if the last char is not as random? Does this create 'too much' sharding?

Update from @whyrusleeping:

The reason for not using 'last 3' is because it could result in having 32k directories under the blocks dir
its not not random
its not uniformly random


I would really appreciate that this PR followed a proposed changed in the spec and even better, provide in the spec repo some repo examples with a bunch of blocks in CIDv0 and CIDv1 space that you use for tests so that other implementations can do too.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 16, 2016

Second hashing would be only choosing which directory-shard the key falls in, there in no need for multihash there (it can be changed by migration or be flat-fs specific).

chooseShard(key) = djb2(key)[:2]

or something

Lengths and shards, see: #3463 (comment)

The flat-fs sharding is not speced out anywhere.

@kevina
Copy link
Contributor Author

kevina commented Dec 16, 2016

@diasdavid yes by digit I mean a char in a base 32 representation sorry for the confusing terminology. I was trying to avoid byte to distinguish it from a byte in the raw binary representation.

@daviddias
Copy link
Member

daviddias commented Dec 16, 2016

The next to last doesn't seem that a thing that would be unreasonable to ask for contributors to understand, it just needs to be on the spec :) Consider this my 👍

@kevina
Copy link
Contributor Author

kevina commented Dec 16, 2016

Okay it sounds like we are in agreement, use the next-to-last 2 characters to get a nice guaranteed 1024 level fanout no matter what type of Cid is used. @whyrusleeping agree?

I will update the code and also update the spec.

@whyrusleeping
Copy link
Member

@kevina sounds good to me, ship it

@jbenet
Copy link
Member

jbenet commented Dec 19, 2016

  • Last characters still hurt inspection-- when working with the hashes and dirs manually. But fine by me. I'll deal.
  • BTW, i think we want a larger fanout, may be a good time to do that too. I think very large repos see pain today. How big is the current fanout tuned for?

Separately, it would be nice to make sure the sharding is good for repos of any size. having a constant amount of sharding is bad because repos of different sizes will perform poorly. I think the right solution starts small then auto-scales up and "rearranges all the files" when necessary. this is a whole separate discussion though. For The Future.

@jbenet
Copy link
Member

jbenet commented Dec 19, 2016

Wait, I didn't catch why "the next-to-last 2 chars" instead of "the last 2"-- what was the reason?

@Kubuxu
Copy link
Member

Kubuxu commented Dec 19, 2016

base32 has different count of entropy bits in last byte depending on the length of the input as it converts 5 bytes of input to 8 bytes output. So if you hash is 101 bytes long then last byte will be have only 4.8 bits of entropy, last two bytes will have 12.8 bits instead of 16bits.

@kevina
Copy link
Contributor Author

kevina commented Dec 19, 2016

@jbenet, so that we are on the same page, the current Fanout should be 512, if we implement this as planned it will be 1024.

See #3463 (comment) and my followup comments for the justification for next-to-last. Without it we won't get consistent shading.

I strongly disagree with having a very large fanout by default. For the average size repo that could lead to lots of directories with only one or two files per directory (like we had before we switched to base32 encoding). I have no problem making the fanout configurable for the uncommon case of huge repos.

@jbenet
Copy link
Member

jbenet commented Dec 20, 2016

Ah right. nice.

I think the fanout should adjust automatically. but we can get there.

Another option is to make it 2 tiers by default, instead of one.

@kevina
Copy link
Contributor Author

kevina commented Dec 20, 2016

@jbenet I'm confused. Are you in support of a configurable option? That should be fairly easy to implement.

@whyrusleeping
Copy link
Member

@jbenet this change discussion is a fairly temporary one to make sure that we don't end up with horrible performance between now and when we can come up with a better block storage solution.

I'm okay moving forward with the next to last two given the discussion in this thread so far.

@whyrusleeping
Copy link
Member

Alright, lets move forward with 'next to last two'.

With that, i want two things. I want the README file as @jbenet suggested to be generated by the codebase (filling in a template). If you like, I can write up roughly what i think this should look like, but the gist of it is you should show an example echo foobar | ipfs add to get a hash, then show the series of operations it takes to go from that hash, to a block on disk.

We should also add a file with a multicodec that describes the format, so in this case /repo/flatfs/shard-next-to-last/2 or something (@Kubuxu may have input here). This way you can open up a flatfs datastore without having to know what its sharding format is beforehand.

Then, the conversion tool should be able to handle the required changes to both of those files.

@kevina
Copy link
Contributor Author

kevina commented Dec 21, 2016

@whyrusleeping I will create a README and work on updating the spec.

We should also add a file with a multicodec that describes the format

I am unclear what you are after here.

@whyrusleeping
Copy link
Member

Make sure the README file gets autogenerated based on the flatfs sharding method being used.

For the multicodec, have a file called VERSION or SHARDING or something in the flatfs directory, next to the README that contains a multicodec (a string) that tells us what sharding function is being used there. That way, we can 'open' an existing flatfs datastore without having to have the user specify the correct sharding function

@kevina
Copy link
Contributor Author

kevina commented Dec 22, 2016

@whyrusleeping, autogenerating the README was not what I had in mind and will take some additional work. In addition it may be difficult to include all the information @jbenet wants in an auto-generated README, mainly the code for the sharding function in use.

@kevina
Copy link
Contributor Author

kevina commented Dec 22, 2016

Here is a draft of the README as I believe @jbenet wanted it: https://gist.github.com/kevina/e217dd4c763aaaafdab9657935920da5

@kevina
Copy link
Contributor Author

kevina commented Dec 22, 2016

For the version file I am going to go with a file name SHARDING and am going to use the string

v1/next-to-last/2

The v1 is to allow for an upgrading when we do something more complicated than a simple function. If you want to prefix it with something maybe

/repo/flatfs/shard/v1/next-to-last/2

@kevina
Copy link
Contributor Author

kevina commented Dec 22, 2016

@whyrusleeping fell free to modify what I wrote to something that can more easily to autogenerted, I was going of of what @jbenet originally wrote.

@kevina
Copy link
Contributor Author

kevina commented Dec 22, 2016

See ipfs/go-ds-flatfs#13

@Kubuxu
Copy link
Member

Kubuxu commented Dec 22, 2016

I am for the more verbose version, it can't hurt but might be useful.

@whyrusleeping
Copy link
Member

@kevina go ahead and move forward with the one you wrote (including my comments on the gist), but make it a string constant in the source that gets written on datastore creation.

@whyrusleeping
Copy link
Member

closed by #3608

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/community-input Needs input from the wider community
Projects
None yet
Development

No branches or pull requests

6 participants