-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default 5 digit prefix for flatfs is no longer sufficient for CidV1 #3463
Comments
@whyrusleeping I can likely implement the last two digits quickly, the dynamic solution will likely take more work and I don't think it should be for 0.4.5. |
Also as a followup from the sprint meeting: The flatfs datastore doesn't even implement range queries, so I don't think this will be an issue at all. Once we figure out how to properly implement something dynamic than we can revisit the issue of range queries. |
I agree with that dynamic solution might be too complex to tackle for 0.4.5. |
How big of a problem is this? has anyone quantified this with a graph on read latency overhead? particularly for a node under lots of activity? I dont think we should do (2). you're shifting the problem to an unconventional thing, without telling users that this has shifted. i may be fine if we add a file called
I think this file o/ would make me feel ok about changing the function that figures out the location of the content. |
|
@jbenet, the read latency overhead is really dependent on the filesystem. In addition some file systems (such as ext4 when dir_index is enabled) start to have a problem when the directory gets too large. I will get some numbers for you on why (1) would be a problem now (and wasn't before) in a bit. |
its fine-- just go with (2) and add that file. |
@jbenet, okay will do. Thanks! |
@jbenet @whyrusleeping should we just modify go-ds-flatfs to use suffix or create a new version and call it maybe go-ds-flatfs-suffix? (Modifying it too do both will likely create a lot of a special case code I would like to avoid). |
@kevina make the sharding selector a function so we can specify on the creation of the flatfs datastore. This will make migrations so much easier for me. |
@whyrusleeping All right I will see if I can make that work. I am also willing to handle most of the migration. |
See ipfs/go-ds-flatfs#6 for a start. The README will be generated by go-ipfs not flatfs as it is go-ipfs specific. |
Because we use Base32 encoding the very last byte/digit is not quite as random as I would hope and also depends on the length of the binary representation:
When converting the repo created for #3462 that has a mixture of CidV0 and CidV1 raw leaves a 2 digit suffix I got me 256 levels of shading (instead of the expected 1024) and I didn't check but the distribution is unlikely to be even. With additional Cid types this number could go up to 1024 directories depending on the length of the key. Increasing the suffix to 3 digits will increase the number of directories to 8192, but that could go up to 32768 directories. Another possibility (that will be easy for me to implement based on what I got now), is disregard the very last digit and take the next to last 2 digits, then we get an evenly distributed 1024 way sharding. For example if the key is AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY the directory used will be @whyrusleeping what do you think. Sorry for this oversight. |
Okay I ran some simulations using the keys from the repo created in #3462. I confirmed that using a fixed prefix just won't work when using both CidV0 and CidV1 keys, to make it long enough to handle CidV1 you end with lots of directories with only a few files in them when CidV0 is used. The following options could work:
The distribution in all cases is reasonable even (I was wrong in my initial guess) in all cases. Here is the list of keys used and various results: https://gateway.ipfs.io/ipfs/QmNbugVGQDcMpDYezAELasskUWyPms9JAjFBRYuJcZtqFC |
I would either go with the suffix length of 2 or use the next to last 2 digits. A suffix length of 3 will create two many directories for an average sized repo, especially if they are keys that have a length that is a multiple of 5 binary bytes (in which case there could be up to 32768 directories). We could also make the suffix length a configurable option with the default value being 2. I am not sure what additional complications this will bring though. |
Ah, so the last character in the base32 encoded keys don't contain an entire 5 bits of data. Given thats the case, lets go ahead and use a suffix length of three. |
@whyrusleeping are you sure? That can lead to a large numbers of directives with only a few a few files per directory for an average sized repo. If you don't like my next-to-last idea, I would stick with 2 or make it configurable. |
yep |
Other idea is, how about hashes of the CID, with some fast or extremely fast hashing algorithm.
Extremely fast, http://www.cse.yorku.ca/~oz/hash.html:
|
I am not sure I like the idea of hashing again @Kubuxu as it would make it difficult for users to manually find where a key is. I would still prefer the next to last 2 digits as that will give a perfect 1024 way fanout no matter what the key is and can still make it possible to manually find the key. I can live with the last 3 digits though. |
Using last digits (or N from last M) will already be quite hard so I don't think that manual user usability is the first priority. |
I am not against last X or next to last X, but if next to last X gives us better distribution I think we should go for it. |
Just for clarification, when the word Hashing again to increase the distribution would bite us in terms of inspection and would probably force us to add a multihash in from of each block to say which second hash is used, therefore changing the blocks. Dynamic prefix width would be that we would have to have a way to signal that, or to do several reads.
Why not just pick the last 3 if the last char is not as random? Does this create 'too much' sharding? Update from @whyrusleeping:
I would really appreciate that this PR followed a proposed changed in the spec and even better, provide in the spec repo some repo examples with a bunch of blocks in CIDv0 and CIDv1 space that you use for tests so that other implementations can do too. |
Second hashing would be only choosing which directory-shard the key falls in, there in no need for multihash there (it can be changed by migration or be flat-fs specific).
or something Lengths and shards, see: #3463 (comment) The flat-fs sharding is not speced out anywhere. |
@diasdavid yes by digit I mean a char in a base 32 representation sorry for the confusing terminology. I was trying to avoid byte to distinguish it from a byte in the raw binary representation. |
The next to last doesn't seem that a thing that would be unreasonable to ask for contributors to understand, it just needs to be on the spec :) Consider this my 👍 |
Okay it sounds like we are in agreement, use the next-to-last 2 characters to get a nice guaranteed 1024 level fanout no matter what type of Cid is used. @whyrusleeping agree? I will update the code and also update the spec. |
@kevina sounds good to me, ship it |
Separately, it would be nice to make sure the sharding is good for repos of any size. having a constant amount of sharding is bad because repos of different sizes will perform poorly. I think the right solution starts small then auto-scales up and "rearranges all the files" when necessary. this is a whole separate discussion though. For The Future. |
Wait, I didn't catch why "the next-to-last 2 chars" instead of "the last 2"-- what was the reason? |
base32 has different count of entropy bits in last byte depending on the length of the input as it converts 5 bytes of input to 8 bytes output. So if you hash is 101 bytes long then last byte will be have only 4.8 bits of entropy, last two bytes will have 12.8 bits instead of 16bits. |
@jbenet, so that we are on the same page, the current Fanout should be 512, if we implement this as planned it will be 1024. See #3463 (comment) and my followup comments for the justification for next-to-last. Without it we won't get consistent shading. I strongly disagree with having a very large fanout by default. For the average size repo that could lead to lots of directories with only one or two files per directory (like we had before we switched to base32 encoding). I have no problem making the fanout configurable for the uncommon case of huge repos. |
Ah right. nice. I think the fanout should adjust automatically. but we can get there. Another option is to make it 2 tiers by default, instead of one. |
@jbenet I'm confused. Are you in support of a configurable option? That should be fairly easy to implement. |
@jbenet this change discussion is a fairly temporary one to make sure that we don't end up with horrible performance between now and when we can come up with a better block storage solution. I'm okay moving forward with the next to last two given the discussion in this thread so far. |
Alright, lets move forward with 'next to last two'. With that, i want two things. I want the README file as @jbenet suggested to be generated by the codebase (filling in a template). If you like, I can write up roughly what i think this should look like, but the gist of it is you should show an example We should also add a file with a multicodec that describes the format, so in this case Then, the conversion tool should be able to handle the required changes to both of those files. |
@whyrusleeping I will create a README and work on updating the spec.
I am unclear what you are after here. |
Make sure the README file gets autogenerated based on the flatfs sharding method being used. For the multicodec, have a file called VERSION or SHARDING or something in the flatfs directory, next to the README that contains a multicodec (a string) that tells us what sharding function is being used there. That way, we can 'open' an existing flatfs datastore without having to have the user specify the correct sharding function |
@whyrusleeping, autogenerating the README was not what I had in mind and will take some additional work. In addition it may be difficult to include all the information @jbenet wants in an auto-generated README, mainly the code for the sharding function in use. |
Here is a draft of the README as I believe @jbenet wanted it: https://gist.github.com/kevina/e217dd4c763aaaafdab9657935920da5 |
For the version file I am going to go with a file name
The v1 is to allow for an upgrading when we do something more complicated than a simple function. If you want to prefix it with something maybe
|
@whyrusleeping fell free to modify what I wrote to something that can more easily to autogenerted, I was going of of what @jbenet originally wrote. |
I am for the more verbose version, it can't hurt but might be useful. |
@kevina go ahead and move forward with the one you wrote (including my comments on the gist), but make it a string constant in the source that gets written on datastore creation. |
closed by #3608 |
When we switch to Cidv1 the assumptions we built into the 5 (base 32) digit prefix:
are no longer correct, in fact when raw leaves are used most of the blocks will be clustered in the
AFZBE
directory. See #3462. There are several ways to fix this:I think (2) would be the best alternative until (3) can be implemented.
The text was updated successfully, but these errors were encountered: