Add information about bloom filter to config.md #4924
To clear up possible confusion, the config input is in bytes. Also, this website is quite useful: https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7
docs/config.md
Outdated
This site generates useful graphs for various bloom filter values: <https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7>
You may use it to find a preferred optimal value, where 'm' is BloomFilterSize.
For example, for 1,000,000 blocks, expecting a 1% false positive rate, you'd end up with a filter size of 9592955 bytes. [Currently](https://github.com/ipfs/go-ipfs/blob/9c194aa7e2febeab0cbd895067d7d90d82b137f9/blocks/blockstore/caching.go), 7 hash functions are used by default, so the constant k is 7.
It is 1199120 bytes. The number `m` in the tool is the number of bits.
License: MIT Signed-off-by: Dominic Della Valle <[email protected]>
Okay, I've corrected the unit size and added a reminder for users to do that as well. This should be all the information a user needs to set an appropriate value. How does it look?

This is way more informative than it was before! But reading it still left me with some questions:
- Is there anything we can reliably say here about when & why you’d want to use this, e.g. the typical performance boost over the built-in ARC cache?
- Are there use cases where this does or does not make sense to use?
docs/config.md
Outdated
This site generates useful graphs for various bloom filter values: <https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7>
You may use it to find a preferred optimal value, where `m` is `BloomFilterSize` in bits. Remember to convert the value `m` from bits, into bytes for use as `BloomFilterSize` in the config file.
For example, for 1,000,000 blocks, expecting a 1% false positive rate, you'd end up with a filter size of 9592955 bits, so for `BloomFilterSize` we'd want to use 1199120 bytes.
[Currently](https://github.com/ipfs/go-ipfs/blob/9c194aa7e2febeab0cbd895067d7d90d82b137f9/blocks/blockstore/caching.go), 7 hash functions are used by default, so the constant `k` is 7 in the formula.
This file no longer exists in v0.4.14; should it be https://github.com/ipfs/go-ipfs-blockstore/blob/547442836ade055cc114b562a3cc193d4e57c884/caching.go#L22 ?
Saying 7 is the default makes it sound like another config option should be able to change it, but that doesn’t appear to be the case. As far as I can see from reading the code, this can only be adjusted if you are using `go-ipfs-blockstore` directly as a library (there’s no realistic path to changing it even when using `go-ipfs` directly unless you are willing to skip `builder.setupNode()` entirely, which seems pretty painful).
7 is optimal for 1% false positive rate.
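For reference, this follows from the textbook formula for the optimal number of hash functions (assuming the filter is sized optimally for `n` items):

$$
k_{\text{opt}} = \frac{m}{n}\ln 2 = -\log_2 p, \qquad p = 0.01 \;\Rightarrow\; k_{\text{opt}} = \log_2 100 \approx 6.64,
$$

which rounds to 7.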
License: MIT Signed-off-by: Dominic Della Valle <[email protected]>
@Mr0grog In relation to performance gains, I don't have any stats on hand. There's this but it's old and experimental:
I'm not sure myself, maybe low memory machines would want to avoid this. If there's an inherent gain in all cases, maybe we should consider changing the default and adding a deferring to @Kubuxu for more info
The reason this isn't used by default is that we don't have an estimate of the blockstore's size, so we can't select a good bloom filter size. We could use 1 MiB as a reasonable default.
That makes a lot of sense, so it seems like it would be good to say in the docs. Something like:
I’m assuming that, because there’s such a sharp rise in probability, you could pretty easily surpass the optimal size enough that the work of doing the hashing and lookup in the bloom filter will be an overall waste of time. Is that realistic or unnecessarily alarmist?

Side note: do you have any sense of the practical average size of a block? (Has anyone ever done any analytics on the public gateway for this?) I know a typical file created with a typical IPFS configuration will have 256 KB blocks, but what about the node for the file itself, for directories, or for non-UnixFS nodes? (I’m assuming that the items in the filter are ultimately just the DAG node hashes, whether or not they are leaves with data. Is that right?) Having a rough sense of
During an IRC conversation, this information came up. I figured it might make for a useful suggestion here.
yay or nay?