Add information about bloom filter to config.md #4924

djdv · 2018-04-06T19:57:56Z

During an IRC conversation, this information came up. I figured it might make for a useful suggestion here.
yay or nay?

[2018.04.04] 14:28:47 <@hsanjuan> Can someone remind me the recommended values for bloom filter size in go-ipfs?
[2018.04.04] 14:46:58 <@Stebalien> - num_blocks * 1.44 * logtwo(probability_of_false_positive)
[2018.04.04] 14:48:27 < djdv> I wonder if that should be noted in https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#datastore
[2018.04.04] 14:49:02 <@Stebalien> So, for a 1% false positive rate and 1m blocks, you'd want a ~1MiB (mebibyte) filter.
[2018.04.04] 14:49:19 <@Stebalien> Note: I haven't tested that, I'm just going off of Wikipedia.
[2018.04.04] 14:49:28 <@Stebalien> https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
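
As a rough check of the numbers in the log, here is a minimal Python sketch of that rule of thumb. The function name `approx_filter_bits` is made up for illustration. The formula itself yields a size in bits, so the sketch also converts to bytes and MiB.

```python
import math

def approx_filter_bits(num_blocks: int, false_positive_rate: float) -> float:
    """Rule of thumb quoted in the log: -n * 1.44 * log2(p), a size in bits."""
    return -num_blocks * 1.44 * math.log2(false_positive_rate)

# 1,000,000 blocks at a 1% false positive rate, as in the log.
bits = approx_filter_bits(1_000_000, 0.01)
size_bytes = math.ceil(bits / 8)  # round up to whole bytes
print(f"{bits:.0f} bits, {size_bytes} bytes, {size_bytes / 2**20:.2f} MiB")
```

For these inputs the result is a little over 1 MiB, in line with the "~1MiB" estimate quoted in the log.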

Kubuxu · 2018-04-06T22:29:41Z

To clear up possible confusion, the config input is in bytes.

Also, this website is quite useful: https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7

djdv · 2018-04-08T23:30:08Z

@Kubuxu
I added a reference to that tool. Does the new statement look accurate and helpful?

Kubuxu · 2018-04-09T05:00:41Z

docs/config.md

+
+This site generates useful graphs for various bloom filter values: <https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7>
+You may use it to find a preferred optimal value, where 'm' is BloomFilterSize.
+For example, for 1,000,000 blocks, expecting a 1% false positive rate, you'd end up with a filter size of 9592955 bytes. [Currently](https://github.com/ipfs/go-ipfs/blob/9c194aa7e2febeab0cbd895067d7d90d82b137f9/blocks/blockstore/caching.go), 7 hash functions are used by default, so the constant k is 7.


It is 1199120 bytes. The number m in the tool is the number of bits.
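
A small Python sketch of the conversion, for anyone checking the arithmetic. It assumes the standard bloom filter estimate p = (1 - e^(-kn/m))^k solved for m with k fixed at 7; whether the linked tool uses exactly this expression is an assumption, and the function name is hypothetical.

```python
import math

def filter_bits_for_fixed_k(n: int, p: float, k: int) -> int:
    """Smallest m (in bits) with (1 - exp(-k*n/m))**k <= p for a fixed hash count k."""
    return math.ceil(-k * n / math.log(1 - p ** (1 / k)))

m_bits = filter_bits_for_fixed_k(1_000_000, 0.01, 7)
# BloomFilterSize is given in bytes, so divide by 8 and round up.
bloom_filter_size = math.ceil(m_bits / 8)
print(m_bits, bloom_filter_size)
```

The output should land very close to the 9592955 bits and 1199120 bytes discussed here.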

License: MIT Signed-off-by: Dominic Della Valle <[email protected]>

djdv · 2018-04-09T12:24:20Z

Okay, I've corrected the unit size and added a reminder for users to do that as well. This should be all the information a user needs to set an appropriate value. How does it look?

Mr0grog

This is way more informative than it was before! But reading it still left me with some questions:

Is there anything we can reliably say here about when & why you’d want to use this, e.g. the typical performance boost over the built-in ARC cache?
Are there use cases where this does or does not make sense to use?

Mr0grog · 2018-04-09T18:54:27Z

docs/config.md

+This site generates useful graphs for various bloom filter values: <https://hur.st/bloomfilter/?n=1e6&p=0.01&m=&k=7>  
+You may use it to find a preferred optimal value, where `m` is `BloomFilterSize` in bits. Remember to convert the value `m` from bits, into bytes for use as `BloomFilterSize` in the config file.  
+For example, for 1,000,000 blocks, expecting a 1% false positive rate, you'd end up with a filter size of 9592955 bits, so for `BloomFilterSize` we'd want to use 1199120 bytes.  
+[Currently](https://github.com/ipfs/go-ipfs/blob/9c194aa7e2febeab0cbd895067d7d90d82b137f9/blocks/blockstore/caching.go), 7 hash functions are used by default, so the constant `k` is 7 in the formula.


This file no longer exists in v0.4.14; should it be https://github.com/ipfs/go-ipfs-blockstore/blob/547442836ade055cc114b562a3cc193d4e57c884/caching.go#L22 ?

Saying 7 is the default makes it sound like another config option should be able to change it, but that doesn’t appear to be the case. As far as I can see from reading the code, this can only be adjusted if you are using go-ipfs-blockstore directly as a library (there’s no realistic path to changing it even when using go-ipfs directly unless you are willing to skip builder.setupNode() entirely, which seems pretty painful).

7 is optimal for a 1% false positive rate.
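
For reference, with an optimally sized filter the ideal hash count for a target false positive rate p is k = -log2(p). A quick Python check (the helper name is illustrative only, not part of go-ipfs):

```python
import math

def optimal_hash_count(p: float) -> float:
    """Ideal (real-valued) number of hash functions for false positive rate p."""
    return -math.log2(p)

k = optimal_hash_count(0.01)
print(f"{k:.2f} rounds to {round(k)}")
```

For p = 0.01 this gives a value of about 6.64, which rounds to 7.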

djdv · 2018-04-09T19:52:39Z

@Mr0grog
I've updated the link and omitted "by default".
In addition, since the link is permanent and the default is subject to change upstream (even if unlikely), I've reworded things a bit and changed the link anchor.

In relation to performance gains, I don't have any stats on hand. There's this but it's old and experimental:
#3479

> Are there use cases where this does or does not make sense to use?

I'm not sure myself; maybe low-memory machines would want to avoid this. If there's an inherent gain in all cases, maybe we should consider changing the default and adding a lowmem profile to init.

Deferring to @Kubuxu for more info.

Kubuxu · 2018-04-10T03:07:11Z

The reason this isn't used by default is that we don't have an estimate of the size of the blockstore, so we can't select a good bloom filter size. We could use 1 MiB as a reasonable default.

Mr0grog · 2018-04-10T05:41:44Z

That makes a lot of sense, so it seems like it would be good to say in the docs. Something like:

> The bloom filter is disabled by default because the most appropriate size depends heavily on how many blocks you expect to store. A value that works well for a small storage scenario could make performance worse in a large storage scenario.

I’m assuming that, because there’s such a sharp rise in probability, you could pretty easily surpass the optimal size enough that the work of doing the hashing and lookup in the bloom filter will be an overall waste of time. Is that realistic or unnecessarily alarmist?
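
One way to gauge this is to evaluate the standard false positive estimate p = (1 - e^(-kn/m))^k for a fixed filter while the block count grows. The Python sketch below assumes k = 7 and the 1199120-byte filter from the example in the diff; the helper name is hypothetical, and the sketch says nothing about the actual lookup cost in go-ipfs.

```python
import math

def false_positive_rate(n_blocks: int, size_bytes: int, k: int = 7) -> float:
    """Estimated false positive rate for n_blocks stored in a filter of size_bytes."""
    m_bits = size_bytes * 8
    return (1 - math.exp(-k * n_blocks / m_bits)) ** k

SIZE = 1_199_120  # bytes, sized for 1,000,000 blocks at 1%
for n in (1_000_000, 2_000_000, 5_000_000, 10_000_000):
    print(f"{n:>10} blocks: {false_positive_rate(n, SIZE):.1%}")
```

Under these assumptions the estimate goes from about 1% at the design size to roughly 16% at twice as many blocks, and above 80% at five times as many, so a filter sized for a small store quickly stops filtering much once the store outgrows it.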

Side note: do you have any sense of the practical average size of a block? (Has anyone ever done any analytics on the public gateway for this?) I know a typical file created with a typical IPFS configuration will have 256 KB blocks, but what about the node for the file itself, for directories, or for non-UnixFS nodes? (I’m assuming that the items in the filter are ultimately just the DAG node hashes, whether or not they are leaves with data. Is that right?) Having a rough sense of N nodes ≈ X MB of storage might help people estimate an ideal filter size.

djdv requested a review from Kubuxu as a code owner April 6, 2018 19:57

ghost assigned djdv Apr 6, 2018

ghost added the status/in-progress In progress label Apr 6, 2018

Kubuxu added the need_signoff label Apr 7, 2018

djdv force-pushed the docs/config branch 2 times, most recently from 91cf85b to bdedec2 Compare April 8, 2018 23:27

djdv removed the need_signoff label Apr 8, 2018

Kubuxu reviewed Apr 9, 2018

djdv force-pushed the docs/config branch 2 times, most recently from 6013188 to ac92988 Compare April 9, 2018 12:18

Add information about bloom filter to config.md

0ecfd78

License: MIT Signed-off-by: Dominic Della Valle <[email protected]>

djdv force-pushed the docs/config branch from ac92988 to 0ecfd78 Compare April 9, 2018 12:20

Mr0grog reviewed Apr 9, 2018

Revised

c3e3da4

License: MIT Signed-off-by: Dominic Della Valle <[email protected]>

Kubuxu approved these changes Jul 10, 2018

Kubuxu added the RFM label Jul 10, 2018

whyrusleeping merged commit 419bfdc into master Jul 16, 2018

ghost removed the status/in-progress In progress label Jul 16, 2018

whyrusleeping deleted the docs/config branch July 16, 2018 15:12
