bloom filter intersection failure #57

sfletc · 2021-03-22T05:53:31Z

Tried an intersection of 2 bloom filters both with est_elements=16000000, got a list index out of range error

Works fine if both have est_elements=16000001.

If one is 160000000 and the other is 16000001, get a None return on the intersection, rather than throwing an error explaining what the problem is.

barrust · 2021-03-22T12:02:33Z

Sorry you are running into issues. Can you provide a code sample where you are seeing this issue? Using the following code I was unable to reproduce the error you are seeing.

from probables import BloomFilter

blm1 = BloomFilter(est_elements=16000000, false_positive_rate=.05)
blm2 = BloomFilter(est_elements=16000000, false_positive_rate=.05)

for i in range(0, 50000):
    blm1.add(str(i))
    blm2.add(str(i*2))

print(blm1.estimate_elements())
blm3 = blm1.intersection(blm2)
print(blm3.estimate_elements())

Yes, it currently returns None if the two bloom filters are not the same false_positive_rate and number of est_elements since it is unable to create the intersection/union/jaccard index if they do not match. I toyed with the idea of having it throw an exception.

sfletc · 2021-03-23T00:21:35Z

Working:

kmer is an iterator that yields strings from text of length 21 in this case.

from probables import (BloomFilter)
blm = BloomFilter(est_elements=16000001, false_positive_rate=0.001, hash_function=default_sha256)
for i in a.kmer(21):
    blm.add(i)
blm2 = BloomFilter(est_elements=16000001, false_positive_rate=0.001, hash_function=default_sha256)
for i in b.kmer(21):
    blm2.add(i)

blm.elements_added

15560749

blm2.elements_added

13664688

c=blm.intersection(blm2)
c.elements_added

7690975

blm.jaccard_index(blm2)

0.4361343208524868

Not working:

from probables import (BloomFilter)
blm = BloomFilter(est_elements=16000000, false_positive_rate=0.001, hash_function=default_sha256)
for i in a.kmer(21):
    blm.add(i)
blm2 = BloomFilter(est_elements=16000000, false_positive_rate=0.001, hash_function=default_sha256)
for i in b.kmer(21):
    blm2.add(i)

blm.elements_added

15560749

blm2.elements_added

13664688

c=blm.intersection(blm2)

IndexError: list assignment index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-5384312b8cc6> in <module>
----> 1 c=blm.intersection(blm2)
~/miniconda3/lib/python3.8/site-packages/probables/blooms/bloom.py in intersection(self, second)
    165             return None
    166 
--> 167         return _tmp_intersection(self, second)
    168 
    169     def union(self, second):
~/miniconda3/lib/python3.8/site-packages/probables/blooms/bloom.py in _tmp_intersection(first, second)
     65 
     66     for i in list(range(0, first.bloom_length)):
---> 67         res.bloom[i] = first._get_element(i) & second._get_element(i)
     68     res.elements_added = res.estimate_elements()
     69     return res
IndexError: list assignment index out of range

barrust · 2021-03-23T01:57:27Z

Wow, this is a great catch. This issue arises from the desire to be able to export and import the Bloom filters to disk. To make that work we pack the false positive rate into a struct (C) and use that value for importing and exporting. In this exact case, it causes a us to need a single bin difference which causes the overflow. I have a PR coming that should resolve the issue by storing and using the original false positive rate for when a new fpr is needed.

barrust mentioned this issue Mar 23, 2021

FPR issue in some edge cases #58

Merged

barrust closed this as completed in #58 Apr 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bloom filter intersection failure #57

bloom filter intersection failure #57

sfletc commented Mar 22, 2021

barrust commented Mar 22, 2021

sfletc commented Mar 23, 2021

barrust commented Mar 23, 2021

bloom filter intersection failure #57

bloom filter intersection failure #57

Comments

sfletc commented Mar 22, 2021

barrust commented Mar 22, 2021

sfletc commented Mar 23, 2021

Working:

Not working:

barrust commented Mar 23, 2021