Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bloom filter intersection failure #57

Closed
sfletc opened this issue Mar 22, 2021 · 3 comments · Fixed by #58
Closed

bloom filter intersection failure #57

sfletc opened this issue Mar 22, 2021 · 3 comments · Fixed by #58

Comments

@sfletc
Copy link

sfletc commented Mar 22, 2021

Tried an intersection of 2 bloom filters both with est_elements=16000000, got a list index out of range error

Works fine if both have est_elements=16000001.

If one is 160000000 and the other is 16000001, get a None return on the intersection, rather than throwing an error explaining what the problem is.

@barrust
Copy link
Owner

barrust commented Mar 22, 2021

Sorry you are running into issues. Can you provide a code sample where you are seeing this issue? Using the following code I was unable to reproduce the error you are seeing.

from probables import BloomFilter

blm1 = BloomFilter(est_elements=16000000, false_positive_rate=.05)
blm2 = BloomFilter(est_elements=16000000, false_positive_rate=.05)

for i in range(0, 50000):
    blm1.add(str(i))
    blm2.add(str(i*2))

print(blm1.estimate_elements())
blm3 = blm1.intersection(blm2)
print(blm3.estimate_elements())

Yes, it currently returns None if the two bloom filters are not the same false_positive_rate and number of est_elements since it is unable to create the intersection/union/jaccard index if they do not match. I toyed with the idea of having it throw an exception.

@sfletc
Copy link
Author

sfletc commented Mar 23, 2021

Working:

kmer is an iterator that yields strings from text of length 21 in this case.

from probables import (BloomFilter)
blm = BloomFilter(est_elements=16000001, false_positive_rate=0.001, hash_function=default_sha256)
for i in a.kmer(21):
    blm.add(i)
blm2 = BloomFilter(est_elements=16000001, false_positive_rate=0.001, hash_function=default_sha256)
for i in b.kmer(21):
    blm2.add(i)
blm.elements_added

15560749

blm2.elements_added

13664688

c=blm.intersection(blm2)
c.elements_added

7690975

blm.jaccard_index(blm2)

0.4361343208524868

Not working:

from probables import (BloomFilter)
blm = BloomFilter(est_elements=16000000, false_positive_rate=0.001, hash_function=default_sha256)
for i in a.kmer(21):
    blm.add(i)
blm2 = BloomFilter(est_elements=16000000, false_positive_rate=0.001, hash_function=default_sha256)
for i in b.kmer(21):
    blm2.add(i)
blm.elements_added

15560749

blm2.elements_added

13664688

c=blm.intersection(blm2)
IndexError: list assignment index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-5384312b8cc6> in <module>
----> 1 c=blm.intersection(blm2)
~/miniconda3/lib/python3.8/site-packages/probables/blooms/bloom.py in intersection(self, second)
    165             return None
    166 
--> 167         return _tmp_intersection(self, second)
    168 
    169     def union(self, second):
~/miniconda3/lib/python3.8/site-packages/probables/blooms/bloom.py in _tmp_intersection(first, second)
     65 
     66     for i in list(range(0, first.bloom_length)):
---> 67         res.bloom[i] = first._get_element(i) & second._get_element(i)
     68     res.elements_added = res.estimate_elements()
     69     return res
IndexError: list assignment index out of range

@barrust
Copy link
Owner

barrust commented Mar 23, 2021

Wow, this is a great catch. This issue arises from the desire to be able to export and import the Bloom filters to disk. To make that work we pack the false positive rate into a struct (C) and use that value for importing and exporting. In this exact case, it causes a us to need a single bin difference which causes the overflow. I have a PR coming that should resolve the issue by storing and using the original false positive rate for when a new fpr is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants