Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add counting bloom filter #519

Merged
merged 6 commits into from
Apr 7, 2024

Conversation

proost
Copy link
Contributor

@proost proost commented Mar 31, 2024

Previous Discussion: #510

There are 2 different points from origin design.

  1. Count returns all inserted items. not distinct items.
  2. RemoveMulti deduplicate keys. count in Counting Bloom Filter can't be negative. because of these nature, RemoveMulti operation can be complicated and performance can be bad. so add constraints to avoid difficulties.

@codecov-commenter
Copy link

codecov-commenter commented Mar 31, 2024

Codecov Report

Attention: Patch coverage is 93.78238% with 12 lines in your changes are missing coverage. Please review.

Project coverage is 95.59%. Comparing base (a74b679) to head (f959dc3).
Report is 14 commits behind head on main.

Files Patch % Lines
rueidisprob/countingbloomfilter.go 93.40% 8 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
+ Coverage   95.57%   95.59%   +0.01%     
==========================================
  Files          80       81       +1     
  Lines       33178    33351     +173     
==========================================
+ Hits        31710    31882     +172     
+ Misses       1267     1266       -1     
- Partials      201      203       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rueian
Copy link
Collaborator

rueian commented Mar 31, 2024

Count returns all inserted items. not distinct items.

I guess, a method for getting the distinct count is still needed for users to know whether this filter is close to full or not.

And instead of Exists, should we provide a method for getting the minimum count of a key? Otherwise, users may just use the standard bloom filter.

@proost
Copy link
Contributor Author

proost commented Apr 1, 2024

@rueian

And instead of Exists, should we provide a method for getting the minimum count of a key?

Do you want to remove Exists or Adding a new method? According to definition, Counting Bloom Filter can insert same key several times.

@rueian
Copy link
Collaborator

rueian commented Apr 1, 2024

I feel the Esists is not necessary if we have a method to get the minimum count of a key, but we can keep it anyway. It is still a handy method.

@proost
Copy link
Contributor Author

proost commented Apr 2, 2024

@rueian
I understand what you said. If we had a method that returns minimum count of a key, it is count-min sketch. not counting bloom filter. do you think count-min sketch is more proper? or do you want to integrate with both data structures?

@rueian
Copy link
Collaborator

rueian commented Apr 2, 2024

do you think count-min sketch is more proper? or do you want to integrate with both data structures?

The wiki you referenced above also says that count-min sketch and counting bloom filter are essentially the same data structure. Could you elaborate more on their differences?

I have no preference between count-min sketch and counting bloom filter. I just want this filter to have more functionalities than the previous standard bloom filter.

@proost
Copy link
Contributor Author

proost commented Apr 2, 2024

The wiki you referenced above also says that count-min sketch and counting bloom filter are essentially the same data structure.

what i understand is both data structures almost same logic

  1. use hash function to given key to reduce space usage.
  2. each hashed value store count.

but what i know purpose, functionalities, internal data structure is different.
In terms of purpose, Count-Min Sketch store and get item's frequency but Counting Bloom Filter check whether item is in or not.
In terms of internal data structure, Count-Min Sketch uses 2d array, but Counting Bloom Filter use 1d array.
In terms of functionalities, Count-Min Sketch explicitly supports inserting same data multiple, like this update(data, delta). but Counting Bloom Filter doesn't. AddMulti is just batch function to use Add conveniently in this PR, but not originally created function in the paper.

@proost
Copy link
Contributor Author

proost commented Apr 2, 2024

I have no preference between count-min sketch and counting bloom filter. I just want this filter to have more functionalities than the previous standard bloom filter.

I see. But this implementation may have bigger error rate than count-min sketch(This is uncertain.)

@rueian
Copy link
Collaborator

rueian commented Apr 3, 2024

but Counting Bloom Filter check whether item is in or not.

This makes it look the same as the original bloom filter, except that it could still report exists=true even after removing an item (since the item could have been added multiple times). From this perspective, it seems that having a method to retrieve the minimum count of an item is still reasonable.

@proost
Copy link
Contributor Author

proost commented Apr 6, 2024

@rueian
abda0c8
Adding ItemMinCount and ItemMinCountMulti. But considering purpose of Counting Bloom Filter, leaving left Exists and ExistsMulti makes sense to me.

@proost proost requested a review from rueian April 6, 2024 01:38
for _, key := range keys {
if _, ok := keySet[key]; !ok {
keySet[key] = struct{}{}
deduplicatedKeys = append(deduplicatedKeys, key)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this deduplication? It feels to me that duplication should be allowed.

Copy link
Contributor Author

@proost proost Apr 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f959dc3

To mitigate complex logic and guarantee that count is not negative, i added deduplication. But I remove it it is more complex and can downgrade performance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It is indeed complex but looks good to me now.

@proost proost requested a review from rueian April 7, 2024 13:08
@rueian rueian merged commit 6a8ba48 into redis:main Apr 7, 2024
2 checks passed
@rueian
Copy link
Collaborator

rueian commented Apr 7, 2024

Merged. Thank you @proost!

@proost proost deleted the feat-counting-bloom-filter branch April 8, 2024 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants