Support Bloom Filter in parquet reader #4512
Comments
@alamb - I'm interested in picking this up, unless you or someone else is already working on it.
Thanks @ajayaa -- that is great news. No one is actively working on this, though I have time set aside to help with implementation. People who might be interested and were involved with other parts of the implementation include @tustvold, @jimexist, @thinkharderdev, and @Ted-Jiang.
Thanks @alamb. I'm pretty new to Rust, so please bear with me. I should have something in the next 4-5 days.
If no one has started yet, I will start this one 😄
Awesome -- thanks @Ted-Jiang. Another interesting project might be #4085 ;)
Hey @alamb! It seems like this was postponed? Can I take this if @Ted-Jiang isn't working on it anymore?
Hi @ozgrakkurt -- it is fine with me! I don't know of anyone else working on this at this time. Maybe @tustvold knows more, but I suspect the community would be very appreciative of contributions in this area.
Thanks! For now I have switched to an external indexing implementation in my project, but I will try to do this when I get free time.
@Ted-Jiang Are you still working on it?
@ozgrakkurt Do you have time to do this? This is an awesome feature.
@ozgrakkurt Sure, please go ahead! I will be glad if this feature is supported 👍
Maybe you should start with arrow-rs.
@Ted-Jiang I am looking into this issue. I looked at your draft PR and the latest DataFusion code; we can create a method in
@hengfeiyang Have you seen this issue: apache/arrow-rs#3851? I had decided to go this way, but something at my company stopped me from moving on.
@Ted-Jiang Thanks, let me check.
Completed by #7821 |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Bloom filter support was added to arrow-rs in 28.0.0 (as part of apache/arrow-rs#3023). Here is some of that background copy/pasted:
There are use cases where one wants to search a large amount of parquet data for a relatively small number of rows. For example, if you have distributed tracing data stored as parquet files and want to find the data for a particular trace.
In general, the pattern is a "needle in a haystack" type query -- specifically a very selective predicate (passes on only a few rows) on high cardinality (many distinct values) columns.
Datafusion has fairly advanced support for
These techniques are quite effective when data is sorted and large contiguous ranges of rows can be skipped. However, needle in a haystack queries still often require substantial amounts of CPU and IO.
One challenge is that typical high cardinality columns such as ids often (by design) span the entire range of values of the data type.
For example, even in the best case, when the data is "optimally sorted" by id within a row group, min/max statistics cannot help skip row groups or pages. Instead, the entire column must be decoded to search for a particular value.
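To make the min/max limitation concrete, here is a minimal sketch. The `RowGroupStats` struct and `can_skip_by_min_max` function are hypothetical names invented for illustration; they are not part of arrow-rs or DataFusion.

```rust
/// Hypothetical per-row-group statistics for an integer `id` column.
/// (Illustrative only; not the arrow-rs or DataFusion statistics types.)
struct RowGroupStats {
    min: u64,
    max: u64,
}

/// Returns true only when the statistics prove that `needle`
/// cannot appear anywhere in the row group.
fn can_skip_by_min_max(stats: &RowGroupStats, needle: u64) -> bool {
    needle < stats.min || needle > stats.max
}
```

With ids that span almost the whole `u64` range in every row group (say `min = 3`, `max = u64::MAX - 7`), the check returns `false` for nearly any needle, so nothing is pruned. It only helps when a row group covers a narrow range, such as `min = 100`, `max = 200`.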
The parquet file format has support for bloom filters: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
A bloom filter is a space efficient structure that allows determining if a value is not in a set quickly. So for a parquet file with bloom filters for id in the metadata, the entire row group can be skipped if the id is not present:
Describe the solution you'd like
I would like the ParquetReader in DataFusion to take advantage of Bloom filters when they are present.
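As a rough illustration of the idea, the sketch below uses a toy bloom filter to decide which row groups still need to be read for an `id = <constant>` predicate. It is not the split-block bloom filter defined by the parquet format, and it does not use the arrow-rs or DataFusion APIs. `ToyBloom` and `row_groups_to_scan` are hypothetical names, and the hashing scheme (the standard library's `DefaultHasher` seeded with a counter) is an assumption chosen for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration. NOT the parquet split-block format.
struct ToyBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl ToyBloom {
    /// `num_bits` must be greater than zero.
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        ToyBloom { bits: vec![false; num_bits], num_hashes }
    }

    /// Bit position for the `i`-th hash of `value`.
    fn index<T: Hash>(&self, value: &T, i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        i.hash(&mut hasher); // the counter acts as a per-hash seed
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        for i in 0..self.num_hashes {
            let idx = self.index(value, i);
            self.bits[idx] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain<T: Hash>(&self, value: &T) -> bool {
        (0..self.num_hashes).all(|i| self.bits[self.index(value, i)])
    }
}

/// Indices of the row groups that must still be read for `id = needle`,
/// given one (hypothetical) filter per row group.
fn row_groups_to_scan(filters: &[ToyBloom], needle: u64) -> Vec<usize> {
    filters
        .iter()
        .enumerate()
        .filter(|(_, filter)| filter.might_contain(&needle))
        .map(|(i, _)| i)
        .collect()
}
```

A row group whose filter reports `false` can be skipped without decoding the column. A `true` result may be a false positive, so the row group still has to be read and filtered as usual.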
This would be in addition to page_filter and row_filter.
Some high level steps are probably:
OPT_PARQUET_PUSHDOWN_FILTERS: https://github.com/apache/arrow-datafusion/blob/34d9bb5e64e01e1baca4f636c855082f4cadc270/datafusion/core/src/config.rs#L53
col = <constant>)
<constant> in the bloom filter for that column) in https://github.com/apache/arrow-datafusion/blob/34d9bb5e64e01e1baca4f636c855082f4cadc270/datafusion/core/src/physical_plan/file_format/parquet.rs#L481-L486
Describe alternatives you've considered
Don't add support?
Additional context
Some additional support to properly write bloom filters: apache/arrow-rs#3275