Support parquet page filtering for string columns #4132
Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
Index::BYTE_ARRAY(index) => {
let vec = &index.indexes;
let array: StringArray = vec
I am not 100% sure this is OK (for example, what if the parquet data got mapped to a LargeStringArray?) 🤔
We need to check the logical type of the value: BYTE_ARRAY in Parquet can represent many logical types, such as DECIMAL.
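To illustrate the concern, here is a minimal, self-contained sketch of dispatching on the logical type before decoding a BYTE_ARRAY page statistic. The `LogicalKind` and `PageStat` enums and both functions are hypothetical stand-ins written for this example; they are not the parquet or DataFusion APIs. The decimal branch assumes the Parquet encoding of DECIMAL in a byte array (big-endian two's complement unscaled value).

```rust
/// Hypothetical stand-in for the logical type attached to a BYTE_ARRAY column.
#[derive(Clone, Copy)]
enum LogicalKind {
    Utf8,
    Decimal,
    Unknown,
}

/// Hypothetical decoded page statistic.
#[derive(Debug, PartialEq)]
enum PageStat {
    Utf8(String),
    Decimal128(i128),
}

/// Decode a big-endian two's complement byte slice into an i128 by
/// sign-extending it to 16 bytes. Returns None for empty or oversized input.
fn decode_decimal(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Fill with 0xFF for negative values, 0x00 otherwise.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}

/// Decode a raw page min/max value according to the column's logical type.
/// Returning None means "no usable statistic", so the page cannot be pruned.
fn decode_page_stat(kind: LogicalKind, raw: Option<&[u8]>) -> Option<PageStat> {
    let raw = raw?;
    match kind {
        LogicalKind::Utf8 => std::str::from_utf8(raw)
            .ok()
            .map(|s| PageStat::Utf8(s.to_string())),
        LogicalKind::Decimal => decode_decimal(raw).map(PageStat::Decimal128),
        // Unknown logical type: do not guess, treat as missing statistics.
        LogicalKind::Unknown => None,
    }
}
```

Falling back to `None` for anything unrecognized keeps the sketch conservative: a page without a usable statistic is simply never skipped.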
We should also support the BYTE_ARRAY type in the null_counts of PagesPruningStatistics.
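As a rough sketch of what that could look like, the snippet below collects per-page null counts for a byte-array column. `BytePage` and `null_counts` are hypothetical names invented for this example (min/max fields are omitted for brevity); the real `PagesPruningStatistics` implementation may be shaped quite differently.

```rust
/// Hypothetical per-page index entry for a BYTE_ARRAY column.
/// Only the null count is modelled here.
struct BytePage {
    /// Number of nulls in the page, if the writer recorded it.
    null_count: Option<i64>,
}

/// Collect one null count per page. Missing or negative (invalid) counts
/// become None so that no pruning decision is based on them.
fn null_counts(pages: &[BytePage]) -> Vec<Option<u64>> {
    pages
        .iter()
        .map(|p| match p.null_count {
            Some(n) if n >= 0 => Some(n as u64),
            _ => None,
        })
        .collect()
}
```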
.with_page_index_filtering_expected(PageIndexFilteringExpected::Some)
.with_expected_rows(2574)
.run()
.await;
this now passes!
@alamb does it make sense to also include support for strings inside
Thanks for the comments @liukun4515 and @isidentical -- I agree about the null counts. I will work on this PR more to address your comments
Maybe you can implement just the utf8 case; I can help implement the other logical data types, for example Decimal.
@alamb Would you mind if I continue on this, with a PR against your branch? 😄
Signed-off-by: yangjiang <[email protected]>
@Ted-Jiang -- go right ahead (or also feel free to make another PR and I can close this one) -- I haven't had the time to work on this for the last few days. I will likely get back to it either later this week or next if you don't have a chance to
Support parquet page filtering for decimal128 columns
@Ted-Jiang I am not sure what happened to this branch -- can you please make a new branch / new PR for this feature? I didn't have a chance to work on (or review) your PR today because some other unexpected work appeared. 😢 There is so much going on in DataFusion I can barely keep up!
Of course! You've given too much to this community ❤️
Draft as it builds on tests in #4131
Which issue does this PR close?
Part of #3833
Rationale for this change
I want to be able to use parquet page index filtering for string datatypes in IOx.
What changes are included in this PR?
Are there any user-facing changes?
Hopefully faster parquet predicate evaluation on string columns
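For readers unfamiliar with the idea, the following is a minimal sketch of how per-page min/max values on a string column can let a reader skip pages for an equality predicate. `PageStats`, `page_may_match_eq`, and `pages_to_scan` are hypothetical names for this illustration and are not taken from the PR. It relies on Rust's `str` ordering being byte-wise, which matches the unsigned byte-wise sort order Parquet uses for UTF8 values.

```rust
/// Hypothetical per-page min/max statistics for a string column.
struct PageStats<'a> {
    min: Option<&'a str>,
    max: Option<&'a str>,
}

/// Returns true if the page may contain a row where `col = needle`.
/// A page with missing statistics must always be scanned.
fn page_may_match_eq(stats: &PageStats, needle: &str) -> bool {
    match (stats.min, stats.max) {
        (Some(min), Some(max)) => min <= needle && needle <= max,
        _ => true,
    }
}

/// Indexes of the pages that cannot be ruled out for `col = needle`.
fn pages_to_scan(pages: &[PageStats], needle: &str) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| page_may_match_eq(p, needle))
        .map(|(i, _)| i)
        .collect()
}
```

Pruning of this kind is only ever conservative: it may scan a page that turns out to contain no match, but it never skips a page that could contain one.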