Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parquet page filtering for string columns #4132

Closed
wants to merge 4 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 7, 2022

Draft as it builds on tests in #4131

Which issue does this PR close?

Part of #3833

Rationale for this change

I want to be able to use parquet page index filtering for string datatypes in IOx.

What changes are included in this PR?

  1. Implement page index filtering for string columns
  2. Test

Are there any user-facing changes?

Hopefully faster parquet predicate evaluation on string columns

@github-actions github-actions bot added the core Core DataFusion crate label Nov 7, 2022
Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
Index::BYTE_ARRAY(index) => {
let vec = &index.indexes;
let array: StringArray = vec
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure if this is ok (like what if the parquet data got mapped to a LargeStringArray? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need check the logical type for the value.
BYTE_ARRAY in the parquet can represent many logical types, such as DECIMAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also support the type BYTE_ARRAY in the null_counts of PagesPruningStatistics

.with_page_index_filtering_expected(PageIndexFilteringExpected::Some)
.with_expected_rows(2574)
.run()
.await;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this now passes!

@isidentical
Copy link
Contributor

isidentical commented Nov 7, 2022

@alamb
Copy link
Contributor Author

alamb commented Nov 8, 2022

Thanks for the comments @liukun4515 and @isidentical -- I agree about the nullcounts. I will work on this PR more to address your comments

@alamb alamb marked this pull request as draft November 8, 2022 14:45
@liukun4515
Copy link
Contributor

Thanks for the comments @liukun4515 and @isidentical -- I agree about the nullcounts. I will work on this PR more to address your comments

maybe you can just implement the case about utf8 type, I can help you to implement other logical data type for example Decimal.

@Ted-Jiang
Copy link
Member

@alamb Would your mind i continue on this, with pr on your branch?😄

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2022

@Ted-Jiang -- go right ahead (or also feel free to make another PR and I can close this one0 -- I haven't had the time to work on this for the last few days. I will likely get back to it either later this week or next if you don't have a chance to

Support parquet page filtering for decimal128 columns
@alamb
Copy link
Contributor Author

alamb commented Nov 16, 2022

@Ted-Jiang I am not sure what happened to this branch -- can you please make a new branch /new PR for this feature? I didn't have a chance to work on (or review) your PR today because some other unexpected work appeared. 😢

There is so much going on in DataFusion I can barely keep up!

@alamb alamb closed this Nov 16, 2022
@Ted-Jiang
Copy link
Member

There is so much going on in DataFusion I can barely keep up!

of course! You've given too much to this community ❤️

@alamb alamb deleted the alamb/support_string_stats branch August 8, 2023 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants