-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Extract parquet data page statistics API #10852
Conversation
@alamb I think the most interesting part might be the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very impressive @marvinlanhenke -- I think this is pretty close to being able to merge
Given the complexity of this code I think it would be good to get in a version sooner and start iterating / filling out support
I wouldn't be surprised if we figured out some additional changes to make to the iterators based on experience implementing other types, but that should be totally fine.
So how about this for a plan:
- You polish up this PR (it looks like there are some doc comments to update)
- I will file a follow on ticket to make a test with multiple datapages in a row group
- Once we have done a few more types, I'll file a bunch more tickets to fill out the DataTypes as we did for row group
// Thus no statistics that can be extracted. | ||
// We return vec![None] to effectively | ||
// create an arrow null-array. | ||
_ => Some(vec![None]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning vec![None]
would return only a single entry -- I think it needs to be the same size as what would come from native_index
So I think this should be something like this
_ => Some(vec![None]), | |
_ => Some(vec![None; native_index.len()]), |
However, in order to have this make a difference we need to make a test that has multiple data pages in a row group, which the current setup does not. I can help write such a test as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right - and we definitely need test coverage here (I'm still confused sometimes (actually all the time) about this logic 🤯).
However, I think we need a different approach since we wouldn't have access to native_index.indexes
, if we cannot match the index, or we encounter the Index::NONE
variant.
Based on the implementation in page_filter.rs
here we should probably construct a vec![None; len]
where len = page_offset_index.len()
and page_offset_index: Vec<PageLocation>
.
I'll try make some changes here (also the API needs to change slightly) and we can discuss further?
I'll also try to make a test-case with multiple data_pages per row_group (haven't found the setting, yet...)
WDYT @alamb
EDIT: I made some changes to the API. PTAL. Now, I'll try to add a test_case to validate the correct behavior.
@alamb Perhaps we can do the test with multiple datapages in this PR? So we have a correct version in place before merging and filing out tickets for more datatype support? However, I could use some pointers here, haven't found a working setting to produce mutliple data pages per row group yet. So I'd like to finish polishing this PR with adding docs and the extra test setup for multiple datapages. Some other todos which I have on my list (which could all be follow up PRs):
|
Thanks @marvinlanhenke -- this PR is very much on my list. I hope to find time tomorrow to help get this one ready (e.g. help write the tests) as well as do some ticket cleanup so the current state of statistics extraction code is cleraer |
Thanks @alamb and no worries, I know you are traveling so this can wait. |
I plan to sort this PR and follow on tickets out tomorrow |
Ok, I spent some time working on a test that had multiple data pages and have checked it in in. @marvinlanhenke any chance you can give this PR another reveiew to see if you think it is ready? |
}; | ||
|
||
let iter = row_group_indices.into_iter().map(|rg_index| { | ||
let column_page_index_per_row_group_per_column = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow those structures are hard to use 🤯 -- seems like having an accessor would help a lot. Something to consider upstream maybe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes a lot of sense.
I think once we are happy with this PR we can merge it in and then I'll file tickets for filling out the other types. I feel like we may be able to make the tests a little better too 🤔 |
} | ||
.build(); | ||
|
||
// Data layout looks like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb
Thanks for your help on the test. I think its nearly ready, left a question about the test setup. Unfortunately I'm pretty busy this weekend; so it might take till monday before I can work on this.
No worries -- thanks @marvinlanhenke ! I'll merge this PR in and we can keep iterating as some follow ons |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again
🚀 |
* feat: enable page statistics * feat: prototype int64 data_page_min * feat: prototype MinInt64DataPageStatsIterator * feat: add make_data_page_stats_iterator macro * feat: add get_data_page_statistics macro * feat: add MaxInt64DataPageStatsIterator * feat: add test_data_page_stats param * chore: add testcase int64_with_nulls * feat: add data page null_counts * fix: clippy * chore: rename column_page_index * feat: add data page row counts * feat: add num_data_pages to iterator * chore: update docs * fix: use colum_offset len in data_page_null_counts * fix: docs * tweak comments * update test helper * Add explicit multi-data page tests to statistics test * Add explicit data page test * remove duplicate test * update coverage --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #10806.
However, for now this is only a draft for further discussions.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?