Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Extract parquet data page statistics API #10852

Merged
merged 23 commits into from
Jun 15, 2024

Conversation

marvinlanhenke
Copy link
Contributor

Which issue does this PR close?

Closes #10806.

However, for now this is only a draft for further discussions.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@marvinlanhenke marvinlanhenke marked this pull request as draft June 10, 2024 09:43
@github-actions github-actions bot added the core Core DataFusion crate label Jun 10, 2024
@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Jun 10, 2024

@alamb
If you have some time to spare, I'd appreciate if you can take a quick look if I'm on the right track here (with regards to our discussion in #10806).

I think the most interesting part might be the MinInt64DataPageStatsIterator, perhaps there is a more rust idiomatic way to handle things here?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very impressive @marvinlanhenke -- I think this is pretty close to being able to merge

Given the complexity of this code I think it would be good to get in a version sooner and start iterating / filling out support

I wouldn't be surprised if we figured out some additional changes to make to the iterators based on experience implementing other types, but that should be totally fine.

So how about this for a plan:

  1. You polish up this PR (it looks like there are some doc comments to update)
  2. I will file a follow on ticket to make a test with multiple datapages in a row group
  3. Once we have done a few more types, I'll file a bunch more tickets to fill out the DataTypes as we did for row group

// Thus no statistics that can be extracted.
// We return vec![None] to effectively
// create an arrow null-array.
_ => Some(vec![None]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning vec![None] would return only a single entry -- I think it needs to be the same size as what would come from native_index

So I think this should be something like this

Suggested change
_ => Some(vec![None]),
_ => Some(vec![None; native_index.len()]),

However, in order to have this make a difference we need to make a test that has multiple data pages in a row group, which the current setup does not. I can help write such a test as a follow on PR

Copy link
Contributor Author

@marvinlanhenke marvinlanhenke Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right - and we definitely need test coverage here (I'm still confused sometimes (actually all the time) about this logic 🤯).

However, I think we need a different approach since we wouldn't have access to native_index.indexes, if we cannot match the index, or we encounter the Index::NONE variant.

Based on the implementation in page_filter.rs here we should probably construct a vec![None; len] where len = page_offset_index.len() and page_offset_index: Vec<PageLocation>.

I'll try make some changes here (also the API needs to change slightly) and we can discuss further?
I'll also try to make a test-case with multiple data_pages per row_group (haven't found the setting, yet...)

WDYT @alamb

EDIT: I made some changes to the API. PTAL. Now, I'll try to add a test_case to validate the correct behavior.

@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Jun 11, 2024

So how about this for a plan:

  1. You polish up this PR (it looks like there are some doc comments to update)
  2. I will file a follow on ticket to make a test with multiple datapages in a row group
  3. Once we have done a few more types, I'll file a bunch more tickets to fill out the DataTypes as we did for row group

@alamb
Thanks for the early review here. Your plan how to proceed sounds good to me.

Perhaps we can do the test with multiple datapages in this PR? So we have a correct version in place before merging and filing out tickets for more datatype support? However, I could use some pointers here, haven't found a working setting to produce mutliple data pages per row group yet.

So I'd like to finish polishing this PR with adding docs and the extra test setup for multiple datapages.

Some other todos which I have on my list (which could all be follow up PRs):

  • support other datatypes (+ add tests -> test_data_page_statistics:bool = true)
  • integrate new API in page_filter.rs
  • add benchmark

@alamb
Copy link
Contributor

alamb commented Jun 12, 2024

Thanks @marvinlanhenke -- this PR is very much on my list. I hope to find time tomorrow to help get this one ready (e.g. help write the tests) as well as do some ticket cleanup so the current state of statistics extraction code is cleraer

@marvinlanhenke
Copy link
Contributor Author

Thanks @alamb and no worries, I know you are traveling so this can wait.

@alamb
Copy link
Contributor

alamb commented Jun 14, 2024

I plan to sort this PR and follow on tickets out tomorrow

@alamb alamb changed the title WIP: Extract parquet data page statistics Initial Extract parquet data page statistics API Jun 14, 2024
@alamb
Copy link
Contributor

alamb commented Jun 14, 2024

Ok, I spent some time working on a test that had multiple data pages and have checked it in in.

@marvinlanhenke any chance you can give this PR another reveiew to see if you think it is ready?

@alamb alamb marked this pull request as ready for review June 14, 2024 23:10
};

let iter = row_group_indices.into_iter().map(|rg_index| {
let column_page_index_per_row_group_per_column =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow those structures are hard to use 🤯 -- seems like having an accessor would help a lot. Something to consider upstream maybe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes a lot of sense.

@alamb
Copy link
Contributor

alamb commented Jun 14, 2024

I think once we are happy with this PR we can merge it in and then I'll file tickets for filling out the other types.

I feel like we may be able to make the tests a little better too 🤔

}
.build();

// Data layout looks like this:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Contributor Author

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Thanks for your help on the test. I think its nearly ready, left a question about the test setup. Unfortunately I'm pretty busy this weekend; so it might take till monday before I can work on this.

@alamb
Copy link
Contributor

alamb commented Jun 15, 2024

Thanks for your help on the test. I think its nearly ready, left a question about the test setup. Unfortunately I'm pretty busy this weekend; so it might take till monday before I can work on this.

No worries -- thanks @marvinlanhenke ! I'll merge this PR in and we can keep iterating as some follow ons

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again

@alamb
Copy link
Contributor

alamb commented Jun 15, 2024

🚀

@alamb alamb merged commit 2f43476 into apache:main Jun 15, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: enable page statistics

* feat: prototype int64 data_page_min

* feat: prototype MinInt64DataPageStatsIterator

* feat: add make_data_page_stats_iterator macro

* feat: add get_data_page_statistics macro

* feat: add MaxInt64DataPageStatsIterator

* feat: add test_data_page_stats param

* chore: add testcase int64_with_nulls

* feat: add data page null_counts

* fix: clippy

* chore: rename column_page_index

* feat: add data page row counts

* feat: add num_data_pages to iterator

* chore: update docs

* fix: use colum_offset len in data_page_null_counts

* fix: docs

* tweak comments

* update test helper

* Add explicit multi-data page tests to statistics test

* Add explicit data page test

* remove duplicate test

* update coverage

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Efficiently and correctly Extract Page Index statistics into ArrayRefs
2 participants