Enrich filter statistics with known column boundaries #4519

Merged
merged 3 commits into from
Dec 10, 2022


Contributor

@isidentical isidentical commented Dec 5, 2022

Which issue does this PR close?

Closes #4518.

Rationale for this change

Propagating column statistics through filter estimates should, in principle, unlock estimation in every parent operator, which in turn should help considerably when dealing with nested joins.

What changes are included in this PR?

The major change here is the new analysis context API, which can now carry an expression's boundaries alongside the information it already holds (column boundaries). With this in hand, we can derive column statistics for the filter's result, and for every other use case where expression analysis is used.
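A minimal sketch of the idea described above, not the code from this PR: a context that holds per-column boundaries and can optionally carry the boundaries of the analyzed expression, plus a helper that turns the column boundaries into output statistics for a filter. All types here are simplified, integer-only stand-ins (the real crate works with `ScalarValue`), and `ColumnStats` and `column_stats_from_context` are hypothetical names introduced only for illustration.

```rust
// Simplified, integer-only stand-ins; assumed shapes, not DataFusion's real types.

/// Known (or estimated) bounds of the values an expression can produce.
#[derive(Debug, Clone, PartialEq)]
pub struct ExprBoundaries {
    pub min_value: i64,
    pub max_value: i64,
}

/// Analysis context: boundaries for every input column, plus (optionally)
/// the boundaries of the expression that was just analyzed.
#[derive(Debug, Clone, PartialEq)]
pub struct AnalysisContext {
    pub column_boundaries: Vec<Option<ExprBoundaries>>,
    pub boundaries: Option<ExprBoundaries>,
}

impl AnalysisContext {
    /// Start with column boundaries only; no expression analyzed yet.
    pub fn new(column_boundaries: Vec<Option<ExprBoundaries>>) -> Self {
        Self { column_boundaries, boundaries: None }
    }

    /// Attach expression boundaries and return the updated context by value.
    pub fn with_boundaries(self, boundaries: Option<ExprBoundaries>) -> Self {
        Self { boundaries, ..self }
    }
}

/// Hypothetical per-column statistics a filter could report to its parent.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub min_value: Option<i64>,
    pub max_value: Option<i64>,
}

/// Convert the column boundaries left in the context into output statistics.
/// Columns without known boundaries yield `None` for both bounds.
pub fn column_stats_from_context(ctx: &AnalysisContext) -> Vec<ColumnStats> {
    ctx.column_boundaries
        .iter()
        .map(|bounds| ColumnStats {
            min_value: bounds.as_ref().map(|b| b.min_value),
            max_value: bounds.as_ref().map(|b| b.max_value),
        })
        .collect()
}
```

Because the column boundaries travel with the context, whatever the expression analysis narrows them to is exactly what the filter can hand to its parent operators.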

The initial design passed a single &mut Context around and returned ExprBoundaries as we do now, but thanks to @alamb's suggestion on my fork we can achieve something similar (albeit a bit more verbosely) while still keeping the analysis process mut-free.
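To illustrate what "mut-free" can look like in practice, here is a small sketch in which each analysis step consumes the context and returns a new one instead of writing through a shared `&mut` reference. The names `ColumnRange`, `Context`, `analyze_gt_eq`, `analyze_lt_eq`, and `analyze_between` are hypothetical and integer-only; this is an assumed shape for illustration, not the API added by this PR.

```rust
/// Inclusive range of values known for one column (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnRange {
    pub min: i64,
    pub max: i64,
}

/// Context that is handed from step to step by value.
#[derive(Debug, Clone, PartialEq)]
pub struct Context {
    pub columns: Vec<ColumnRange>,
}

impl Context {
    /// Build a new context in which `column` is replaced by `range`.
    fn with_column(self, column: usize, range: ColumnRange) -> Context {
        let columns = self
            .columns
            .into_iter()
            .enumerate()
            .map(|(i, old)| if i == column { range } else { old })
            .collect();
        Context { columns }
    }
}

/// Analyze `column >= literal`: raise the lower bound if the literal is higher.
/// Panics if `column` is out of range; empty ranges are not handled here.
pub fn analyze_gt_eq(ctx: Context, column: usize, literal: i64) -> Context {
    let old = ctx.columns[column];
    let narrowed = ColumnRange { min: old.min.max(literal), max: old.max };
    ctx.with_column(column, narrowed)
}

/// Analyze `column <= literal`: lower the upper bound if the literal is lower.
pub fn analyze_lt_eq(ctx: Context, column: usize, literal: i64) -> Context {
    let old = ctx.columns[column];
    let narrowed = ColumnRange { min: old.min, max: old.max.min(literal) };
    ctx.with_column(column, narrowed)
}

/// Analyze `column >= low AND column <= high` by threading the context
/// produced by the first comparison into the second.
pub fn analyze_between(ctx: Context, column: usize, low: i64, high: i64) -> Context {
    analyze_lt_eq(analyze_gt_eq(ctx, column, low), column, high)
}
```

The cost is the extra verbosity of passing and returning the context at every step; the benefit is that no step can observe a half-updated context, and a caller that still needs the original can simply clone it first.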

Are these changes tested?

Yes

Are there any user-facing changes?

No. This includes a breaking change to the expression analysis API, but that API was explicitly marked as experimental for cases like this (it is still evolving).

@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Dec 5, 2022
@isidentical isidentical marked this pull request as ready for review December 5, 2022 20:53
@jackwener
Member

I plan to review this later. Thanks @isidentical. ❤️

@alamb
Contributor

alamb commented Dec 6, 2022

I plan to review this carefully tomorrow morning

Contributor

@alamb alamb left a comment


Looks like a very nice improvement to me -- thank you @isidentical. I had a few minor suggestions, but nothing that would prevent this from merging, in my opinion.

@jackwener
Member

jackwener commented Dec 8, 2022

A question unrelated to this PR: why does Statistics store total_byte_size instead of column_avg_byte_size? The latter is what I generally see in other systems, although the two have the same effect.

Member

@jackwener jackwener left a comment


LGTM, nice job. 👍

@Dandandan
Contributor

A question unrelated to this PR: why does Statistics store total_byte_size instead of column_avg_byte_size? The latter is what I generally see in other systems, although the two have the same effect.

I think I may have added this. It also makes sense to have an average column size, although either value can easily be derived from the other if you have the estimated number of rows.
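The conversion mentioned above can be sketched as follows. This is only an illustration: `avg_byte_size` and `total_byte_size` are hypothetical helper names, and the sketch assumes both the byte total and the row count are optional estimates. It works at whatever granularity the byte total is tracked; a per-column average would need a per-column byte total, which is an assumption here.

```rust
/// Average bytes per row, derived from a total byte size and a row count.
/// Returns `None` when either input is unknown or the row count is zero.
pub fn avg_byte_size(total_byte_size: Option<usize>, num_rows: Option<usize>) -> Option<f64> {
    match (total_byte_size, num_rows) {
        (Some(total), Some(rows)) if rows > 0 => Some(total as f64 / rows as f64),
        _ => None,
    }
}

/// The inverse direction: total byte size from an average and a row count.
/// The result is rounded to the nearest whole byte.
pub fn total_byte_size(avg_byte_size: Option<f64>, num_rows: Option<usize>) -> Option<usize> {
    match (avg_byte_size, num_rows) {
        (Some(avg), Some(rows)) => Some((avg * rows as f64).round() as usize),
        _ => None,
    }
}
```

Either representation carries the same information only while the row estimate is available; once the row count is unknown, neither value can be recovered from the other.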

@isidentical
Contributor Author

Thanks a lot for the reviews, everyone (and sorry for the delay). I added a new test and addressed the other suggestions, so hopefully it is good to go!

Contributor

@alamb alamb left a comment


Looks great to me -- thanks @isidentical

assert_eq!(
statistics.column_statistics,
Some(vec![ColumnStatistics {
min_value: Some(ScalarValue::Int32(Some(10))),
Contributor


nice

@alamb alamb merged commit 8966eac into apache:master Dec 10, 2022
Development

Successfully merging this pull request may close these issues.

Enrich filter statistics predictions with estimated column boundaries
4 participants