Enrich filter statistics with known column boundaries #4519

isidentical · 2022-12-05T20:52:18Z

Which issue does this PR close?

Closes #4518.

Rationale for this change

Propagating column statistics through filter estimations should, in theory, unlock estimations in all parent operators, which in turn should benefit us greatly when dealing with nested joins.

What changes are included in this PR?

The major change here is the new analysis context API, which lets expression boundaries be attached to the context alongside the information it already holds (column boundaries). With this in hand, we can derive column statistics for a filter's result, as well as for every other use case where expression analysis is used.
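To make the idea concrete, here is a minimal sketch of a context that carries column boundaries together with the result of analyzing one expression. All names here (`Interval`, `AnalysisContext`, `analyze_lt_eq`) are hypothetical and heavily simplified; they are not the types or functions added in this PR, and the sketch assumes uniformly distributed integer values.

```rust
// Hypothetical, simplified types: NOT the actual DataFusion definitions.

/// Known inclusive value range of a column.
#[derive(Debug, Clone, PartialEq)]
pub struct Interval {
    pub min: i64,
    pub max: i64,
}

/// Analysis context: per-column boundaries plus, once an expression has
/// been analyzed, the estimated selectivity of that expression.
#[derive(Debug, Clone, PartialEq)]
pub struct AnalysisContext {
    pub column_boundaries: Vec<Option<Interval>>,
    pub selectivity: Option<f64>,
}

/// Analyze the predicate `column[index] <= literal`.
/// Assumes integer values spread uniformly over the known interval.
pub fn analyze_lt_eq(ctx: &AnalysisContext, index: usize, literal: i64) -> AnalysisContext {
    let bounds = match ctx.column_boundaries.get(index) {
        Some(Some(b)) => b.clone(),
        // Unknown column range: nothing can be inferred.
        _ => {
            return AnalysisContext {
                column_boundaries: ctx.column_boundaries.clone(),
                selectivity: None,
            }
        }
    };

    let (narrowed, selectivity) = if literal < bounds.min {
        // No value can satisfy the predicate.
        (bounds.clone(), 0.0)
    } else if literal >= bounds.max {
        // Every value satisfies the predicate.
        (bounds.clone(), 1.0)
    } else {
        // Only part of the range survives: shrink the upper bound.
        let kept = (literal - bounds.min + 1) as f64;
        let total = (bounds.max - bounds.min + 1) as f64;
        (Interval { min: bounds.min, max: literal }, kept / total)
    };

    let mut column_boundaries = ctx.column_boundaries.clone();
    column_boundaries[index] = Some(narrowed);
    AnalysisContext { column_boundaries, selectivity: Some(selectivity) }
}
```

With a column known to lie in [1, 100], analyzing `col <= 25` yields a context whose column range is [1, 25] and whose selectivity is 0.25. A filter could then report the narrowed range as the column statistics of its output and scale its row estimate by the selectivity.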

The initial design passed a single &mut Context around and returned ExprBoundaries as we do now, but thanks to @alamb's suggestion on my fork we can achieve a similar thing (albeit a bit more verbosely) while still keeping the analysis process mut-free.
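The difference between the two styles can be sketched as follows. This is only an illustration under assumed, simplified types; `Context`, `narrow_max_in_place`, and `with_narrowed_max` are hypothetical names, and the signatures in the PR may well differ.

```rust
// Hypothetical, simplified types for illustration only.

/// Per-column (min, max) bounds; `None` when a column's range is unknown.
#[derive(Debug, Clone, PartialEq)]
pub struct Context {
    pub column_bounds: Vec<Option<(i64, i64)>>,
}

/// Style 1: a shared context is mutated in place through `&mut`.
pub fn narrow_max_in_place(ctx: &mut Context, index: usize, new_max: i64) {
    if let Some(Some((_, max))) = ctx.column_bounds.get_mut(index) {
        *max = (*max).min(new_max);
    }
}

/// Style 2: the context is consumed and an updated one is returned,
/// so no `&mut` reference is needed anywhere in the analysis.
pub fn with_narrowed_max(ctx: Context, index: usize, new_max: i64) -> Context {
    let column_bounds = ctx
        .column_bounds
        .into_iter()
        .enumerate()
        .map(|(i, bounds)| match bounds {
            // Only the targeted column with known bounds is narrowed.
            Some((min, max)) if i == index => Some((min, max.min(new_max))),
            other => other,
        })
        .collect();
    Context { column_bounds }
}
```

Both functions produce the same bounds. The second is slightly more verbose at call sites, since each step has to thread the returned context into the next one, but the input is never modified behind the caller's back.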

Are these changes tested?

Yes

Are there any user-facing changes?

No. This does break the expression analysis API, but that API was specifically marked as experimental for cases like this (it is still evolving).

jackwener · 2022-12-06T11:15:11Z

I plan to review this later. Thanks @isidentical. ❤️

alamb · 2022-12-06T21:15:50Z

I plan to review this carefully tomorrow morning

alamb

Looks like a very nice improvement to me -- thank you @isidentical. I had a few minor suggestions, but nothing that would prevent this from merging in my opinion.

datafusion/core/src/physical_plan/filter.rs

datafusion/physical-expr/src/expressions/binary.rs

jackwener · 2022-12-08T15:50:43Z

A question that doesn't relate to this PR: why does Statistics store total_byte_size instead of column_avg_byte_size? The latter is what I generally see in other systems, although they have the same effect.

jackwener

LGTM, nice job. 👍

datafusion/physical-expr/src/expressions/binary.rs

Dandandan · 2022-12-08T16:01:31Z

> A question that doesn't relate to this PR: why does Statistics store total_byte_size instead of column_avg_byte_size? The latter is what I generally see in other systems, although they have the same effect.

I think I may have added this. It also makes sense to have the average column size, although each can easily be derived from the other if you have the estimated number of rows.
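As a rough sketch of that conversion (the helper names `average_byte_size` and `total_from_average` are hypothetical and not part of DataFusion; the `Option` arguments assume that either statistic may be unknown):

```rust
/// Average size in bytes per row, derived from a total size and a row count.
/// Returns `None` when either input is unknown or the row count is zero.
pub fn average_byte_size(total_byte_size: Option<usize>, num_rows: Option<usize>) -> Option<f64> {
    match (total_byte_size, num_rows) {
        (Some(total), Some(rows)) if rows > 0 => Some(total as f64 / rows as f64),
        _ => None,
    }
}

/// The other direction: total size derived from an average and a row count.
pub fn total_from_average(avg_byte_size: Option<f64>, num_rows: Option<usize>) -> Option<usize> {
    match (avg_byte_size, num_rows) {
        (Some(avg), Some(rows)) => Some((avg * rows as f64).round() as usize),
        _ => None,
    }
}
```

Either representation carries the same information as long as the row estimate is available; without it, neither value can be recovered from the other.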

isidentical · 2022-12-09T20:40:25Z

Thanks a lot for the reviews, everyone (and sorry for the delay). I added a new test and handled the other suggestions, so hopefully it should be good to go!

alamb

Looks great to me -- thanks @isidentical

alamb · 2022-12-10T10:36:37Z

datafusion/core/src/physical_plan/filter.rs

+        assert_eq!(
+            statistics.column_statistics,
+            Some(vec![ColumnStatistics {
+                min_value: Some(ScalarValue::Int32(Some(10))),


github-actions bot added the core (Core DataFusion crate) and physical-expr (Physical Expressions) labels Dec 5, 2022

isidentical marked this pull request as ready for review December 5, 2022 20:53

alamb approved these changes Dec 7, 2022


jackwener approved these changes Dec 8, 2022


datafusion/physical-expr/src/expressions/binary.rs (outdated, resolved)

isidentical added 3 commits December 9, 2022 23:29

Propagation of column boundary changes in subexpressions

25257ff

Mut-free re-implementation

2e55f59

New test & minor code suggestions

629c6e4

isidentical force-pushed the gh-3845-phasse-3 branch from 12d0cc6 to 629c6e4 Compare December 9, 2022 20:40

alamb approved these changes Dec 10, 2022


alamb merged commit 8966eac into apache:master Dec 10, 2022
