Derive filter statistic estimates from the predicate expression #4162

isidentical · 2022-11-09T23:27:24Z

Which issue does this PR close?

Closes #3845.

Rationale for this change

Being able to estimate a filter's cardinality and therefore populate the physical plan with more statistics to leverage cost based optimizations.

What changes are included in this PR?

Physical filter operations can now estimate their end cardinality when we know the predicate's selectivity and the cardinality of the input. The selectivity analysis is pretty limited at the moment, but the good part is that, every addition to the selectivity analysis will be automatically used here so this should be all the code we need for estimating filter's cardinality during planning.

Are these changes tested?

Yes.

Are there any user-facing changes?

More statistics.

isidentical · 2022-11-11T23:33:48Z

datafusion/core/src/physical_plan/filter.rs

+    #[ignore]
+    // This test requires propagation of column boundaries from the comparison analysis
+    // to the analysis context. This is not yet implemented.
+    async fn test_filter_statistics_column_level_basic_expr() -> Result<()> {


@alamb while working on this, I've noticed the initial application of propagation of new column limits. Since we don't have an API to represent changes to the boundaries during an expression's analysis (like a becomes [1, 25] in the example below) we can't generate the column_statistics which is essentially rendering nested join optimizations unusable (and potentially any other analysis that needs column level stats).

This doesn't mean it is completely ineffecttive as is, since we can at least find the cardinality of filter itself and do the local filter <-> table switch in the case below. But I think it might make sense to at least investigate potential ways to deal with this.

I have a simple solution for this problem (isidentical#5) that essentially implements a much more narrow-scoped version of the apply() API from the previous iteration. It doesn't add any new methods to the physical expressions, but it still shares a mutable context reference (I kind of resonate this with other similiar APIs in datafusion like expr_to_columns) so not sure if the same reservations still apply. I'd be really interested in your feedback on this.

I don't understand what about this test requires column level analysis -- your figure has a join in it, but thietest just seems to be the same as test_filter_statistics_basic_expr above it. I will look at isidentical#5 shortly

alamb · 2022-11-12T13:33:55Z

Thanks @isidentical -- I plan to review this carefully later today or tomorrow

alamb

This looks great @isidentical -- thank you.

I am sorry it took so long to review this PR (especially as you have made it small to make the reviewing easier)

I plan to leave it open for a day prior to merging

cc @Dandandan and @mingmwang who I think were also interested in this feature

alamb · 2022-11-15T19:15:40Z

datafusion/core/src/physical_plan/filter.rs

+                num_rows: input_stats
+                    .num_rows
+                    .map(|num_rows| (num_rows as f64 * selectivity).ceil() as usize),
+                ..Default::default()


I wonder if we should explicitly list out is_exact: false here? Default::default() gets the same result but maybe being explicit would be better 🤔

alamb · 2022-11-15T19:17:18Z

datafusion/core/src/physical_plan/filter.rs

+            Arc::new(FilterExec::try_new(predicate, input)?);
+
+        let statistics = filter.statistics();
+        assert_eq!(statistics.num_rows, Some(25));


👨‍🍳 👌

Very nice

alamb · 2022-11-15T19:19:00Z

datafusion/core/src/physical_plan/filter.rs

+    #[ignore]
+    // This test requires propagation of column boundaries from the comparison analysis
+    // to the analysis context. This is not yet implemented.
+    async fn test_filter_statistics_column_level_basic_expr() -> Result<()> {


I don't understand what about this test requires column level analysis -- your figure has a join in it, but thietest just seems to be the same as test_filter_statistics_basic_expr above it. I will look at isidentical#5 shortly

alamb · 2022-11-15T19:19:28Z

datafusion/core/tests/statistics.rs

-    assert_eq!(Statistics::default(), physical_plan.statistics());
+    let stats = physical_plan.statistics();
+    assert!(!stats.is_exact);
+    assert_eq!(stats.num_rows, Some(1));


alamb · 2022-11-16T21:36:21Z

🚀

isidentical · 2022-11-16T21:38:18Z

Didn't notice the reviews, thank you a lot @alamb for them! Re the problem on column level, I'll try to write back to you soon on my fork (is it OK to keep it there until we have a design, then I can create a PR directly against apache/arrow-datafusion)

ursabot · 2022-11-16T21:44:32Z

Benchmark runs are scheduled for baseline = 74199d6 and contender = f0359a7. f0359a7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Nov 9, 2022

isidentical force-pushed the gh-3845-phase-2 branch 2 times, most recently from c44fc7e to d61391a Compare November 10, 2022 00:17

Derive filter statistic estimates from the predicate expression

0896313

isidentical force-pushed the gh-3845-phase-2 branch from d61391a to 0896313 Compare November 11, 2022 23:26

isidentical marked this pull request as ready for review November 11, 2022 23:27

isidentical commented Nov 11, 2022

View reviewed changes

alamb approved these changes Nov 15, 2022

View reviewed changes

alamb merged commit f0359a7 into apache:master Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Derive filter statistic estimates from the predicate expression #4162

Derive filter statistic estimates from the predicate expression #4162

isidentical commented Nov 9, 2022

isidentical Nov 11, 2022

isidentical Nov 11, 2022 •

edited

Loading

alamb Nov 15, 2022

alamb commented Nov 12, 2022

alamb left a comment

alamb Nov 15, 2022

alamb Nov 15, 2022

alamb Nov 15, 2022

alamb Nov 15, 2022

alamb commented Nov 16, 2022

isidentical commented Nov 16, 2022

ursabot commented Nov 16, 2022

Derive filter statistic estimates from the predicate expression #4162

Derive filter statistic estimates from the predicate expression #4162

Conversation

isidentical commented Nov 9, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

isidentical Nov 11, 2022

Choose a reason for hiding this comment

isidentical Nov 11, 2022 • edited Loading

Choose a reason for hiding this comment

alamb Nov 15, 2022

Choose a reason for hiding this comment

alamb commented Nov 12, 2022

alamb left a comment

Choose a reason for hiding this comment

alamb Nov 15, 2022

Choose a reason for hiding this comment

alamb Nov 15, 2022

Choose a reason for hiding this comment

alamb Nov 15, 2022

Choose a reason for hiding this comment

alamb Nov 15, 2022

Choose a reason for hiding this comment

alamb commented Nov 16, 2022

isidentical commented Nov 16, 2022

ursabot commented Nov 16, 2022

isidentical Nov 11, 2022 •

edited

Loading