Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Derive filter statistic estimates from the predicate expression #4162

Merged
merged 1 commit into from
Nov 16, 2022

Conversation

isidentical
Copy link
Contributor

Which issue does this PR close?

Closes #3845.

Rationale for this change

Being able to estimate a filter's cardinality and therefore populate the physical plan with more statistics to leverage cost based optimizations.

What changes are included in this PR?

Physical filter operations can now estimate their end cardinality when we know the predicate's selectivity and the cardinality of the input. The selectivity analysis is pretty limited at the moment, but the good part is that, every addition to the selectivity analysis will be automatically used here so this should be all the code we need for estimating filter's cardinality during planning.

Are these changes tested?

Yes.

Are there any user-facing changes?

More statistics.

@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Nov 9, 2022
@isidentical isidentical force-pushed the gh-3845-phase-2 branch 2 times, most recently from c44fc7e to d61391a Compare November 10, 2022 00:17
@isidentical isidentical marked this pull request as ready for review November 11, 2022 23:27
#[ignore]
// This test requires propagation of column boundaries from the comparison analysis
// to the analysis context. This is not yet implemented.
async fn test_filter_statistics_column_level_basic_expr() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb while working on this, I've noticed the initial application of propagation of new column limits. Since we don't have an API to represent changes to the boundaries during an expression's analysis (like a becomes [1, 25] in the example below) we can't generate the column_statistics which is essentially rendering nested join optimizations unusable (and potentially any other analysis that needs column level stats).

This doesn't mean it is completely ineffecttive as is, since we can at least find the cardinality of filter itself and do the local filter <-> table switch in the case below. But I think it might make sense to at least investigate potential ways to deal with this.

image

Copy link
Contributor Author

@isidentical isidentical Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a simple solution for this problem (isidentical#5) that essentially implements a much more narrow-scoped version of the apply() API from the previous iteration. It doesn't add any new methods to the physical expressions, but it still shares a mutable context reference (I kind of resonate this with other similiar APIs in datafusion like expr_to_columns) so not sure if the same reservations still apply. I'd be really interested in your feedback on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what about this test requires column level analysis -- your figure has a join in it, but thietest just seems to be the same as test_filter_statistics_basic_expr above it. I will look at isidentical#5 shortly

@alamb
Copy link
Contributor

alamb commented Nov 12, 2022

Thanks @isidentical -- I plan to review this carefully later today or tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @isidentical -- thank you.

I am sorry it took so long to review this PR (especially as you have made it small to make the reviewing easier)

I plan to leave it open for a day prior to merging

cc @Dandandan and @mingmwang who I think were also interested in this feature

num_rows: input_stats
.num_rows
.map(|num_rows| (num_rows as f64 * selectivity).ceil() as usize),
..Default::default()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should explicitly list out is_exact: false here? Default::default() gets the same result but maybe being explicit would be better 🤔

Arc::new(FilterExec::try_new(predicate, input)?);

let statistics = filter.statistics();
assert_eq!(statistics.num_rows, Some(25));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👨‍🍳 👌

Very nice

#[ignore]
// This test requires propagation of column boundaries from the comparison analysis
// to the analysis context. This is not yet implemented.
async fn test_filter_statistics_column_level_basic_expr() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what about this test requires column level analysis -- your figure has a join in it, but thietest just seems to be the same as test_filter_statistics_basic_expr above it. I will look at isidentical#5 shortly

assert_eq!(Statistics::default(), physical_plan.statistics());
let stats = physical_plan.statistics();
assert!(!stats.is_exact);
assert_eq!(stats.num_rows, Some(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@alamb alamb merged commit f0359a7 into apache:master Nov 16, 2022
@alamb
Copy link
Contributor

alamb commented Nov 16, 2022

🚀

@isidentical
Copy link
Contributor Author

Didn't notice the reviews, thank you a lot @alamb for them! Re the problem on column level, I'll try to write back to you soon on my fork (is it OK to keep it there until we have a design, then I can create a PR directly against apache/arrow-datafusion)

@ursabot
Copy link

ursabot commented Nov 16, 2022

Benchmark runs are scheduled for baseline = 74199d6 and contender = f0359a7. f0359a7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement statistics estimation for FilterExec
3 participants