Derive filter statistic estimates from the predicate expression #4162
Conversation
#[ignore]
// This test requires propagation of column boundaries from the comparison analysis
// to the analysis context. This is not yet implemented.
async fn test_filter_statistics_column_level_basic_expr() -> Result<()> {
@alamb while working on this, I've noticed the first application of propagating new column boundaries. Since we don't have an API to represent changes to the boundaries during an expression's analysis (like a becoming [1, 25] in the example below), we can't generate the column_statistics, which essentially renders nested join optimizations unusable (and potentially any other analysis that needs column-level stats).
This doesn't mean it is completely ineffective as is, since we can at least find the cardinality of the filter itself and do the local filter <-> table switch in the case below. But I think it might make sense to at least investigate potential ways to deal with this.
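To make the "cardinality of the filter itself" point concrete, here is a minimal standalone sketch (hypothetical types, not the DataFusion API) of how a narrowed interval yields a selectivity even when the new boundaries are never written back:

```rust
/// Hypothetical closed interval over an integer column (stand-in for the
/// interval type used by the real comparison analysis).
#[derive(Debug, Clone, Copy)]
struct Interval {
    min: i64,
    max: i64,
}

impl Interval {
    fn width(&self) -> f64 {
        (self.max - self.min + 1) as f64
    }
}

/// Fraction of the column's known range that survives the predicate,
/// assuming a uniform value distribution.
fn selectivity(column_bounds: Interval, narrowed: Interval) -> f64 {
    narrowed.width() / column_bounds.width()
}

fn main() {
    // Column `a` is known to lie in [1, 100]; analyzing `a <= 25` narrows it to [1, 25].
    let a_bounds = Interval { min: 1, max: 100 };
    let narrowed = Interval { min: 1, max: 25 };
    // The filter's own selectivity (0.25) is recoverable, even though the
    // narrowed boundaries of `a` are not propagated to any column_statistics.
    assert_eq!(selectivity(a_bounds, narrowed), 0.25);
}
```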
I have a simple solution for this problem (isidentical#5) that essentially implements a much narrower-scoped version of the apply() API from the previous iteration. It doesn't add any new methods to the physical expressions, but it still shares a mutable context reference (which reminds me of other similar APIs in DataFusion like expr_to_columns), so I'm not sure whether the same reservations still apply. I'd be really interested in your feedback on this.
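To illustrate the pattern being referred to (a hypothetical standalone sketch, not the actual code in isidentical#5): the analysis walks the predicate and records narrowed column boundaries into a mutable context it is handed, much like expr_to_columns accumulates into a mutable set, instead of adding a new method to the physical expression trait.

```rust
use std::collections::HashMap;

/// Hypothetical per-column boundary, standing in for the interval type used
/// by the real analysis.
#[derive(Debug, Clone)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Greatly simplified analysis of a `column <= literal` comparison: any
/// narrowing it discovers is written into the shared, mutable context.
fn analyze_less_eq(column: &str, literal: i64, context: &mut HashMap<String, Bounds>) {
    if let Some(bounds) = context.get_mut(column) {
        bounds.max = bounds.max.min(literal);
    }
}

fn main() {
    let mut context = HashMap::from([("a".to_string(), Bounds { min: 1, max: 100 })]);
    // Analyzing `a <= 25` narrows a's boundaries to [1, 25] inside the shared
    // context, where later column-level statistics generation could pick them up.
    analyze_less_eq("a", 25, &mut context);
    assert_eq!(context["a"].min, 1);
    assert_eq!(context["a"].max, 25);
}
```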
I don't understand what about this test requires column-level analysis -- your figure has a join in it, but this test just seems to be the same as test_filter_statistics_basic_expr above it. I will look at isidentical#5 shortly.
Thanks @isidentical -- I plan to review this carefully later today or tomorrow.
This looks great @isidentical -- thank you.
I am sorry it took so long to review this PR (especially as you have made it small to make the reviewing easier)
I plan to leave it open for a day prior to merging
cc @Dandandan and @mingmwang who I think were also interested in this feature
num_rows: input_stats
    .num_rows
    .map(|num_rows| (num_rows as f64 * selectivity).ceil() as usize),
..Default::default()
I wonder if we should explicitly list out is_exact: false here? Default::default() gets the same result, but maybe being explicit would be better 🤔
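For reference, a sketch of the explicit form being suggested, using a stand-in Statistics struct so the snippet is self-contained (the real DataFusion struct has more fields):

```rust
/// Stand-in for DataFusion's Statistics, reduced to the fields relevant here.
#[derive(Debug, Default)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
    is_exact: bool,
}

fn filter_statistics(input_stats: &Statistics, selectivity: f64) -> Statistics {
    Statistics {
        num_rows: input_stats
            .num_rows
            .map(|num_rows| (num_rows as f64 * selectivity).ceil() as usize),
        // Spelled out rather than left to Default::default(): an estimated
        // row count is by definition not exact.
        is_exact: false,
        ..Default::default()
    }
}

fn main() {
    let input = Statistics { num_rows: Some(100), ..Default::default() };
    let estimated = filter_statistics(&input, 0.25);
    assert_eq!(estimated.num_rows, Some(25));
    assert_eq!(estimated.total_byte_size, None);
    assert!(!estimated.is_exact);
}
```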
Arc::new(FilterExec::try_new(predicate, input)?);

let statistics = filter.statistics();
assert_eq!(statistics.num_rows, Some(25));
👨🍳 👌
Very nice
assert_eq!(Statistics::default(), physical_plan.statistics());
let stats = physical_plan.statistics();
assert!(!stats.is_exact);
assert_eq!(stats.num_rows, Some(1));
Nice
🚀
I didn't notice the reviews until now, thanks a lot @alamb for them! Regarding the column-level problem, I'll try to write back to you soon on my fork (is it OK to keep it there until we have a design? Then I can create a PR directly against apache/arrow-datafusion).
Benchmark runs are scheduled for baseline = 74199d6 and contender = f0359a7. f0359a7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3845.
Rationale for this change
Estimating a filter's cardinality lets us populate the physical plan with more statistics and leverage cost-based optimizations.
What changes are included in this PR?
Physical filter operations can now estimate their output cardinality when the predicate's selectivity and the input's cardinality are known. The selectivity analysis is pretty limited at the moment, but every addition to it will automatically be used here, so this should be all the code we need for estimating a filter's cardinality during planning.
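For illustration, a minimal sketch of the estimate this boils down to (the helper name is hypothetical; the formula matches the FilterExec change shown earlier in the conversation): the output row count is the input row count scaled by the predicate's selectivity, rounded up.

```rust
/// Hypothetical helper mirroring the estimate:
/// output rows = ceil(input rows * selectivity).
fn estimate_filtered_rows(input_rows: Option<usize>, selectivity: f64) -> Option<usize> {
    input_rows.map(|n| (n as f64 * selectivity).ceil() as usize)
}

fn main() {
    // 100 input rows and a predicate with selectivity 0.25 -> about 25 output rows,
    // matching the Some(25) assertion in the new test.
    assert_eq!(estimate_filtered_rows(Some(100), 0.25), Some(25));
    // If the input cardinality is unknown, the estimate stays unknown.
    assert_eq!(estimate_filtered_rows(None, 0.5), None);
}
```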
Are these changes tested?
Yes.
Are there any user-facing changes?
More statistics.