-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enrich filter statistics with known column boundaries #4519
Conversation
I prepare to review in the later day. Thanks @isidentical .❤️ |
I plan to review this carefully tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a very nice improvement to me -- thank you @isidentical . I had a few minor suggestions but nothing that would prevent this from merging in my opinion
A question that don't relate with this PR. Why Statistics save |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a nice job.👍
I think I may have added this. It also makes sense to have average column size, although they can easily be derived if you have the estimated number of rows and the other way around. |
Thanks a lot for the reviews people (and sorry for the delay), added a new test and handled other suggestions so hopefully it should be good to go! |
12d0cc6
to
629c6e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look great to me -- thanks @isidentical
assert_eq!( | ||
statistics.column_statistics, | ||
Some(vec![ColumnStatistics { | ||
min_value: Some(ScalarValue::Int32(Some(10))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
Which issue does this PR close?
Closes #4518.
Rationale for this change
Allowing the propagation of column statistics in filter estimations theoretically unlocks all the parent estimations which in turn should benefit us greatly when dealing with nested joins.
What changes are included in this PR?
The major change here is the new analysis context API which allows it to be attached with an expression boundaries alongside the information it already holds (column boundaries). With this in our hand, we can derive the column statistics for filter's result and all other use cases where the expression analysis used.
The initial design was around sharing a single
&mut Context
around and returningExprBoundaries
as we do now, but thanks to @alamb's suggestion on my fork we can achieve a similar thing (albeit a bit more verbose) while still keeping the analysis process mut-free.Are these changes tested?
Yes
Are there any user-facing changes?
No. This includes a break in the expression analysis API but it was specifically marked as experimental for stuff like this (it is currently evolving).