-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed changes for more flexible user defined Aggregate and window functions #12
Proposed changes for more flexible user defined Aggregate and window functions #12
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mustafasrepo, I really like this PR and its direction 👍 I have one open question but otherwise I think this is awesome
Open question
How does include_rank
(link) fit into this model?
Suggested next step:
- Make a PR to apache/arrow-datafusion with the changes to PartitionEvaluator that are in this PR (I believe you plan to do that
I will open the PR that unify evaluate and evaluate_stateful fields on the main repo once it is ready.
) - I will work on various tests / examples for WindowUDF on RFC: User Defined Window Functions apache/datafusion#6617 (which I will port to use the new API when it is ready)
Comments / Responses
First of all current state is a bit complex for end user to handle. After examining the PartitionEvaluator trait we have decided that evaluate_stateful and evaluate_inside_range can be combined. Its new name is evaluate with the following API ..
I also had this observation and I think your solution is very elegant 👍
With the new API we have following options for the end user?
I really like the tabular format of this analysis and it makes sense to me. Adding that table to the comments of PartitionEvaluator
would really help people understand it.
In short, I think with the current approach in apache#6617. We are in very good shape (I will simplify evaluate logic with another PR).
I agree.
Thank you so much for all your help!
/// If this function returns true, [`Self::create_evaluator`] must | ||
/// implement [`PartitionEvaluator::evaluate`] | ||
fn supports_bounded_execution(&self) -> bool { | ||
false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea that the special case forinclude_rank
would also be an option here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When include_rank
flag is true
, evaluate_with_rank_all
will be called. This method is basically same with evaluate_all
in the spirit. (It takes all the data and produces all the output in single pass). However, since evaluate_with_rank_all
requires additional arguments (such as rank boundaries). We do not unify their API, to not recalculate rank boundaries each time (even if we do not use them).
Certainly, we can move this trait to PartitionEvaluator
also. However, I thought this would be confusing. Hence didn't move it. (I will think about how to combine evaluate_with_rank_all
and evaluate_all
without calculating rank boundaries unnecessarily).
Maybe we can present to the user just a subset of the PartitionEvaluator
methods. They wouldn't see evaluate_with_rank_all
either (Just like your suggestion in option 2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For anyone following along, I think we went with the evaluate_with_rank_all
approach: https://github.com/apache/arrow-datafusion/blob/c3d5d77e447e51c2cca814a67706e5ab3e050ced/datafusion/physical-expr/src/window/partition_evaluator.rs#L209-L217
@@ -155,8 +155,9 @@ impl WindowExpr for PlainAggregateWindowExpr { | |||
} | |||
|
|||
fn uses_bounded_memory(&self) -> bool { | |||
self.aggregate.supports_bounded_execution() | |||
&& !self.window_frame.end_bound.is_unbounded() | |||
let supports_bounded_execution = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make this check explicit by adding the API described by @stuartcarnie here: apache#6611 (so that the accumulator can report on its capabilities)
} | ||
} | ||
|
||
// TODO show how to use other evaluate methods | ||
/// These different evaluation methods are called depending on the various settings of WindowUDF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I have opened the PR for stage 1. It can be found in the #6655 |
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (#12) * Configurable data type instead of flag for Utf8 unparsing * Fix type in comment
Which issue does this PR close?
Closes #.
Rationale for this change
In #6617. We have discussed how to make user defined aggregate and window functions more flexible.
First of all current state is a bit complex for end user to handle. After examining the
PartitionEvaluator
trait we have decided thatevaluate_stateful
andevaluate_inside_range
can be combined. Its new name isevaluate
with the following APIfn evaluate(&mut self,_values: &[ArrayRef],_range: &Range<usize>,) -> Result<ScalarValue>
(Existingevaluate
is renamed withevaluate_all
to reflect better what function does).It returns a single
ScalarValue
for the given input which is the result of window function (If functionuses_window_frame
result calculated according to given range).Existing
fn evaluate(&self, _values: &[ArrayRef], _num_rows: usize) -> Result<ArrayRef>
is replaced byfn evaluate_all(&mut self, values: &[ArrayRef], num_rows: usize) -> Result<ArrayRef>
.This function receives whole of the data as single batch and produces all of the window result as bulk, which maybe more optimal for some use cases.
With the new API we have following options for the end user?
evaluate_all
(if we were to implementPERCENT_RANK
it would end up in this quadrant, we cannot produce any result without seeing whole data)evaluate
(optionally can also implementevaluate_all
for more optimized implementation. However, there will be default implementation that is suboptimal) . If we were to implementROW_NUMBER
it will end up in this quadrant. ExampleOddRowNumber
showcases this use caseevaluate
(I think as long asuses_window_frame
istrue
. There is no way forsupports_bounded_execution
to be false). I couldn't come up with any example for this quadrantevaluate
. If we were to implementFIRST_VALUE
, it would end up in this quadrantTo support end user to set flag
uses_window_frame
andsupports_bounded_execution
. I have moved these methods fromBuiltInWindowFunctionExpr
toPartitionEvaluator
. However, in the following commit @alamb could find another way to add this support (I think his version is better. However, since this is showcase PR for new API, I didn't bother with retracting changes.).In short, I think with the current approach in #6617. We are in very good shape (I will simplify evaluate logic with another PR). Hopefully, after these changes end user, by setting
uses_window_frame
andsupports_bounded_execution
properly. Then implementing corresponding evaluator (evaluate
orevaluate_all
) will be able to accomplish desired behavior for most of the use casesWhat changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?