-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add median and count_distinct aggregation functions #278
feat: add median and count_distinct aggregation functions #278
Conversation
e641541
to
4a4921c
Compare
@jvanstraten would it be better introduce an option like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot to update this.
would it be better introduce an option like precision or and introduce EXACT and APPROXIMATE instead of writing two separate functions
I don't really have an opinion, but unless there's a statistical definition I'm not aware of, I do believe we need to specify what "approximate" means. Is 100.0 an acceptable approximate median of [101.2, 101.3, 101.4]? Probably not, so where's the limit?
extensions/functions_arithmetic.yaml
Outdated
@@ -742,6 +742,105 @@ aggregate_functions: | |||
- value: fp64 | |||
nullability: DECLARED_OUTPUT | |||
return: fp64? | |||
- name: "median" | |||
description: Calculate the median for a set of values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this should include something like "Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, they specify the usual floating point rounding mode."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the description
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
@vibhatha Please extend the descriptions. Here's a good example of the level of descriptiveness I'm looking for: substrait/extensions/functions_string.yaml Lines 571 to 608 in add698f
Also, IIRC we went for an option for approximate vs |
@jvanstraten we had a function for I am also wondering can we reference such options under a general options attribute. Rather than re-writing the description and definitions over and over again. It could save spacing and readability. I am not sure if this is possible, but just a suggestion. Not asking about array of items, just operator_options:
- population
- precision
- ....
And the |
4c30d4a
to
d2dc8de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also wondering can we reference such options under a general options attribute. Rather than re-writing the description and definitions over and over again. It could save spacing and readability.
We've run into similar things with the YAMLs before, and the conclusion has always been that it's easier on consumers of the YAML files to just repeat everything. If at some point it becomes really annoying we could always just generate the YAML files. However, the descriptions of the standard options we use all over the core extensions (like overflow, rounding, and I guess precision and population) could IMO just be defined on the website instead of in the YAMLs. A computer that's parsing these YAML files won't care about the description anyway. There's no page for this yet, but I'm planning to rewrite/add to the extension pages of the website soon, and there will absolutely be a section for this. So I'm fine with just omitting the descriptions of precision
for now.
extensions/functions_arithmetic.yaml
Outdated
on saving memory bandwidth, the precision of the end result can be | ||
the highest possible accuracy of an approximation. | ||
|
||
- EXACT: provides the highest accurate output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- EXACT: provides the highest accurate output | |
- EXACT: provides the exact result, rounded if needed according | |
to the rounding option |
extensions/functions_arithmetic.yaml
Outdated
the highest possible accuracy of an approximation. | ||
|
||
- EXACT: provides the highest accurate output | ||
- APPROXIMATE: provides a sub-optimal output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still doesn't specify how approximate the result may be. Is this what you mean, or is this too broad?
- APPROXIMATE: provides a sub-optimal output | |
- APPROXIMATE: provides only an estimate; the result must lie | |
between the minimum and maximum values in the input | |
(inclusive), but otherwise the accuracy is left up to the | |
consumer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to mean this one, but may be we could elaborate this a bit?
Although, this is broad since the optimization strategy is very hard to specify. Again I think the choice of approximation is up to the engine to decide and that depends on various optimization techniques. But that's not the job of Substrait to define that optimization strategy. Substrait can only specify that it is going to be an approximation. So I wanted to put forward that idea.
Should we enhance more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that we can do better than what I suggested to constrain the approximation method further, but you're welcome to try. I at least can't think of any approximate median algorithm that doesn't at least satisfy that constraint, and only trivial constraints is better than no constraints, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the suggested once for now. I will update this.
I get your point. I think this sounds good for now. |
e7e68fc
to
7db8984
Compare
Sorry, just saw this come through. I think we should revert the addition of count distinct here. Distinct isn't generally a property of the function, it is a property of the aggregate. That's why we have distinct as a property of function invocation. You achieve count distinct by combining count + that property. |
Oops, you're right, I didn't think of that. #311 |
This PR includes yaml configs for
median
,approx_median
andcount_distinct
for aggregate functions.