-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add standard deviation functions #257
Conversation
#251 applies. I would expect to see only fp32 -> fp32 and fp64 -> fp64 stddev, mean, etc. |
I thought about this, since the discussio is active here, I decided to keep the promotion. I am also aligned to non-promotion. I will update this. |
extensions/functions_arithmetic.yaml
Outdated
- args: | ||
- options: [ SILENT, SATURATE, ERROR ] | ||
required: false | ||
- value: i8 | ||
nullability: DECLARED_OUTPUT | ||
decomposable: MANY | ||
intermediate: "fp64?" | ||
return: i8? | ||
- args: | ||
- options: [ SILENT, SATURATE, ERROR ] | ||
required: false | ||
- value: i16 | ||
nullability: DECLARED_OUTPUT | ||
decomposable: MANY | ||
intermediate: "fp64?" | ||
return: i16? | ||
- args: | ||
- options: [ SILENT, SATURATE, ERROR ] | ||
required: false | ||
- value: i32 | ||
nullability: DECLARED_OUTPUT | ||
decomposable: MANY | ||
intermediate: "fp64?" | ||
return: i32? | ||
- args: | ||
- options: [ SILENT, SATURATE, ERROR ] | ||
required: false | ||
- value: i64 | ||
nullability: DECLARED_OUTPUT | ||
decomposable: MANY | ||
intermediate: "fp64?" | ||
return: i64? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a use case in mind for stddev on integer types? This seems like one of the things where you'd promote to fp32 or fp64 first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I looked into the usage in higher level, so even if we use integers the output should be in float. That's why I included the return as fp64
. I am not sure about a practical use case for this, assuming we are excluding a data type, does that mean Substrait doesn't approve standard deviation calculated on integer values?
extensions/functions_arithmetic.yaml
Outdated
required: false | ||
- value: fp32 | ||
nullability: DECLARED_OUTPUT | ||
decomposable: MANY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure standard deviation isn't decomposable. AFAIK it can only be computed once the mean is already known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please correct me if I am wrong.
Standard deviation is the sqrt(variance)
.
Variance can be calculated by using mean
(avg
) and deviation for each sample.
So if we partition the data, we can calculate the mean for each partition in an embarassingly parallel manner. Then using an allreduce
(aggregate in parallel) call we can find the global mean
. Then we can calculate the deviation
for each sample in parallel for each partition. Again we can use allreduce
to sum these values up and also get the count of values in each partition. Now we have total sample count and deviation sum for all values. In the rank=0 process we can divide the sum by sample count. Then taking the sqrt
we can calculate std_dev
.
Is this wrong? Am I not following the definition of decomposability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if this is correct, my definition is wrong, there has to be a struct<V1, V2>
, I need to fix this either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for parallelizing stddev in general, but violates causality for the way Substrait deals with decomposition. Decomposition in Substrait means separating the aggregation between independent plans, so for instance one plan is pushed to several storage nodes and does the first part of the aggregation, and then another plan operates on a column containing the results from those plans and finishes the computation. That's why the intermediate type matters for as far as Substrait is concerned; how it's parallelized at the level of a single plan is not Substrait's concern. When applied to stddev, you can first indeed compute the mean in a decomposable manner, but then you need that mean and all individual rows in order to compute the variance (which again you could do in parallel, but you don't have the individual rows anymore at this point). At best the intermediate type could be something like struct<fp64, list<fp64>>
to track the mean and all the values, but that seems a bit silly; at that point you can make all aggregates "decomposable" by just making the intermediate type the list of all input values thus far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, this is the way that Calcite does decomposition of these functions:
AVG(x) → SUM(x) / COUNT(x)
STDDEV_POP(x) → SQRT( (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / COUNT(x))
STDDEV_SAMP(x) → SQRT( (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END)
VAR_POP(x) → (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / COUNT(x)
VAR_SAMP(x) → (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END
COVAR_POP(x, y) → (SUM(x * y) - SUM(x, y) * SUM(y, x) / REGR_COUNT(x, y)) / REGR_COUNT(x, y)
COVAR_SAMP(x, y) → (SUM(x * y) - SUM(x, y) * SUM(y, x) / REGR_COUNT(x, y)) / CASE REGR_COUNT(x, y) WHEN 1 THEN NULL ELSE REGR_COUNT(x, y) - 1 END
REGR_SXX(x, y) → REGR_COUNT(x, y) * VAR_POP(y)
REGR_SYY(x, y) → REGR_COUNT(x, y) * VAR_POP(x)
From there, SUM is decomposed into SUM0 in numerous places.
That being said, I don't think this decomposition needs to be defined within the context of the function. I'd be inclined for this to be done as a rewrite as opposed to a decomposable function. (The decomposition here is far more complex than what is intended for the decomposable concept.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation @jacques-n
Should these be included in an Appendix section somewhere in the documentation. For instance, there are terminologies/standards that the Substrait community is trying to bring forward. So it would be very helpful if pointers to such definitions are there with more facts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who is the audience for the decomposition field? Is this advice to consumers / planners on how they might be able to decompose something? Is it a restriction that limits what decompositions are allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the audience is a hypothetical optimizer & task distribution engine. If it gets a plan with a decomposable aggregation function, it can use the decomposability information and intermediate type to split the plan up and distribute it over nodes. Using [X]
to signify a column of type X
, saying an aggregate function from [X] -> Z
is ONE
- or MANY
-decomposable with intermediate type Y
implies that [X] -> Y
(INITIAL_TO_INTERMEDIATE
) and [Y] -> Z
(INTERMEDIATE_TO_RESULT
) can also be used in plans, and that piping the results of a number of [X] -> Y
invocations to a single [Y] -> Z
invocation yields the same result as doing [X] -> Z
on all X
s in one go. If it's MANY
-decomposable [Y] -> Y
(INTERMEDIATE_TO_INTERMEDIATE
) also exists, allowing a distributor to use more exotic distribution patterns as well. Selection between these modes of operation happen using the "aggregation phase" option when binding the aggregate function (the default being INITIAL_TO_RESULT
for the [X] -> Z
variant), so effectively you're declaring several aggregate functions in one go by defining one to be decomposable.
How an engine implements a function internally (parallelization strategy, distribution over nodes, etc.) is out of scope for Substrait. It's ultimately just an interface, after all; such things are implementation details that are intentionally hidden by that interface.
a2791cb
to
e95afb5
Compare
@jvanstraten updated the PR with the recent suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't know how I feel about the integer variants of these functions. I don't really see a use case for them; in fact I'd sooner implement them for decimals than for integers.
extensions/functions_arithmetic.yaml
Outdated
decomposable: MANY | ||
intermediate: "fp64?" | ||
return: fp64? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what doesn't look right is that these three lines came out of nowhere at the bottom of the file. They should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't catch that sorry. Now it should be okay.
extensions/functions_arithmetic.yaml
Outdated
- args: | ||
- options: [ SILENT, SATURATE, ERROR ] | ||
required: false | ||
- value: fp32 | ||
nullability: DECLARED_OUTPUT | ||
return: fp32? | ||
- args: | ||
- options: [ SILENT, SATURATE, ERROR ] | ||
required: false | ||
- value: fp64 | ||
nullability: DECLARED_OUTPUT | ||
return: fp64? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the floating point rounding options rather than integer overflow options for floats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure, if I understood this right, we are not including this for integer inputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still an open question to me if the integer input versions should even exist (I feel pretty strongly that they shouldn't).
If they do, well, you're going to have to specify something about rounding for sure, just not the floating point options because those are specific to floating points. I mean, you're doing divisions and a sqrt at the end of the computation there. Division is defined for integers but won't do what you want (it rounds toward zero or negative infinity depending on who you ask, which in itself is a problem, but also means it's statistically biased), but sqrt certainly isn't.
Anyone intending to use the integer versions is still going to expect floating point return types from these. But then the question is which, i.e. fp32 or fp64? Or even some decimal? The solution to that, from a user-perspective, would be to just cast to the real-number type of your choice first, and then use that version of the function. So the integer versions would never be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That make sense, let me remove the integer types from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the integer types.
ca7bb91
to
34e1368
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now.
Hm, CI doesn't agree with me though.
|
@jvanstraten, I think the intermediate change was only for window functions. Looks like the same should be applied to aggregates. |
Ah, I forgot those were separate things in the YAML schema. That makes sense. |
34e1368
to
6725d86
Compare
@jacques-n @jvanstraten I squashed it to a single commit message. Let's check the CI's. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made the schema change myself. LGTM now.
@vibhatha , can you rebase so there aren't merge conflicts? |
20d5959
to
995fc2f
Compare
6ada42b
to
ac57859
Compare
@jacques-n rebased and resolved conflicts. Please check it once more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vibhatha !
This PR includes standard_deviation definition.