-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Variance and Standard Deviation Aggregation Functions #9910
Conversation
b5fa523
to
30e7324
Compare
30e7324
to
de8dcc4
Compare
Codecov Report
@@ Coverage Diff @@
## master #9910 +/- ##
============================================
+ Coverage 64.30% 64.33% +0.03%
- Complexity 5034 5531 +497
============================================
Files 1928 1931 +3
Lines 103874 104243 +369
Branches 15823 15882 +59
============================================
+ Hits 66793 67067 +274
- Misses 32254 32333 +79
- Partials 4827 4843 +16
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
de8dcc4
to
11cb6e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM over all.
pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java
Outdated
Show resolved
Hide resolved
checkWithPrecisionForStandardDeviation((VarianceTuple) aggregationResult.get(7), NUM_RECORDS, expectedDoubleSum, | ||
expectedStdDevs[7].getResult(), true); | ||
|
||
// Update the expected result by 3 more times (broker query will compute 4 identical segments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variance on 4 identical segments is the same as variance on one of each segment. Should probably to consider using getDistinctInstances
in BaseQueriesTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasperjiaguo Can you elaborate on your suggestion?
I referred covariance test and it also uses getBrokerResponse()
to get the result. According to the documentation, getBrokerResponse()
seems to compute the result from 4 identical segments. FYI,getBrokerResponse()
internally calls getDistinctInstances()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in CovarianceQueriesTest
, @SabrinaZhaozyf has added the getDistinctInstances() to use 4 distinct segments for these statistical functions, because Covar on 4 identical segments is the same as Covar on one of each segment, as kind of defeats the purpose of testing merge logic over multiple segments. The getDistinctInstances
in base class is still returning 4 identical segments though for backward compat. We can probably inherit CovarianceQueriesTest
in VarianceQueriesTest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasperjiaguo Thanks for pointing out. I actually merged Covariance & Variance tests
. This actually helped me to identify a bug when we merging empty VarianceTuple
with non empty VarianceTuple
!
Can you go over the pr once more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java
Show resolved
Hide resolved
* Sample Variances" by Chan et al. Please refer to the "Parallel Algorithm" section from the following wiki: | ||
* - https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm | ||
*/ | ||
public class VarianceAggregationFunction extends BaseSingleInputAggregationFunction<VarianceTuple, Double> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
majority of this code is shared with CoVariance agg. can we abstract out some comment utilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, they look very similar but the implementations of all functions are slightly different due to the algorithm difference. Also, we use a different intermediate class VarianceTuple
instead of CovarianceTuple
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took out a common function that can be shared and created a util class. Please take a look
11cb6e2
to
dfc4304
Compare
This PR adds `VAR_POP()`, `VAR_SAMP()`, `STDDEV_POP()`, `STDDEV_SAMP()` aggregation functions.
bf2e529
to
01c4be6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm overall. will let @Jackie-Jiang take another pass.
refactoring of taking out common utils can be done separately IMO.
01c4be6
to
b70d7dd
Compare
847996c
to
38c8821
Compare
} | ||
|
||
@Test | ||
public void testStandardDeviationAggregationOnly() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saw some instability on https://github.com/apache/pinot/actions/runs/3652424011/jobs/6170799903.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed with #9948
This PR adds
VAR_POP()
,VAR_SAMP()
,STDDEV_POP()
,STDDEV_SAMP()
aggregation functions.