Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add histogram aggregation function #8724

Merged
merged 7 commits into from
May 25, 2022

Conversation

jasperjiaguo
Copy link
Contributor

@jasperjiaguo jasperjiaguo commented May 18, 2022

This effort is part of the issue #8493
The histogram aggregation function for single value numerical columns is implemented.
usage example:
Histogram(columnName, ARRAY[0,1,10,100]) to specify bins [0,1), [1,10), [10,1000]
HISTOGRAM(intColumn,ARRAY["-Infinity",0,1,10,100,1000, "+Infinity"]) to specify bins (-inf, 0), [0,1), [1,10), [10,1000), [1000, +inf)
Histogram(columnName, 0, 1000, 10) to specify 10 equal-length bins [0,100), [100,200), ..., [900,1000]
This is a proof of concept histogram function, the following are some TODOs that we want to address in a follow up pr:

  1. Add support for multi-value column
  2. Add support for automatically collect min/max values when lower/upper bounds are not specified
  3. Count for outlier values as an additional bin

@jasperjiaguo jasperjiaguo force-pushed the histogram branch 5 times, most recently from da24002 to 9f2fa24 Compare May 18, 2022 06:39
@codecov-commenter
Copy link

codecov-commenter commented May 18, 2022

Codecov Report

Merging #8724 (fced48a) into master (d2396dd) will decrease coverage by 0.01%.
The diff coverage is 64.02%.

@@             Coverage Diff              @@
##             master    #8724      +/-   ##
============================================
- Coverage     69.68%   69.67%   -0.02%     
- Complexity     4577     4618      +41     
============================================
  Files          1729     1736       +7     
  Lines         90183    91180     +997     
  Branches      13415    13632     +217     
============================================
+ Hits          62848    63529     +681     
- Misses        22971    23226     +255     
- Partials       4364     4425      +61     
Flag Coverage Δ
integration1 26.98% <0.00%> (-0.12%) ⬇️
integration2 25.32% <0.00%> (-0.10%) ⬇️
unittests1 66.15% <64.02%> (-0.01%) ⬇️
unittests2 14.18% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...java/org/apache/pinot/common/utils/DataSchema.java 69.00% <50.00%> (+0.25%) ⬆️
...e/query/aggregation/utils/DoubleVectorOpUtils.java 52.94% <52.94%> (ø)
...egation/function/HistogramAggregationFunction.java 64.45% <64.45%> (ø)
...gregation/function/AggregationFunctionFactory.java 81.20% <100.00%> (+0.28%) ⬆️
...che/pinot/segment/spi/AggregationFunctionType.java 89.53% <100.00%> (+0.12%) ⬆️
...g/apache/pinot/server/api/resources/ErrorInfo.java 0.00% <0.00%> (-100.00%) ⬇️
...t/server/api/resources/DefaultExceptionMapper.java 0.00% <0.00%> (-75.00%) ⬇️
...che/pinot/controller/util/TableMetadataReader.java 63.41% <0.00%> (-36.59%) ⬇️
...icate/FSTBasedRegexpPredicateEvaluatorFactory.java 56.25% <0.00%> (-33.75%) ⬇️
.../index/loader/invertedindex/RangeIndexHandler.java 49.55% <0.00%> (-17.70%) ⬇️
... and 116 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2396dd...fced48a. Read the comment docs.

@jasperjiaguo jasperjiaguo changed the title [WIP]Add histogram aggregation function feature Add histogram aggregation function feature May 18, 2022
@jasperjiaguo jasperjiaguo force-pushed the histogram branch 2 times, most recently from 8c43cd7 to 67bb742 Compare May 18, 2022 19:12
@Jackie-Jiang
Copy link
Contributor

Suggest changing the function name to histogram (remove the SV). We should be able to use the same function to handle both SV and MV in the future.

@jasperjiaguo jasperjiaguo force-pushed the histogram branch 5 times, most recently from 6672bac to 1ad061f Compare May 20, 2022 00:28
@siddharthteotia siddharthteotia merged commit d4c99ab into apache:master May 25, 2022
@siddharthteotia siddharthteotia changed the title Add histogram aggregation function feature Add histogram aggregation function May 25, 2022
@siddharthteotia siddharthteotia added the release-notes Referenced by PRs that need attention when compiling the next release notes label May 25, 2022
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the feature. Please also update the pinot doc to include this new function.

}

public static DoubleArrayList vectorAdd(DoubleArrayList a, DoubleArrayList b) {
Preconditions.checkState(a.size() == b.size(), "The two operand arrays are not of the same size! provided %s, %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified to:

return vectorAdd(a, b.elements());

}

public static DoubleArrayList vectorAdd(DoubleArrayList a, double[] b) {
Preconditions.checkState(a.size() == b.length, "The two operand arrays are not of the same size! provided %s, %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call a.elements() first and cache the value, then operate on 2 double[] for better performance. Not sure if jvm will inline all the method calls. Same for incrementElement()

public static DoubleArrayList vectorAdd(DoubleArrayList a, double[] b) {
Preconditions.checkState(a.size() == b.length, "The two operand arrays are not of the same size! provided %s, %s",
a.size(), b.length);
;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants