Realtime pre-aggregation for Distinct Count HLL & Big Decimal #10926
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master   #10926      +/-   ##
==========================================
- Coverage    0.11%    0.11%   -0.01%
==========================================
  Files        2218     2225       +7
  Lines      119138   119440     +302
  Branches    18022    18092      +70
==========================================
  Hits          137      137
- Misses     118981   119283     +302
  Partials       20       20
```

... and 90 files with indirect coverage changes
This is awesome!
Currently we don't have null handling for aggregate metrics. `nullHandlingEnabled` is modeled as a query option, and aggregate metrics rely on the behavior of the default null value. If we want to handle null (basically ignore null for aggregate metrics), we should introduce a config for that, and I think that can be done as a separate PR
Resolved (outdated): pinot-core/src/test/java/org/apache/pinot/core/common/ObjectSerDeUtilsTest.java
Resolved (outdated): ...src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountHLLValueAggregator.java
Resolved (outdated): ...src/main/java/org/apache/pinot/segment/local/aggregator/DistinctCountHLLValueAggregator.java
```java
if (bytes == null || bytes.length == 0) {
  return new HyperLogLog(_log2m);
}
```
Is this needed? We don't have this special handling in other functions
`deserializeAggregatedValue` is often called from `getInitialAggregatedValue` in some of the implementations (not just HLL or sum precision); one example is `AvgValueAggregator` when the input is of type bytes. So to cover that case, I also do it here
I don't follow. Even if it is used in `getInitialAggregatedValue()`, the input should never be null or empty. Are we trying to handle invalid input data (e.g. an empty byte array)? If so, the handling should be added to `getInitialAggregatedValue()` and `applyRawValue()` instead of here
```java
@@ -37,11 +37,17 @@ public DataType getAggregatedValueType() {

  @Override
  public Double getInitialAggregatedValue(Number rawValue) {
    if (rawValue == null) {
```
Can this ever be `null`?
It could be if the field is null on some of the incoming stream messages. You could always guard against this by filtering out the messages during ingestion, but I think returning 0.0 here as the default should be harmless since this is the sum aggregator
Currently the input should never be null (it should have already been filled with the default value). My concern is that we are adding null handling to only this aggregation but not others. In order to completely support null input, we need to allow null values in, annotate the input value as `@Nullable`, and support it for all aggregations. That is not in the scope of this PR, so I suggest doing it separately
@Jackie-Jiang, addressed your comments, please review again
Mostly good
```java
String log2mLit = arguments.get(1).getLiteral().getStringValue();
Preconditions.checkState(StringUtils.isNumeric(log2mLit), "log2m argument must be a numeric literal");

_log2m = Integer.parseInt(log2mLit);
```
(minor)

```diff
- String log2mLit = arguments.get(1).getLiteral().getStringValue();
- Preconditions.checkState(StringUtils.isNumeric(log2mLit), "log2m argument must be a numeric literal");
- _log2m = Integer.parseInt(log2mLit);
+ _log2m = arguments.get(1).getLiteral().getIntValue();
```
Doesn't this remove the check that it's numeric, though?
```java
Preconditions.checkState(StringUtils.isNumeric(log2mLit), "log2m argument must be a numeric literal");

_log2m = Integer.parseInt(log2mLit);
_log2mByteSize = (new HyperLogLog(_log2m)).getBytes().length;
```
We can add a util to get the byte size without serializing:

```java
byteSize = (RegisterSet.getSizeForCount(1 << log2m) + 2) * Integer.BYTES;
```
```java
}

/*
Aggregate with a optimal maximum precision in mind. Scale is always only 1 32-bit
```
(code format) We usually indent (add 2 spaces) the block comment
```java
String precision = arguments.get(1).getLiteral().getStringValue();
Preconditions.checkState(StringUtils.isNumeric(precision), "precision must be a numeric literal");

_fixedSize = BigDecimalUtils.byteSizeForFixedPrecision(Integer.parseInt(precision));
```
```diff
- String precision = arguments.get(1).getLiteral().getStringValue();
- Preconditions.checkState(StringUtils.isNumeric(precision), "precision must be a numeric literal");
- _fixedSize = BigDecimalUtils.byteSizeForFixedPrecision(Integer.parseInt(precision));
+ _fixedSize = BigDecimalUtils.byteSizeForFixedPrecision(arguments.get(1).getLiteral().getIntValue());
```
```diff
@@ -142,7 +143,8 @@ static class Record {
     for (AggregationFunctionColumnPair functionColumnPair : functionColumnPairs) {
       _metrics[index] = functionColumnPair.toColumnName();
       _functionColumnPairs[index] = functionColumnPair;
-      _valueAggregators[index] = ValueAggregatorFactory.getValueAggregator(functionColumnPair.getFunctionType());
+      _valueAggregators[index] =
+          ValueAggregatorFactory.getValueAggregator(functionColumnPair.getFunctionType(), Collections.EMPTY_LIST);
```
(nit)

```diff
- ValueAggregatorFactory.getValueAggregator(functionColumnPair.getFunctionType(), Collections.EMPTY_LIST);
+ ValueAggregatorFactory.getValueAggregator(functionColumnPair.getFunctionType(), Collections.emptyList());
```
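For context on the nit above, a minimal standalone illustration (class name `EmptyListDemo` is made up for this sketch): `Collections.EMPTY_LIST` is a raw `List`, so assigning it to a parameterized type compiles only with an unchecked-conversion warning, while `Collections.emptyList()` infers the element type.

```java
import java.util.Collections;
import java.util.List;

// Why Collections.emptyList() is preferred over Collections.EMPTY_LIST:
// emptyList() is generically typed, so the assignment below is warning-free.
public class EmptyListDemo {
    static List<String> typedEmpty() {
        // Using Collections.EMPTY_LIST here would emit an unchecked warning,
        // because EMPTY_LIST is a raw List.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(typedEmpty().size()); // prints 0
    }
}
```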
```java
List<ExpressionContext> arguments = functionContext.getArguments();

if (("distinctcounthll".equals(functionContext.getFunctionName()))
```
We need to use the canonical name (removing underscores). Currently if the function name is `distinct_count_hll` or `sum_precision` it will fail
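To make the suggestion above concrete, a hypothetical sketch (not the actual Pinot implementation): canonicalizing a function name means lower-casing it and stripping underscores, so `distinct_count_hll` and `DISTINCTCOUNTHLL` resolve to the same aggregator key.

```java
// Hypothetical helper illustrating function-name canonicalization.
public class CanonicalName {
    static String canonicalize(String functionName) {
        // Drop underscores and normalize case so spelling variants match.
        return functionName.replace("_", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("distinct_count_hll")); // prints distinctcounthll
        System.out.println(canonicalize("SUM_PRECISION"));      // prints sumprecision
    }
}
```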
Resolved: ...ava/org/apache/pinot/segment/local/realtime/impl/forward/FixedByteSVMutableForwardIndex.java
```java
@@ -269,13 +272,14 @@ public MutableIndex createMutableIndex(MutableIndexContext context, ForwardIndex
  String column = context.getFieldSpec().getName();
  String segmentName = context.getSegmentName();
  FieldSpec.DataType storedType = context.getFieldSpec().getDataType().getStoredType();
  int maxLength = context.getFieldSpec().getMaxLength();
```
(MAJOR) I don't think this is the correct way to pass this information. We can probably add the fixed length info into the `MutableIndexContext` to avoid modifying the field spec
```java
@@ -621,6 +631,10 @@ private void addNewRow(int docId, GenericRow row) {
  case DOUBLE:
    forwardIndex.add(((Number) value).doubleValue(), -1, docId);
    break;
  case BIG_DECIMAL:
```
The above comment no longer applies. We should probably add a comment about using byte[] to support `BIG_DECIMAL`. It works because `BIG_DECIMAL` is actually stored as byte[] underneath
@Jackie-Jiang pls re-review PR, the failed build looks like a flaky test
- Avoid HLL dependency in pinot-spi
- Simplify byte size computation for HLL
- Fix the table config validation logic. We allow type mismatch as long as it can be converted (e.g. numbers are compatible)
- Cleanup and reformat
Feature description

This PR expands realtime aggregation to support DISTINCTCOUNTHLL and SUMPRECISION (which supports Big Decimal).

DISTINCTCOUNTHLL, here's an example config in the realtime table: Here, `customer` is the field that we want to count uniquely; in this case it is a string holding the id of the customer. 12 is the log2m, which defines the accuracy of the HLL algorithm. The output field, `distinctcounthll_customer`, is the bytes form of that HLL object, which can then be queried as normal using the DISTINCTCOUNTHLL or DISTINCTCOUNTHLLRAW query operators. Here is the schema for the above:
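A sketch of what such an ingestion aggregation config could look like (the `ingestionConfig`/`aggregationConfigs` keys follow Pinot's table-config convention and are an assumption here, not copied verbatim from this PR; the column names come from the description above):

```json
{
  "ingestionConfig": {
    "aggregationConfigs": [
      {
        "columnName": "distinctcounthll_customer",
        "aggregationFunction": "DISTINCTCOUNTHLL(customer, 12)"
      }
    ]
  }
}
```

The matching schema would, under the same assumption, declare `distinctcounthll_customer` as a metric field of type BYTES so the serialized HLL can be stored in the segment.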
SUMPRECISION, for big decimal pre-aggregation, example: Here, `parsed_amount` is the input used to create a big decimal object of precision 38. Precision 38 is used to compute a byte size upper limit; this will be the maximum size, and it must be defined. While in theory big decimals can be unlimited, here we specify a maximum size upfront, so larger big decimals won't be supported. `sum_precision_parsed_amount` is the output that can be queried using the SUMPRECISION query operator. Here is the schema for the above:
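As with the HLL case, an illustrative sketch of the config (the config keys are an assumption based on Pinot's ingestion aggregation convention, not copied from this PR; the column names and precision come from the description above):

```json
{
  "ingestionConfig": {
    "aggregationConfigs": [
      {
        "columnName": "sum_precision_parsed_amount",
        "aggregationFunction": "SUMPRECISION(parsed_amount, 38)"
      }
    ]
  }
}
```

The corresponding schema would, under the same assumption, declare `sum_precision_parsed_amount` as a metric field of type BIG_DECIMAL, with the precision bounding the fixed byte size of each stored value.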
Both of these features use the FixedByteSVMutableForwardIndex under the hood.
Testing

I have added unit tests covering the FixedByteSVMutableForwardIndex and the serialization functions, as well as tests covering aggregation by writing data to the index and reading it back.