Add PercentileSmartTDigestAggregationFunction #8565

Jackie-Jiang · 2022-04-19T18:41:56Z

Description

Adds PercentileSmartTDigestAggregationFunction, which automatically converts the DoubleArrayList to a TDigest when the list grows too large, protecting the servers from running out of memory. The conversion applies only to aggregation-only queries, not to group-by queries.

By default, when the list size exceeds 100K, the list is converted to a TDigest with a compression of 100.
The threshold and compression can be configured through the third argument of the function, which is a string literal:

threshold: list size threshold to trigger the conversion, non-positive means never convert (default 100K)
compression: compression of the converted TDigest (default 100)

Example query:
SELECT PERCENTILE_SMART_TDIGEST(myCol, 95, 'threshold=10;compression=50') FROM myTable
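The third-argument format shown above (`key=value` pairs separated by `;`) could be parsed roughly as follows. This is a minimal sketch and not the code from this PR: the class name `SmartTDigestParams`, its fields, and the `shouldConvert` helper are hypothetical, and rejecting unknown keys is an assumption. Only the two keys, their defaults, and the "non-positive means never convert" rule come from the description.

```java
import java.util.Locale;

// Hypothetical helper, not part of Pinot: parses 'threshold=10;compression=50'.
class SmartTDigestParams {
  static final int DEFAULT_THRESHOLD = 100_000;
  static final int DEFAULT_COMPRESSION = 100;

  final int threshold;
  final int compression;

  SmartTDigestParams(int threshold, int compression) {
    this.threshold = threshold;
    this.compression = compression;
  }

  static SmartTDigestParams parse(String literal) {
    int threshold = DEFAULT_THRESHOLD;
    int compression = DEFAULT_COMPRESSION;
    if (literal != null) {
      for (String pair : literal.split(";")) {
        String trimmed = pair.trim();
        if (trimmed.isEmpty()) {
          continue; // tolerate empty input and trailing ';'
        }
        String[] keyValue = trimmed.split("=", 2);
        if (keyValue.length != 2) {
          throw new IllegalArgumentException("Invalid parameter: " + trimmed);
        }
        String key = keyValue[0].trim().toLowerCase(Locale.ROOT);
        int value = Integer.parseInt(keyValue[1].trim());
        switch (key) {
          case "threshold":
            threshold = value;
            break;
          case "compression":
            compression = value;
            break;
          default:
            // Assumption: unknown keys are rejected rather than ignored.
            throw new IllegalArgumentException("Unknown parameter: " + key);
        }
      }
    }
    return new SmartTDigestParams(threshold, compression);
  }

  // A non-positive threshold means "never convert"; otherwise convert once
  // the list size exceeds the threshold.
  boolean shouldConvert(int listSize) {
    return threshold > 0 && listSize > threshold;
  }
}
```

With the example query above, `parse("threshold=10;compression=50")` yields a threshold of 10 and a compression of 50, so the eleventh value would trigger the conversion.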

Release Notes

Adds PercentileSmartTDigestAggregationFunction, which automatically stores values in a DoubleArrayList or a TDigest depending on the number of values.

siddharthteotia · 2022-04-20T17:23:09Z

...c/main/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactory.java

@@ -52,6 +52,9 @@ public static AggregationFunction getAggregationFunction(FunctionContext functio
      ExpressionContext firstArgument = arguments.get(0);
      if (upperCaseFunctionName.startsWith("PERCENTILE")) {
        String remainingFunctionName = upperCaseFunctionName.substring(10);
+        if (remainingFunctionName.equals("SMARTTDIGEST")) {


Should this be a case-insensitive match?

The function name is already canonicalized, so there is no need to ignore case.
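To illustrate why an exact `equals` is enough here, a minimal sketch of the idea follows. It assumes canonicalization means removing underscores and upper-casing, which is inferred from the query using `PERCENTILE_SMART_TDIGEST` while the diff compares against `SMARTTDIGEST`. The class and method names are hypothetical and are not Pinot's API.

```java
import java.util.Locale;

// Hypothetical sketch: canonicalize once, then compare exactly.
class FunctionNameSketch {
  // Assumption: canonical form is upper case with underscores removed.
  static String canonicalize(String functionName) {
    return functionName.replace("_", "").toUpperCase(Locale.ROOT);
  }

  static boolean isPercentileSmartTDigest(String functionName) {
    String canonical = canonicalize(functionName);
    if (!canonical.startsWith("PERCENTILE")) {
      return false;
    }
    // "PERCENTILE".length() == 10, matching substring(10) in the diff above.
    String remaining = canonical.substring(10);
    return remaining.equals("SMARTTDIGEST");
  }
}
```

Because every spelling of the name maps to the same canonical string before the comparison, the comparison itself does not need to handle case.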

siddharthteotia · 2022-04-20T17:24:12Z

.../apache/pinot/core/query/aggregation/function/PercentileSmartTDigestAggregationFunction.java

+  public void aggregate(int length, AggregationResultHolder aggregationResultHolder,
+      Map<ExpressionContext, BlockValSet> blockValSetMap) {
+    BlockValSet blockValSet = blockValSetMap.get(_expression);
+    validateValueType(blockValSet);


Doing this once (on the first call to aggregate) should be enough?

Yes, but because the aggregation function itself is stateless (shared across threads), we cannot add a field to the function to track whether this is the first call. The overhead of this call should be minimal. Once we enforce the schema, we should be able to perform all of these checks on the broker side.
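The constraint can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `ResultHolder`, `ValueType`, and `PercentileFunctionSketch` are simplified, hypothetical stand-ins for Pinot's AggregationResultHolder, the block's value type, and the aggregation function. The function object keeps only immutable configuration, all mutable state lives in the per-query holder, and the cheap validation therefore runs on every call.

```java
import java.util.ArrayList;
import java.util.List;

class StatelessAggregationSketch {
  // Hypothetical stand-in for the per-query result holder.
  static final class ResultHolder {
    List<Double> values; // mutable state lives here, not in the function
  }

  // Hypothetical stand-in for the value type of a block.
  enum ValueType { INT, LONG, FLOAT, DOUBLE, STRING }

  static final class PercentileFunctionSketch {
    private final double percentile; // immutable config, safe to share across threads

    PercentileFunctionSketch(double percentile) {
      this.percentile = percentile;
    }

    double getPercentile() {
      return percentile;
    }

    void aggregate(ValueType type, double[] block, ResultHolder holder) {
      // No "first call" flag can be stored on the shared function,
      // so the check is repeated on each call.
      validateValueType(type);
      if (holder.values == null) {
        holder.values = new ArrayList<>();
      }
      for (double value : block) {
        holder.values.add(value);
      }
    }

    private static void validateValueType(ValueType type) {
      if (type == ValueType.STRING) {
        throw new IllegalArgumentException("Unsupported value type: " + type);
      }
    }
  }
}
```

Two queries can share one `PercentileFunctionSketch` instance as long as each passes its own `ResultHolder`.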

Add PercentileSmartTDigestAggregationFunction

8f22c80

Jackie-Jiang added the feature and release-notes labels Apr 19, 2022

Jackie-Jiang requested a review from xiangfu0 April 19, 2022 18:41

xiangfu0 approved these changes Apr 19, 2022

Jackie-Jiang merged commit b025f43 into apache:master Apr 20, 2022

Jackie-Jiang deleted the percentile_smart_tdigest branch April 20, 2022 17:05

siddharthteotia reviewed Apr 20, 2022
