diff --git a/SUMMARY.md b/SUMMARY.md
index 23cd71ab..43afa734 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -303,6 +303,10 @@
     * [percentilemv](configuration-reference/functions/percentilemv.md)
     * [percentiletdigest](configuration-reference/functions/percentiletdigest.md)
     * [percentiletdigestmv](configuration-reference/functions/percentiletdigestmv.md)
+    * [percentilekll](configuration-reference/functions/percentilekll.md)
+    * [percentilerawkll](configuration-reference/functions/percentilerawkll.md)
+    * [percentilekllmv](configuration-reference/functions/percentilekllmv.md)
+    * [percentilerawkllmv](configuration-reference/functions/percentilerawkllmv.md)
     * [quarter](configuration-reference/functions/quarter.md)
     * [regexpExtract](configuration-reference/functions/regexpextract.md)
     * [regexpReplace](configuration-reference/functions/regexpreplace.md)
diff --git a/configuration-reference/functions/README.md b/configuration-reference/functions/README.md
index 5049cb50..44a9badf 100644
--- a/configuration-reference/functions/README.md
+++ b/configuration-reference/functions/README.md
@@ -375,6 +375,22 @@ This page contains reference documentation for functions in Apache Pinot.
 [percentilemv.md](percentilemv.md)
 {% endcontent-ref %}
 
+{% content-ref url="percentilekll.md" %}
+[percentilekll.md](percentilekll.md)
+{% endcontent-ref %}
+
+{% content-ref url="percentilerawkll.md" %}
+[percentilerawkll.md](percentilerawkll.md)
+{% endcontent-ref %}
+
+{% content-ref url="percentilekllmv.md" %}
+[percentilekllmv.md](percentilekllmv.md)
+{% endcontent-ref %}
+
+{% content-ref url="percentilerawkllmv.md" %}
+[percentilerawkllmv.md](percentilerawkllmv.md)
+{% endcontent-ref %}
+
 {% content-ref url="quarter.md" %}
 [quarter.md](quarter.md)
 {% endcontent-ref %}
diff --git a/configuration-reference/functions/percentilekll.md b/configuration-reference/functions/percentilekll.md
new file mode 100644
index 00000000..01c0d1c9
--- /dev/null
+++ b/configuration-reference/functions/percentilekll.md
@@ -0,0 +1,47 @@
+---
+description: >-
+  This section contains reference documentation for the PERCENTILEKLL function.
+---
+
+# PERCENTILEKLL
+
+`KLL Sketch` is an approximate quantiles algorithm that targets optimal space for a given accuracy. `PERCENTILEKLL` is a percentile aggregation function based on the Apache DataSketches [KLL Doubles Sketch](https://datasketches.apache.org/docs/KLL/KLLSketch.html) implementation.
+
+Pinot also offers a 'raw' variant, `PERCENTILERAWKLL`, which returns the serialized sketch. The serialized sketch can be used to calculate a 'rank' or a 'histogram'.
+
+All variants of `PercentileKLL` also support raw sketches stored in Pinot columns. This means you can create KLL Doubles sketches outside of Pinot and ingest them into columns as binary strings. `PercentileKLL` will identify these columns and merge the sketches to produce aggregate results.
+
+## Signature
+
+> PercentileKLL(column, percentile, kValue) -> Double
+
+* `column` (required): Name of the column to aggregate on. If the column is a multi-value column, use the `PERCENTILEKLLMV` variant.
+* `percentile` (required): Percentile value to be calculated, in the range [0..100].
+* `kValue` (optional): Integer value that determines the size of the sketch. The default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).
+
+## Usage Examples
+
+```sql
+select percentileKLL(ArrDelayMinutes, 90) as DelayP90
+from airlineStats
+```
+
+| DelayP90 |
+| -------- |
+| 40       |
+
+```sql
+select Carrier, percentileKll(ArrDelay, 50, 600) as MedianDelay
+from airlineStats
+where ArrDelay > 0
+group by Carrier
+order by 2 desc
+limit 3
+```
+
+| Carrier | MedianDelay |
+| ------- | ----------- |
+| MQ      | 28          |
+| B6      | 28          |
+| EV      | 24          |
+
diff --git a/configuration-reference/functions/percentilekllmv.md b/configuration-reference/functions/percentilekllmv.md
new file mode 100644
index 00000000..766c9aa9
--- /dev/null
+++ b/configuration-reference/functions/percentilekllmv.md
@@ -0,0 +1,27 @@
+---
+description: >-
+  This section contains reference documentation for the PERCENTILEKLLMV function.
+---
+
+# PERCENTILEKLLMV
+
+Variant of the `PERCENTILEKLL` aggregation function that accepts multi-value columns. Values in the given column are 'flattened' before aggregation, so the function produces a single value for the given percentile.
+
+## Signature
+
+> PercentileKLLMV(column, percentile, kValue) -> Double
+
+* `column` (required): Name of the column to aggregate on.
+* `percentile` (required): Percentile value to be calculated, in the range [0..100].
+* `kValue` (optional): Integer value that determines the size of the sketch. The default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).
+
+## Usage Examples
+
+```sql
+select percentileKLLMV(ArrOfInts, 90) as value
+from MyTable
+```
+
+| value  |
+| ------ |
+| 40     |
diff --git a/configuration-reference/functions/percentilerawkll.md b/configuration-reference/functions/percentilerawkll.md
new file mode 100644
index 00000000..6b4bcd69
--- /dev/null
+++ b/configuration-reference/functions/percentilerawkll.md
@@ -0,0 +1,26 @@
+---
+description: >-
+  This section contains reference documentation for the PERCENTILERAWKLL function.
+---
+
+# PERCENTILERAWKLL
+
+Raw variant of `PERCENTILEKLL`, which returns a Base64-encoded string of the KLLSketch object. The response can be deserialized back to a KLLSketch using the Apache DataSketches library and used for further analysis. For example, you can use this approach to calculate the CDF (Cumulative Distribution Function) or PMF (Probability Mass Function) of a dataset.
+
+## Signature
+
+> PercentileRawKLL(column, percentile, kValue) -> String
+
+* `column` (required): Name of the column to aggregate on. If the column is a multi-value column, use the `PERCENTILERAWKLLMV` variant.
+* `percentile` (required): Percentile value to be calculated, in the range [0..100]. For 'raw' versions of the function, this value is used for ordering (ORDER BY).
+
+## Usage Examples
+
+```sql
+select percentileRawKll(ArrDelayMinutes, 90) as sketch
+from airlineStats
+```
+
+| sketch   |
+| -------- |
+| BQEPC... |
diff --git a/configuration-reference/functions/percentilerawkllmv.md b/configuration-reference/functions/percentilerawkllmv.md
new file mode 100644
index 00000000..9432dbc1
--- /dev/null
+++ b/configuration-reference/functions/percentilerawkllmv.md
@@ -0,0 +1,26 @@
+---
+description: >-
+  This section contains reference documentation for the PERCENTILERAWKLLMV function.
+---
+
+# PERCENTILERAWKLLMV
+
+Variant of the `PERCENTILERAWKLL` aggregation function that accepts multi-value columns. Values in the given column are 'flattened' before aggregation.
+
+## Signature
+
+> PercentileRawKLLMV(column, percentile) -> String
+
+* `column` (required): Name of the column to aggregate on.
+* `percentile` (required): Percentile value to be calculated, in the range [0..100]. For 'raw' versions of the function, this value is used for ordering (ORDER BY).
+
+## Usage Examples
+
+```sql
+select percentileRawKLLMV(ArrOfInts, 90) as sketch
+from MyTable
+```
+
+| sketch   |
+| -------- |
+| BQEPC... |