Merge pull request #206 from cbalci/percentil-kll-docs
Add documentation for PercentileKLL and its variants
mayankshriv authored Aug 5, 2023
2 parents ef61ab1 + cc865dd commit a6d2f2f
Showing 6 changed files with 146 additions and 0 deletions.
4 changes: 4 additions & 0 deletions SUMMARY.md
@@ -303,6 +303,10 @@
* [percentilemv](configuration-reference/functions/percentilemv.md)
* [percentiletdigest](configuration-reference/functions/percentiletdigest.md)
* [percentiletdigestmv](configuration-reference/functions/percentiletdigestmv.md)
* [percentilekll](configuration-reference/functions/percentilekll.md)
* [percentilerawkll](configuration-reference/functions/percentilerawkll.md)
* [percentilekllmv](configuration-reference/functions/percentilekllmv.md)
* [percentilerawkllmv](configuration-reference/functions/percentilerawkllmv.md)
* [quarter](configuration-reference/functions/quarter.md)
* [regexpExtract](configuration-reference/functions/regexpextract.md)
* [regexpReplace](configuration-reference/functions/regexpreplace.md)
16 changes: 16 additions & 0 deletions configuration-reference/functions/README.md
@@ -375,6 +375,22 @@ This page contains reference documentation for functions in Apache Pinot.
[percentilemv.md](percentilemv.md)
{% endcontent-ref %}

{% content-ref url="percentilekll.md" %}
[percentilekll.md](percentilekll.md)
{% endcontent-ref %}

{% content-ref url="percentilerawkll.md" %}
[percentilerawkll.md](percentilerawkll.md)
{% endcontent-ref %}

{% content-ref url="percentilekllmv.md" %}
[percentilekllmv.md](percentilekllmv.md)
{% endcontent-ref %}

{% content-ref url="percentilerawkllmv.md" %}
[percentilerawkllmv.md](percentilerawkllmv.md)
{% endcontent-ref %}

{% content-ref url="quarter.md" %}
[quarter.md](quarter.md)
{% endcontent-ref %}
47 changes: 47 additions & 0 deletions configuration-reference/functions/percentilekll.md
@@ -0,0 +1,47 @@
---
description: >-
This section contains reference documentation for the PERCENTILEKLL function.
---

# PERCENTILEKLL

`KLL Sketch` is an approximate quantiles algorithm that targets optimal space for a given accuracy. `PERCENTILEKLL` is a percentile aggregation function based on the Apache DataSketches [KLL Doubles Sketch](https://datasketches.apache.org/docs/KLL/KLLSketch.html) implementation.

Pinot also offers a 'raw' variant, `PERCENTILERAWKLL`, which returns the serialized sketch. The sketch can then be used to calculate a 'rank' or 'histogram'.

All variants of `PercentileKLL` also support raw sketches stored in Pinot columns. This means you can create KLL Doubles sketches outside of Pinot and ingest them into columns as binary strings. `PercentileKLL` identifies these columns and merges the sketches to produce aggregate results.

## Signature

> PercentileKLL(column, percentile, kValue) -> Double
* `column` (required): Name of the column to aggregate on. If the column is a multi-value column, use the `PERCENTILEKLLMV` variant.
* `percentile` (required): Percentile to calculate, in the range [0..100].
* `kValue` (optional): Integer that determines the size of the sketch. The default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).
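
To make "normalized rank error" concrete, here is a minimal Python sketch. It does not use Pinot or the DataSketches library; the helper names `normalized_rank` and `rank_error` and the sample data are assumptions made for illustration. It measures how far an approximate percentile answer sits from the requested rank in an exactly sorted dataset.

```python
from bisect import bisect_right


def normalized_rank(sorted_values, x):
    """Fraction of values in the sorted list that are <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def rank_error(sorted_values, estimate, percentile):
    """Absolute gap between the estimate's true rank and the requested rank."""
    return abs(normalized_rank(sorted_values, estimate) - percentile / 100.0)


# Hypothetical data: the integers 1..1000, already sorted.
values = list(range(1, 1001))

# Suppose an approximate algorithm answered 910 for the 90th percentile.
approx_p90 = 910
error = rank_error(values, approx_p90, 90)

# The answer sits at rank 0.91 instead of 0.90: a rank error of about 1%,
# which is inside the roughly 1.65% bound quoted for the default kValue.
print(f"rank error: {error:.4f}")
```

The error is measured on ranks, not on values: a sketch with a small rank error can still return a value far from the exact percentile when the data is sparse around that rank.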

## Usage Examples

```sql
select percentileKLL(ArrDelayMinutes, 90) as DelayP90
from airlineStats
```

| DelayP90 |
| -------- |
| 40 |

```sql
select Carrier, percentileKll(ArrDelay, 50, 600) as MedianDelay
from airlineStats
where ArrDelay > 0
group by Carrier
order by 2 desc
limit 3
```

| Carrier | MedianDelay |
| ------- | ----------- |
| MQ | 28 |
| B6 | 28 |
| EV | 24 |

27 changes: 27 additions & 0 deletions configuration-reference/functions/percentilekllmv.md
@@ -0,0 +1,27 @@
---
description: >-
This section contains reference documentation for the PERCENTILEKLLMV function.
---

# PERCENTILEKLLMV

Variant of the `PERCENTILEKLL` aggregation function which accepts multi-value columns. Values in the given column are 'flattened' before aggregation, so the function will produce a single value for the given percentile.
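
The flattening step can be illustrated with a short Python sketch. It computes an exact nearest-rank percentile rather than a KLL approximation, and the function name `flattened_percentile` and the sample rows are assumptions made for illustration, not Pinot internals.

```python
import math
from itertools import chain


def flattened_percentile(rows, percentile):
    """Exact nearest-rank percentile over all values of a multi-value column."""
    # Flatten: every value from every row goes into a single sorted list.
    flat = sorted(chain.from_iterable(rows))
    if not flat:
        raise ValueError("no values to aggregate")
    # Nearest-rank definition: 1-based rank ceil(p / 100 * n), at least 1.
    rank = max(1, math.ceil(percentile * len(flat) / 100))
    return flat[rank - 1]


# Hypothetical multi-value column: each inner list is one row.
rows = [[10, 20], [30], [40, 50, 60], [70, 80, 90, 100]]

# One value is produced for the whole column, not one per row.
p90 = flattened_percentile(rows, 90)
print(p90)
```

Row boundaries play no role in the result: the same values split across rows differently would yield the same percentile.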

## Signature

> PercentileKLLMV(column, percentile, kValue) -> Double
* `column` (required): Name of the multi-value column to aggregate on.
* `percentile` (required): Percentile to calculate, in the range [0..100].
* `kValue` (optional): Integer that determines the size of the sketch. The default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).

## Usage Examples

```sql
select percentileKLLMV(ArrOfInts, 90) as value
from MyTable
```

| value |
| ------ |
| 40 |
26 changes: 26 additions & 0 deletions configuration-reference/functions/percentilerawkll.md
@@ -0,0 +1,26 @@
---
description: >-
This section contains reference documentation for the PERCENTILERAWKLL function.
---

# PERCENTILERAWKLL

Raw variant of `PERCENTILEKLL` which returns a Base64-encoded string of the KLLSketch object. The response can be deserialized back to a KLLSketch using the Apache DataSketches library and used for further analysis. For example, you can use this approach to calculate the CDF (cumulative distribution function) or PMF (probability mass function) of a dataset.
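
As a minimal sketch of the first client-side step, the Base64 string can be decoded back into the serialized sketch bytes with the Python standard library. The helper name `decode_raw_sketch` is hypothetical, the standard Base64 alphabet is assumed, and the payload below is a stand-in rather than a real sketch. Turning the bytes into a sketch object requires the DataSketches library and is not shown.

```python
import base64


def decode_raw_sketch(encoded):
    """Decode a Base64 response string into the serialized sketch bytes."""
    # validate=True rejects characters outside the standard Base64 alphabet.
    return base64.b64decode(encoded, validate=True)


# Stand-in payload for illustration only; these bytes are NOT a valid KLL sketch.
placeholder_bytes = bytes([1, 2, 3, 4])
encoded = base64.b64encode(placeholder_bytes).decode("ascii")

# In practice, `encoded` would be the string returned in the query response.
decoded = decode_raw_sketch(encoded)
print(len(decoded), "bytes ready to hand to a DataSketches deserializer")
```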

## Signature

> PercentileRawKLL(column, percentile, kValue) -> String
* `column` (required): Name of the column to aggregate on. If the column is a multi-value column, use the `PERCENTILERAWKLLMV` variant.
* `percentile` (required): Percentile to calculate, in the range [0..100]. For 'raw' versions of the function, this value is used for ordering (ORDER BY).

## Usage Examples

```sql
select percentileRawKll(ArrDelayMinutes, 90) as sketch
from airlineStats
```

| sketch |
| -------- |
| BQEPC... |
26 changes: 26 additions & 0 deletions configuration-reference/functions/percentilerawkllmv.md
@@ -0,0 +1,26 @@
---
description: >-
This section contains reference documentation for the PERCENTILERAWKLLMV function.
---

# PERCENTILERAWKLLMV

Variant of the `PERCENTILERAWKLL` aggregation function which accepts multi-value columns. Values in the given column are 'flattened' before aggregation.

## Signature

> PercentileRawKLLMV(column, percentile) -> String
* `column` (required): Name of the column to aggregate on.
* `percentile` (required): Percentile to calculate, in the range [0..100]. For 'raw' versions of the function, this value is used for ordering (ORDER BY).

## Usage Examples

```sql
select percentileRawKLLMV(ArrOfInts, 90) as sketch
from MyTable
```

| sketch |
| -------- |
| BQEPC... |
