Add documentation for PercentileKLL and its variants #206

Merged
merged 2 commits into from
Aug 5, 2023
Changes from 1 commit
4 changes: 4 additions & 0 deletions SUMMARY.md
@@ -303,6 +303,10 @@
* [percentilemv](configuration-reference/functions/percentilemv.md)
* [percentiletdigest](configuration-reference/functions/percentiletdigest.md)
* [percentiletdigestmv](configuration-reference/functions/percentiletdigestmv.md)
* [percentilekll](configuration-reference/functions/percentilekll.md)
* [percentilerawkll](configuration-reference/functions/percentilerawkll.md)
* [percentilekllmv](configuration-reference/functions/percentilekllmv.md)
* [percentilerawkllmv](configuration-reference/functions/percentilerawkllmv.md)
* [quarter](configuration-reference/functions/quarter.md)
* [regexpExtract](configuration-reference/functions/regexpextract.md)
* [regexpReplace](configuration-reference/functions/regexpreplace.md)
16 changes: 16 additions & 0 deletions configuration-reference/functions/README.md
@@ -375,6 +375,22 @@ This page contains reference documentation for functions in Apache Pinot.
[percentilemv.md](percentilemv.md)
{% endcontent-ref %}

{% content-ref url="percentilekll.md" %}
[percentilekll.md](percentilekll.md)
{% endcontent-ref %}

{% content-ref url="percentilerawkll.md" %}
[percentilerawkll.md](percentilerawkll.md)
{% endcontent-ref %}

{% content-ref url="percentilekllmv.md" %}
[percentilekllmv.md](percentilekllmv.md)
{% endcontent-ref %}

{% content-ref url="percentilerawkllmv.md" %}
[percentilerawkllmv.md](percentilerawkllmv.md)
{% endcontent-ref %}

{% content-ref url="quarter.md" %}
[quarter.md](quarter.md)
{% endcontent-ref %}
47 changes: 47 additions & 0 deletions configuration-reference/functions/percentilekll.md
@@ -0,0 +1,47 @@
---
description: >-
This section contains reference documentation for the PERCENTILEKLL function.
---

# PERCENTILEKLL

`KLL Sketch` is an approximate quantiles algorithm that targets optimal space for a given accuracy. `PERCENTILEKLL` is a percentile aggregation function based on the Apache DataSketches [KLL Doubles Sketch](https://datasketches.apache.org/docs/KLL/KLLSketch.html) implementation.

Pinot also offers a 'raw' variant, `PERCENTILERAWKLL`, which returns the serialized sketch. The serialized sketch can be used to calculate a 'rank' or 'histogram'.

All of the variants of `PercentileKLL` also support raw sketches in Pinot columns. This means you can create KLL Doubles sketches outside of Pinot and ingest them into columns as binary strings. `PercentileKLL` will identify these columns and merge them to produce aggregate results.

## Signature

> PercentileKLL(column, percentile, kValue) -> Double

* `column` (required): Name of the column to aggregate on. If the column is a multi-value column, use the `PERCENTILEKLLMV` variant.
* `percentile` (required): Percentile value to be calculated [0..100]
* `kValue`: Integer value which determines the size of the sketch. Default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).
Contributor

nit: s/defails/details/ in all 4 .md files

Contributor Author

Good catch, thanks! Fixed.
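
As a rough illustration of what the "normalized rank error of about 1.65%" quoted for the default `kValue` means in practice, the following minimal sketch turns a requested percentile into the band of true percentiles the returned value could fall in. It is plain arithmetic, not Pinot or DataSketches code; the helper names `percentile_bounds` and `row_bounds` are hypothetical, and the error bound is probabilistic rather than guaranteed.

```python
def percentile_bounds(percentile: float, rank_error: float = 0.0165):
    """Return the (low, high) true-percentile band for a requested percentile.

    rank_error is the normalized rank error as a fraction (0.0165 is the
    value the docs quote for the default kValue of 200).
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be in [0, 100]")
    margin = rank_error * 100.0
    low = max(0.0, percentile - margin)
    high = min(100.0, percentile + margin)
    return low, high


def row_bounds(percentile: float, n_rows: int, rank_error: float = 0.0165):
    """Express the same band as approximate row positions in sorted order."""
    low, high = percentile_bounds(percentile, rank_error)
    return round(low / 100.0 * n_rows), round(high / 100.0 * n_rows)


if __name__ == "__main__":
    # A P90 query over one million rows with the default error assumption.
    print(percentile_bounds(90))
    print(row_bounds(90, 1_000_000))
```

A larger `kValue` shrinks this band at the cost of a larger sketch, which is the trade-off shown in the linked accuracy vs size chart.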


## Usage Examples

```sql
select percentileKLL(ArrDelayMinutes, 90) as DelayP90
from airlineStats
```

| DelayP90 |
| -------- |
| 40 |

```sql
select Carrier, percentileKll(ArrDelay, 50, 600) as MedianDelay
from airlineStats
where ArrDelay > 0
group by Carrier
order by 2 desc
limit 3
```

| Carrier | MedianDelay |
| ------- | ----------- |
| MQ | 28 |
| B6 | 28 |
| EV | 24 |

27 changes: 27 additions & 0 deletions configuration-reference/functions/percentilekllmv.md
@@ -0,0 +1,27 @@
---
description: >-
This section contains reference documentation for the PERCENTILEKLLMV function.
---

# PERCENTILEKLLMV

Variant of the `PERCENTILEKLL` aggregation function which accepts multi-value columns. Values in the given column are 'flattened' before aggregation, so the function will produce a single value for the given percentile.
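
To make the 'flattened' behaviour concrete, here is a minimal sketch of the idea in plain Python. It is an assumption-laden illustration: it computes an exact nearest-rank percentile over the flattened values rather than using a KLL sketch, and the names `flatten` and `exact_percentile` are hypothetical helpers, not Pinot functions.

```python
import math
from itertools import chain


def flatten(rows):
    """Concatenate the multi-value entries of every row into one list."""
    return list(chain.from_iterable(rows))


def exact_percentile(values, percentile):
    """Exact nearest-rank percentile, used here as a stand-in for the sketch."""
    if not values:
        raise ValueError("no values to aggregate")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be in [0, 100]")
    ordered = sorted(values)
    # Nearest-rank: smallest value whose rank covers the requested fraction.
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Three rows of a multi-value column; row boundaries are ignored.
    rows = [[10, 20], [30], [40, 50, 60]]
    flat = flatten(rows)
    print(flat)
    print(exact_percentile(flat, 50))
```

The point of the example is that the aggregation sees six individual values, not three rows, so a single percentile is produced for the whole column.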

## Signature

> PercentileKLLMV(column, percentile, kValue) -> Double

* `column` (required): Name of the column to aggregate on.
* `percentile` (required): Percentile value to be calculated [0..100]
* `kValue`: Integer value which determines the size of the sketch. Default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).

## Usage Examples

```sql
select percentileKLLMV(ArrOfInts, 90) as value
from MyTable
```

| value |
| ------ |
| 40 |
27 changes: 27 additions & 0 deletions configuration-reference/functions/percentilerawkll.md
@@ -0,0 +1,27 @@
---
description: >-
This section contains reference documentation for the PERCENTILERAWKLL function.
---

# PERCENTILERAWKLL

Raw variant of `PERCENTILEKLL` which returns a Base64 encoded string of the KLLSketch object. The response can be deserialized back to a KLLSketch using the Apache DataSketches library and used for further analysis. For example, you can use this approach to calculate the CDF (Cumulative Distribution Function) or PMF (Probability Mass Function) of a dataset.
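
The first step on the client side is turning the Base64 response back into bytes. The sketch below shows only that step with the Python standard library; `decode_sketch` is a hypothetical helper name, and rebuilding an actual sketch object from the bytes would require the DataSketches library, which is not shown here. The input `"BQEP"` is just the first four characters of a response like the one in the usage example, so the result is a prefix, not a usable sketch.

```python
import base64


def decode_sketch(encoded: str) -> bytes:
    """Decode a Base64 string returned by the raw variant into raw bytes.

    validate=True rejects characters outside the Base64 alphabet instead of
    silently skipping them.
    """
    return base64.b64decode(encoded, validate=True)


if __name__ == "__main__":
    # Only a 4-character prefix is decoded here; a real response is much
    # longer and must be decoded in full before deserialization.
    prefix = decode_sketch("BQEP")
    print(list(prefix))  # [5, 1, 15]
```

In the DataSketches serialization layout the leading bytes are typically a preamble size, a serialization version, and a family ID, but treat that reading as an assumption and rely on the library's own deserializer rather than parsing the bytes by hand.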

## Signature

> PercentileRawKLL(column, percentile, kValue) -> String

* `column` (required): Name of the column to aggregate on. If the column is a multi-value column, use the `PERCENTILERAWKLLMV` variant.
* `percentile` (required): Percentile value to be calculated [0..100]
* `kValue`: Integer value which determines the size of the sketch. Default value is `200` which corresponds to a normalized rank error of about 1.65%.

## Usage Examples

```sql
select percentileRawKll(ArrDelayMinutes, 90) as sketch
from airlineStats
```

| sketch |
| -------- |
| BQEPC... |
27 changes: 27 additions & 0 deletions configuration-reference/functions/percentilerawkllmv.md
@@ -0,0 +1,27 @@
---
description: >-
This section contains reference documentation for the PERCENTILERAWKLLMV function.
---

# PERCENTILERAWKLLMV

Variant of the `PERCENTILERAWKLL` aggregation function which accepts multi-value columns. Values in the given column are 'flattened' before aggregation.

## Signature

> PercentileRawKLLMV(column, percentile, kValue) -> String

* `column` (required): Name of the column to aggregate on.
* `percentile` (required): Percentile value to be calculated [0..100]
* `kValue`: Integer value which determines the size of the sketch. Default value is `200`, which corresponds to a normalized rank error of about 1.65%. For details, see the [accuracy vs size chart](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).

## Usage Examples

```sql
select percentileRawKLLMV(ArrOfInts, 90) as sketch
from MyTable
```

| sketch |
| -------- |
| BQEPC... |