Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric and add function_name label #401
Background
We want to calculate p50, p90, etc. using this metric, and the main challenge is picking a proper bucket/bin layout.
"To calculate a quantile on a histogram is actually an estimate where the error depends on the granularity of the histogram's bin widths. Being that the data in a histogram is naturally ordered, you know exactly what bin contains an arbitrary quantile. Prometheus (and many other tools, as it's about the only way we have) then estimates the correct value by doing linear approximation over the selected bin." — https://linuxczar.net/blog/2016/12/31/prometheus-histograms/
There is a good illustration of this at https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation.
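To make the estimation error concrete, here is a minimal Go sketch of the linear approximation that histogram_quantile performs within the bucket containing the target quantile. The bucket boundaries and counts below are made-up illustration values, not measurements from this PR.

```go
package main

import "fmt"

// bucket mirrors a cumulative Prometheus histogram bucket: count is the number
// of observations with a value <= upperBound (the "le" label).
type bucket struct {
	upperBound float64 // ms
	count      float64 // cumulative count
}

// estimateQuantile finds the bucket containing the q-th observation and then
// interpolates linearly between that bucket's lower and upper bound, which is
// essentially what Prometheus's histogram_quantile does.
func estimateQuantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// The same 500 hypothetical observations, bucketed two ways: the wide
	// 1000-5000 bucket forces the p90 estimate far away from the finer one.
	coarse := []bucket{{100, 10}, {1000, 240}, {5000, 500}}
	fine := []bucket{{100, 10}, {1000, 240}, {2000, 480}, {2500, 495}, {5000, 500}}
	fmt.Printf("p90 with coarse buckets: %.0f ms\n", estimateQuantile(0.90, coarse))
	fmt.Printf("p90 with finer buckets:  %.0f ms\n", estimateQuantile(0.90, fine))
}
```

With the coarse layout the p90 estimate lands around 4231 ms, while the finer layout puts it at 1875 ms, purely because of where the bucket boundaries sit relative to the data.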
What this PR does
This PR improves the bucket layout of this histogram metric, because to get accurate estimates we also need to define buckets that fit our distribution (or our own estimate of the distribution).
There's no easy way to define the "perfect" bucket layout, especially since we are running arbitrary code that users write, so we can't fully know what kind of workloads they will run. The Prometheus folks generally advise choosing relatively small bin widths according to the possible range and making sure to include your SLA (or other important numbers) as one of the boundaries, but this is still pretty vague.
Here are some heuristics I used to help (a code sketch of the resulting layout follows this list):
5, 10, 20, 40, 60, 80, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 2000, 5000, 10000, 20000, 50000, 100000
- The small buckets (10, 15, 20, 30) should take care of the fast executions with enough granularity.
- 40-100 has a bucket width of 20 each (40, 60, 80, 100).
- 100-300 has a width of 25.
- 300-1000 has a width of 50.
- 1000-2000 has a width of 100.
- The remaining buckets (2000, 2500, 3000, 3500, 4000, 5000, 10000, 20000, 40000, 60000) are just for catching really high tail latencies.
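To make the layout concrete, here is a rough Go sketch of how buckets like these could be generated and attached to the metric, including the new function_name label. The package name, metric name, and exact boundary list are assumptions inferred from the heuristics above, not copied from this PR's diff.

```go
// Sketch only: the package, metric name, label name, and exact boundary list are
// assumptions based on the heuristics described in this PR, not the actual diff.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// bucketRange is a span of latencies (in ms) that gets a fixed bucket width.
type bucketRange struct {
	start, end, width float64
}

// executionBuckets builds the boundaries following the heuristics above: fine
// granularity at the low end, progressively wider buckets as latencies grow,
// and a coarse tail for very slow executions.
func executionBuckets() []float64 {
	buckets := []float64{10, 15, 20, 30, 40}
	for _, r := range []bucketRange{
		{40, 100, 20},     // 40-100, width 20
		{100, 300, 25},    // 100-300, width 25
		{300, 1000, 50},   // 300-1000, width 50
		{1000, 2000, 100}, // 1000-2000, width 100
	} {
		for b := r.start + r.width; b <= r.end; b += r.width {
			buckets = append(buckets, b)
		}
	}
	// Coarse buckets that only exist to catch very high tail latencies.
	return append(buckets, 2500, 3000, 3500, 4000, 5000, 10000, 20000, 40000, 60000)
}

// FunctionExecutionDurationMilliseconds models the histogram this PR describes,
// labeled by function name so quantiles can be estimated per function.
var FunctionExecutionDurationMilliseconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "function_execution_duration_milliseconds",
		Help:    "Duration of function executions in milliseconds.",
		Buckets: executionBuckets(),
	},
	[]string{"function_name"},
)

func init() {
	prometheus.MustRegister(FunctionExecutionDurationMilliseconds)
}
```

This produces 49 boundaries, which matches the per-function series count used in the Scalability section below. Recording a duration would then look like FunctionExecutionDurationMilliseconds.WithLabelValues(fnName).Observe(float64(elapsed.Milliseconds())), where fnName and elapsed are placeholders.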
Testing
I tested this using the generateText function from the textgeneration example in functions-as, as it is going to be one of the most common workloads (calling an LLM over the network) that people run. I made 500 calls to the function, then measured the quantiles (in ms) in three ways:
"Actual" is based on the sorted list of durations on the client, "summary metric" is the one introduced in #377, and "improved histogram metric" is the one from this PR.
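For reference, the "actual" numbers are simply read off the sorted list of client-side durations; a minimal version of that calculation could look like the following (the nearest-rank method here is an assumption, the client may have computed it slightly differently).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// exactQuantile returns the q-th quantile of the observed durations (in ms)
// using the nearest-rank method on a sorted copy of all observations. This is
// exact, but it requires keeping every observation in memory.
func exactQuantile(q float64, durations []float64) float64 {
	sorted := append([]float64(nil), durations...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(q*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// Hypothetical client-side durations; the real test used 500 calls to generateText.
	durations := []float64{95, 120, 180, 260, 340, 410, 2050}
	fmt.Printf("p50 = %.0f ms, p90 = %.0f ms\n",
		exactQuantile(0.50, durations), exactQuantile(0.90, durations))
}
```

Keeping that full sorted list is what makes the "actual" numbers exact, and also why neither the summary nor the histogram can match them perfectly.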
Comparing the actual numbers and the summary metric (a summary is supposed to be more accurate than a histogram, with the major downside of not being aggregatable), we can see that even the summary is not exact, because there really is no way to be fully accurate other than maintaining a sorted list of all observations (which takes a lot of memory). It's always going to be a trade-off between space and accuracy.
Next, comparing the improved histogram against the summary, the results are quite reasonable, within 1-2% error, other than the last one. That's because our buckets are not granular enough in that area: after 2000, the next boundary is at 2500, so 2047 is the result of linear approximation within that bucket. If we were to define a new bucket at 2100 or 2250, for example, we would most likely get a more accurate estimate. But again, it's a trade-off between space and accuracy.
Scalability
You might ask, why don't we define more granular buckets? Well, the Prometheus best-practices documentation states that the maximum cardinality of a metric should be about 10 unique label/value pairs, because each label set is an additional time series that has RAM, CPU, disk, and network costs.
In our case, the cardinality of this metric is 49 x the number of functions. So assuming each user only has 2 functions on average, it's still within an order of magnitude of that guideline, so I think it should be fine.
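As a rough sanity check on that estimate (treating 49 as the per-function series count for this metric and ignoring anything else the exporter adds, which is a simplification on my part):

```go
package main

import "fmt"

func main() {
	const seriesPerFunction = 49 // per-function series count for this metric, from the layout above

	// The "2 functions on average" figure is the assumption from the text above;
	// the other counts just show how the series count scales.
	for _, functions := range []int{1, 2, 5, 10} {
		fmt.Printf("%2d functions -> %4d time series\n", functions, functions*seriesPerFunction)
	}
}
```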