
Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric and add function_name label #401

Merged
5 commits merged into main on Sep 30, 2024

Conversation

kevinmingtarja
Member

@kevinmingtarja kevinmingtarja commented Sep 30, 2024

Background

We want to calculate p50, p90, etc. using this metric, and the main challenge is picking a proper bucket/bin layout.

To calculate a quantile on a histogram is actually an estimate where the error depends on the granularity of the histogram's bin widths. Being that the data in a histogram is naturally ordered, you know exactly what bin contains an arbitrary quantile. Prometheus (and many other tools, as it's about the only way we have) then estimates the correct value by doing linear approximation over the selected bin. - https://linuxczar.net/blog/2016/12/31/prometheus-histograms/

This is a good illustration from https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation:

Imagine your usual request durations are almost all very close to 220ms, or in other words, if you could plot the "true" histogram, you would see a very sharp spike at 220ms. In the Prometheus histogram metric as configured above, almost all observations, and therefore also the 95th percentile, will fall into the bucket labeled {le="0.3"}, i.e. the bucket from 200ms to 300ms. The histogram implementation guarantees that the true 95th percentile is somewhere between 200ms and 300ms. To return a single value (rather than an interval), it applies linear interpolation, which yields 295ms in this case. The calculated quantile gives you the impression that you are close to breaching the SLO, but in reality, the 95th percentile is a tiny bit above 220ms, a quite comfortable distance to your SLO.
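The interpolation step described above can be sketched in a few lines of Python. This is a simplified sketch of what `histogram_quantile` does, not the actual PromQL implementation, and the bucket counts below are just the doc's 220ms thought experiment:

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q via linear interpolation.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Mirrors the spirit of Prometheus's histogram_quantile(), not its exact code.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # the quantile falls in this bucket; interpolate linearly within it
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# The docs' example: all 1000 observations (true value ~220ms) land in the
# 200-300ms bucket, so the p95 estimate is pulled up to 295ms.
buckets = [(100, 0), (200, 0), (300, 1000), (500, 1000)]
print(histogram_quantile(0.95, buckets))  # 295.0
```

The estimate is always somewhere inside the bucket that contains the true quantile, which is why bucket width bounds the worst-case error.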

What this PR does

This PR improves the bucket layout of this histogram metric, because to get accurate estimates we also need to define buckets that fit our distribution (or our best estimate of the distribution).

There's no easy way to define the "perfect" bucket layout, especially since we run arbitrary code that users write, so we can't fully know what kinds of workloads they will run. The Prometheus folks generally advise choosing relatively small bin widths according to the possible range and including your SLA (or other important numbers) as one of the boundaries, but this is still pretty vague.

Here are some heuristics I used to help:

  1. I used knative's distribution as a starting point, as it is also a "serverless" platform running arbitrary user code, so it's pretty similar to us in that sense. They use the following values: 5, 10, 20, 40, 60, 80, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 2000, 5000, 10000, 20000, 50000, 100000
  2. I looked at some common workloads running on Hypermode so far. For example, a call to embed text using minilm takes 10-30ms, so our smallest buckets (10, 15, 20, 30) cover that range with enough granularity.
  3. I expect most functions to take <= 1s or 2s, so we focus our buckets on covering that range with high granularity:
    a. 40-100 has a bucket width of 20 (40, 60, 80, 100)
    b. 100-300 has a width of 25
    c. 300-1000 has a width of 50
    d. 1000-2000 has a width of 100
  4. The rest (2000, 2500, 3000, 3500, 4000, 5000, 10000, 20000, 40000, 60000) are there to catch really high tail latencies.
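The heuristics above can be turned into code. This is a reconstruction of the layout from the widths described in the list; the exact slice merged in the PR may differ slightly:

```python
# Bucket boundaries (in ms) reconstructed from the heuristics above.
# The "rest" list in the description starts at 2000, which duplicates the
# end of the 1000-2000 range, so the duplicate is dropped here.
buckets = (
    [10, 15, 20, 30]                 # smallest buckets, e.g. minilm embedding calls (10-30ms)
    + list(range(40, 101, 20))       # 40-100ms, width 20
    + list(range(125, 301, 25))      # 100-300ms, width 25
    + list(range(350, 1001, 50))     # 300-1000ms, width 50
    + list(range(1100, 2001, 100))   # 1000-2000ms, width 100
    + [2500, 3000, 3500, 4000, 5000, 10000, 20000, 40000, 60000]  # tail latencies
)

assert buckets == sorted(buckets)  # boundaries must be strictly increasing
print(len(buckets))  # 49
```

The count of 49 boundaries is what drives the cardinality figure in the Scalability section.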

Testing

I tested this using the generateText function from the textgeneration example in functions-as, as calling an LLM over the network is going to be one of the most common workloads people run.

I made 500 calls to the function, then measured the quantiles (in ms) in three ways:

| Percentile | Actual | Summary metric | Improved histogram metric | Error (%) |
|------------|--------|----------------|---------------------------|-----------|
| p50        | 738    | 711            | 701                       | -1.41     |
| p75        | 915    | 850            | 847                       | -0.35     |
| p90        | 1104   | 962            | 978                       | +1.66     |
| p99        | 2192   | 2181           | 2047                      | -6.14     |

Actual is based on the sorted list of durations on the client, summary metric is the one introduced in #377, and improved histogram metric is the one from this PR.

Comparing the actual values with the summary metric (a summary is supposed to be more accurate than a histogram, with the major downside of not being aggregatable), we can see that even the summary is not fully accurate, because there is no way to be exact other than maintaining a sorted list of all observations, which costs a lot of memory. It's always going to be a trade-off between space and accuracy.

Next, comparing the improved histogram against the summary, the results are quite reasonable, within 1-2% error, except for the last row. Our buckets are not granular enough in that area: after 2000, the next boundary is at 2500, so 2047 is the result of linear approximation within that bucket. If we defined an extra boundary at 2100 or 2250, for example, we would most likely get a more accurate estimate. But again, it's a trade-off between space and accuracy.
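To see how an extra boundary would help, here is a toy calculation. The per-bucket counts below are invented for illustration (only the 500-call total, the bucket bounds, and the true p99 of 2192ms come from the measurements above):

```python
def interp(q, buckets):
    # buckets: sorted list of (upper_bound_ms, cumulative_count);
    # simplified sketch of Prometheus's linear interpolation.
    total = buckets[-1][1]
    rank = q * total
    lo_bound, lo_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            return lo_bound + (bound - lo_bound) * (rank - lo_count) / (count - lo_count)
        lo_bound, lo_count = bound, count
    return buckets[-1][0]

true_p99 = 2192  # measured on the client, from the table above

# Hypothetical counts: 490 of 500 calls done by 2000ms, 8 more by 2500ms.
coarse = [(2000, 490), (2500, 498), (60000, 500)]
# Same hypothetical data with an extra boundary at 2250ms splitting that bucket 6/2.
fine = [(2000, 490), (2250, 496), (2500, 498), (60000, 500)]

est_coarse = interp(0.99, coarse)  # 2312.5
est_fine = interp(0.99, fine)      # ~2208.3
assert abs(est_fine - true_p99) < abs(est_coarse - true_p99)
```

The finer boundary shrinks the interpolation span from 500ms to 250ms, halving the worst-case error in that region, at the cost of one more time series per function.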

Scalability

You might ask: why don't we just define very granular buckets? The Prometheus best practices documentation states that the cardinality of a metric should stay around 10 unique label/value pairs, because each labelset is an additional time series with RAM, CPU, disk, and network costs.

In our case, the cardinality of this metric is 49 x the number of functions. Assuming each user has only 2 functions on average, that's still within an order of magnitude of the guideline, so it should be fine.
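The way a `function_name` label multiplies series can be sketched with a toy labeled histogram. This is a pure-Python stand-in for a Prometheus HistogramVec, not the actual client library, and the function names and shortened bucket list are hypothetical:

```python
import bisect
from collections import defaultdict

# Shortened bucket list for the example; the real metric has 49 boundaries.
bounds = [10, 20, 50, 100, float("inf")]

# Toy labeled histogram: (function_name, upper_bound) -> cumulative count.
counts = defaultdict(int)

def observe(function_name, duration_ms):
    # Prometheus buckets are cumulative ("le"): every bucket whose upper
    # bound is >= the observation gets incremented.
    for b in bounds[bisect.bisect_left(bounds, duration_ms):]:
        counts[(function_name, b)] += 1

observe("generateText", 42)  # hypothetical function names
observe("embedText", 12)

# Each labeled function contributes its own full set of bucket series.
names = {name for name, _ in counts}
print(len(names) * len(bounds))  # 2 functions x 5 buckets = 10 series
```

With the real 49-boundary layout, the same arithmetic gives 49 series per function, which is where the "49 x number of functions" figure comes from.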

@kevinmingtarja kevinmingtarja requested a review from a team as a code owner September 30, 2024 07:26
@kevinmingtarja kevinmingtarja changed the title Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric and add function_name label; Sep 30, 2024
@kevinmingtarja kevinmingtarja changed the title Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric and add function_name label; Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric and add function_name label Sep 30, 2024
@kevinmingtarja kevinmingtarja enabled auto-merge (squash) September 30, 2024 07:27
@kevinmingtarja kevinmingtarja merged commit 41c8368 into main Sep 30, 2024
4 checks passed
@kevinmingtarja kevinmingtarja deleted the kevinm/hyp-2285-improve-bucket-boundaries-for branch September 30, 2024 15:33