Improve bucket layout for FunctionExecutionDurationMilliseconds histogram metric and add function_name label #401
Background
We want to calculate p50, p90, etc. using this metric, and the main challenge is picking a proper bucket/bin layout.
"To calculate a quantile on a histogram is actually an estimate where the error depends on the granularity of the histogram's bin widths. Being that the data in a histogram is naturally ordered, you know exactly what bin contains an arbitrary quantile. Prometheus (and many other tools, as it's about the only way we have) then estimates the correct value by doing linear approximation over the selected bin." — https://linuxczar.net/blog/2016/12/31/prometheus-histograms/
There is a good illustration of this at https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation.
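To make the estimation error concrete, here is a minimal Go sketch of the linear approximation that histogram_quantile performs within the bucket containing the target quantile. The bucket boundaries and counts below are made-up illustration values, not measurements from this PR.

```go
package main

import "fmt"

// bucket mirrors a cumulative Prometheus histogram bucket: count is the number
// of observations with a value <= upperBound (the "le" label).
type bucket struct {
	upperBound float64 // ms
	count      float64 // cumulative count
}

// estimateQuantile finds the bucket containing the q-th observation and then
// interpolates linearly between that bucket's lower and upper bound, which is
// essentially what Prometheus's histogram_quantile does.
func estimateQuantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// The same 500 hypothetical observations, bucketed two ways: the wide
	// 1000-5000 bucket forces the p90 estimate far away from the finer one.
	coarse := []bucket{{100, 10}, {1000, 240}, {5000, 500}}
	fine := []bucket{{100, 10}, {1000, 240}, {2000, 480}, {2500, 495}, {5000, 500}}
	fmt.Printf("p90 with coarse buckets: %.0f ms\n", estimateQuantile(0.90, coarse))
	fmt.Printf("p90 with finer buckets:  %.0f ms\n", estimateQuantile(0.90, fine))
}
```

With the coarse layout the p90 estimate lands around 4231 ms, while the finer layout puts it at 1875 ms, purely because of where the bucket boundaries sit relative to the data.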
What this PR does
This PR improves the bucket layout of this histogram metric, because to get accurate estimates we also need to define buckets that fit our distribution (or our own estimate of the distribution).
There's no easy way to define the "perfect" bucket layout, especially since we are running arbitrary code that users write, so we can't fully know what kind of workloads they will run. The Prometheus folks generally advise choosing relatively small bin widths according to the possible range and making sure to include your SLA (or other important numbers) as one of the boundaries, but this is still pretty vague.
Here are some heuristics I used to help (a code sketch of the resulting layout follows this list):
5, 10, 20, 40, 60, 80, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 2000, 5000, 10000, 20000, 50000, 100000
- The small buckets (10, 15, 20, 30) should take care of the fast executions with enough granularity.
- 40-100 has a bucket width of 20 each (40, 60, 80, 100).
- 100-300 has a width of 25.
- 300-1000 has a width of 50.
- 1000-2000 has a width of 100.
- The remaining buckets (2000, 2500, 3000, 3500, 4000, 5000, 10000, 20000, 40000, 60000) are just for catching really high tail latencies.
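To make the layout concrete, here is a rough Go sketch of how buckets like these could be generated and attached to the metric, including the new function_name label. The package name, metric name, and exact boundary list are assumptions inferred from the heuristics above, not copied from this PR's diff.

```go
// Sketch only: the package, metric name, label name, and exact boundary list are
// assumptions based on the heuristics described in this PR, not the actual diff.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// bucketRange is a span of latencies (in ms) that gets a fixed bucket width.
type bucketRange struct {
	start, end, width float64
}

// executionBuckets builds the boundaries following the heuristics above: fine
// granularity at the low end, progressively wider buckets as latencies grow,
// and a coarse tail for very slow executions.
func executionBuckets() []float64 {
	buckets := []float64{10, 15, 20, 30, 40}
	for _, r := range []bucketRange{
		{40, 100, 20},     // 40-100, width 20
		{100, 300, 25},    // 100-300, width 25
		{300, 1000, 50},   // 300-1000, width 50
		{1000, 2000, 100}, // 1000-2000, width 100
	} {
		for b := r.start + r.width; b <= r.end; b += r.width {
			buckets = append(buckets, b)
		}
	}
	// Coarse buckets that only exist to catch very high tail latencies.
	return append(buckets, 2500, 3000, 3500, 4000, 5000, 10000, 20000, 40000, 60000)
}

// FunctionExecutionDurationMilliseconds models the histogram this PR describes,
// labeled by function name so quantiles can be estimated per function.
var FunctionExecutionDurationMilliseconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "function_execution_duration_milliseconds",
		Help:    "Duration of function executions in milliseconds.",
		Buckets: executionBuckets(),
	},
	[]string{"function_name"},
)

func init() {
	prometheus.MustRegister(FunctionExecutionDurationMilliseconds)
}
```

This produces 49 boundaries, which matches the per-function series count used in the Scalability section below. Recording a duration would then look like FunctionExecutionDurationMilliseconds.WithLabelValues(fnName).Observe(float64(elapsed.Milliseconds())), where fnName and elapsed are placeholders.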
Testing
I tested this using the generateText function from the textgeneration example in functions-as, as it is going to be one of the most common workloads (calling an LLM over the network) that people run. I made 500 calls to the function, then measured the quantiles (in ms) in three ways:
"Actual" is based on the sorted list of durations on the client, "summary metric" is the one introduced in #377, and "improved histogram metric" is the one from this PR.
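For reference, the "actual" numbers are simply read off the sorted list of client-side durations; a minimal version of that calculation could look like the following (the nearest-rank method here is an assumption, the client may have computed it slightly differently).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// exactQuantile returns the q-th quantile of the observed durations (in ms)
// using the nearest-rank method on a sorted copy of all observations. This is
// exact, but it requires keeping every observation in memory.
func exactQuantile(q float64, durations []float64) float64 {
	sorted := append([]float64(nil), durations...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(q*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// Hypothetical client-side durations; the real test used 500 calls to generateText.
	durations := []float64{95, 120, 180, 260, 340, 410, 2050}
	fmt.Printf("p50 = %.0f ms, p90 = %.0f ms\n",
		exactQuantile(0.50, durations), exactQuantile(0.90, durations))
}
```

Keeping that full sorted list is what makes the "actual" numbers exact, and also why neither the summary nor the histogram can match them perfectly.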
Comparing the actual numbers and the summary metric (a summary is supposed to be more accurate than a histogram, with the major downside of not being aggregatable), we can see that even the summary is not exact, because there really is no way to be fully accurate other than maintaining a sorted list of all observations (which takes a lot of memory). It's always going to be a trade-off between space and accuracy.
Next, comparing the improved histogram against the summary, the results are quite reasonable, within 1-2% error, other than the last one. That's because our buckets are not granular enough in that area: after 2000, the next boundary is at 2500, so 2047 is the result of linear approximation within that bucket. If we were to define a new bucket at 2100 or 2250, for example, we would most likely get a more accurate estimate. But again, it's a trade-off between space and accuracy.
Scalability
You might ask, why don't we define more granular buckets? Well, the Prometheus best-practices documentation states that the maximum cardinality of a metric should be about 10 unique label/value pairs, because each label set is an additional time series that has RAM, CPU, disk, and network costs.
In our case, the cardinality of this metric is 49 x the number of functions. So assuming each user only has 2 functions on average, it's still within an order of magnitude of that guideline, so I think it should be fine.
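As a rough sanity check on that estimate (treating 49 as the per-function series count for this metric and ignoring anything else the exporter adds, which is a simplification on my part):

```go
package main

import "fmt"

func main() {
	const seriesPerFunction = 49 // per-function series count for this metric, from the layout above

	// The "2 functions on average" figure is the assumption from the text above;
	// the other counts just show how the series count scales.
	for _, functions := range []int{1, 2, 5, 10} {
		fmt.Printf("%2d functions -> %4d time series\n", functions, functions*seriesPerFunction)
	}
}
```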