
Idea: Make it possible to analyse all values of a label with the /api/v1/cardinality/label_values endpoint #10328

Open
LasseHels opened this issue Jan 2, 2025 · 9 comments
Labels
component/querier · enhancement (New feature or request)

Comments


LasseHels commented Jan 2, 2025

What is the problem you are trying to solve?

We are currently working to reduce the number of active metric series in the observability platform at Maersk. As part of this effort, we need a way to retrieve all metric names and, ideally, rank them by series count.

Knowing the series count of each metric is important as it allows us to do things like:

  1. Focusing on the highest cardinality metrics first.
  2. Having a lower cardinality threshold below which we ignore a metric (i.e., don't take action on metrics with fewer than 5,000 series).
  3. Building intelligence around how many series we're saving with various measures; answering questions like "how many series would we save by dropping this metric?" or "how many series have we saved in total with the measures implemented thus far?".

The GET,POST <prometheus-http-prefix>/api/v1/cardinality/label_values endpoint is a prime candidate for our use-case. The issue is that the endpoint has a maximum limit of 500. In other words, with this endpoint, we can only get the top 500 highest cardinality metrics of the platform.

This is an example of the request we're currently making:

curl --location --globoff 'localhost:8080/prometheus/api/v1/cardinality/label_values?label_names[]=__name__&count_method=active&limit=500' --header 'X-Scope-OrgID: fake'
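For illustration, here is a small client-side sketch of how we would rank metrics from this endpoint's response. The response shape (a `labels` list with `label_name` and `cardinality` entries) is an assumption based on the documented endpoint; verify the field names against your Mimir version before relying on it.

```python
# Sketch: rank metric names by active series count from an assumed
# /api/v1/cardinality/label_values response shape.

def rank_metrics(response: dict) -> list[tuple[str, int]]:
    """Return (metric_name, series_count) pairs, highest cardinality first."""
    ranked = []
    for label in response.get("labels", []):
        if label.get("label_name") != "__name__":
            continue
        for entry in label.get("cardinality", []):
            ranked.append((entry["label_value"], entry["series_count"]))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hand-written response in the assumed shape, standing in for the HTTP call:
mock = {
    "labels": [
        {
            "label_name": "__name__",
            "cardinality": [
                {"label_value": "http_requests_total", "series_count": 12000},
                {"label_value": "up", "series_count": 300},
                {"label_value": "node_cpu_seconds_total", "series_count": 45000},
            ],
        }
    ]
}
print(rank_metrics(mock))
```

With a list like this, the use-cases above (focus on the top of the list, skip the tail, sum up projected savings) become simple post-processing.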

Which solution do you envision (roughly)?

Initially, we suspected that the limit parameter was in place to protect server-side resources. However, we inspected the code of the endpoint, and it is our understanding that the endpoint always fetches all label values, and that the only thing the limit parameter does is control how many are returned (source).

We'd like for the limit parameter to have no maximum value:

  • Behaviour for all requests that don't specify the limit parameter is the same: limit will default to 20.
  • Behaviour for all requests that specify a limit smaller than 500 is the same.
  • Behaviour for all requests that specify a limit greater than 500 changes: they would previously fail with 'limit' param cannot be greater than '500' but will succeed after the change.
  • Clients who would like to analyse a larger set of label values can set limit to a greater value like 100000.

Have you considered any alternatives?

We've considered two alternative solutions:

  1. The label values endpoint can return all metric names: GET /prometheus/api/v1/label/__name__/values. The issue with this endpoint is that it doesn't return series count.
  2. The instant query endpoint is theoretically capable of returning all metric names and the series count for each name with this query: count by (__name__) ({__name__=~".+"}). In practice, the query is prohibitively expensive and fails almost immediately with a "the query exceeded the maximum number of chunks" error.

Any additional context to share?

We may also want to make the same change to the limit parameter of the GET,POST <prometheus-http-prefix>/api/v1/cardinality/label_names endpoint.

My soft opinion is that the two endpoints should implement the limit parameter identically, but I am not privy to the details of that endpoint.

How long do you think this would take to be developed?

Small (<= 1 month dev)

What are the documentation dependencies?

We'd need to update the documentation for the label cardinality endpoint: https://grafana.com/docs/mimir/v2.14.x/references/http-api/#label-values-cardinality

Proposer?

Lasse Hels

@LasseHels added the enhancement (New feature or request) label on Jan 2, 2025
@LasseHels (Contributor, Author)

I'd be happy to work on the implementation of this change with the blessing of the Mimir team.

@LasseHels (Contributor, Author)

We appreciate that the Mimir team is coming back from vacation and would be happy to get some input on this change request.

@LasseHels (Contributor, Author)

@narqo @dimitarvdimitrov @pracucci Would the Mimir team be able to take a look at this proposal? As mentioned, we are happy to do the implementation.

@56quarters (Contributor)

Hello @LasseHels,

I think adding functionality to get the rest of the results from the cardinality APIs is great but my preference would be to add an offset parameter to allow results to be paged through instead of making the limit unbounded. However, I haven't worked with this code very much so I don't feel strongly about it. cc @Logiraptor and @flxbk who I believe have more experience with it.
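The limit+offset approach suggested here could look like the following client-side sketch. Note that the `offset` query parameter is hypothetical: it does not exist in Mimir today, and `fetch_page` stands in for the HTTP call to the endpoint.

```python
# Sketch of paging through all results with a hypothetical offset parameter.

def paginate(fetch_page, page_size=500):
    """Yield every item by repeatedly requesting limit/offset pages."""
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # short page means no more results
        offset += page_size

# Demo against an in-memory "server" instead of a real endpoint:
data = [f"metric_{i}" for i in range(1203)]

def fake_fetch(limit, offset):
    return data[offset:offset + limit]

assert list(paginate(fake_fetch, page_size=500)) == data
```

The trade-off versus an unbounded limit is that each page request stays small, at the cost of consistency across pages if the underlying data changes between requests.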

@LasseHels (Contributor, Author)

Happy to hear an opinion from @Logiraptor or @flxbk.

@flxbk (Contributor) commented Jan 22, 2025

Have you considered using the active series endpoint to retrieve the information you need? That endpoint should let you answer questions like

"how many series would we save by dropping this metric?" or "how many series have we saved in total with the measures implemented thus far?"

and it should also be able to support a more detailed analysis of cardinality data than "just" at the metric level.

@LasseHels (Contributor, Author)

@flxbk We did try out the active series endpoint, but there are a couple of issues.

We consistently get response too large: try increasing the requested shard count errors, even with a relatively narrow selector and a Sharding-Control value of 1000000001. Our platform has about 300 million metric series in total. All of these series share a common set of labels (like k8s_cluster and product_id) that we could use to split requests on the client side, but even so, we still run into HTTP 413 errors.

It is our understanding that the active_series_results_max_size_bytes configuration option can be increased, but even if the endpoint could technically return the data, we would still be transmitting a lot of unnecessary bytes across the network, which we would then have to aggregate manually.

From my naive perspective, the /api/v1/cardinality/label_values endpoint seems like a better fit for our use-case.

Thoughts?

Footnotes

  1. I'm guessing that a sharding value this high is not sensible.

@flxbk (Contributor) commented Jan 22, 2025

I agree that if you don't need more granular data for your analysis than what cardinality/label_values can provide, sending all those bytes across the network isn't useful.

As to your original question, I think it's reasonable to remove the maximum value for the limit parameter. However, I think at the same time it would be nice to introduce a memory-based limit for limiting the size of the accumulated response to protect against querier OOMs.

@LasseHels (Contributor, Author)

@flxbk

As to your original question, I think it's reasonable to remove the maximum value for the limit parameter.

👍

However, I think at the same time it would be nice to introduce a memory-based limit for limiting the size of the accumulated response to protect against querier OOMs.

My understanding is that the LabelValuesCardinality() method always fetches all label values, regardless of the value of the limit parameter.

The limit parameter is only applied in toLabelValuesCardinalityResponse() after all the values have been fetched.

If this is correct, then I'm not sure I understand how raising or removing the maximum value of the limit parameter increases the risk of OOMs, as the endpoint has always fetched (but not necessarily returned) all label values.

Can you elaborate a bit on why you think a more generous limit parameter would increase the risk of OOMs?
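The fetch-then-truncate behaviour described above can be modelled with a minimal sketch (illustrative only, not Mimir's code): if all values are gathered before the limit is applied, peak memory is governed by the fetch, not by the limit parameter.

```python
# Minimal model of "fetch everything, then truncate to the limit".

def label_values_cardinality(store, limit):
    # Gather and sort every value first, regardless of limit...
    values = sorted(store.items(), key=lambda kv: kv[1], reverse=True)
    # ...then apply the limit only when shaping the response.
    return values[:limit]

store = {f"metric_{i}": i for i in range(10_000)}
small = label_values_cardinality(store, limit=20)
large = label_values_cardinality(store, limit=10_000)
# Both calls materialise the same 10,000-entry intermediate list;
# only the returned slice differs.
assert len(small) == 20 and len(large) == 10_000
```

Under this model, raising the limit changes only the response size sent to the client, which is flxbk's memory-based-limit concern rather than a querier-side fetch concern.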
