
Restrict recording rules to Cortex containers #283

Closed · wants to merge 1 commit

Conversation

pracucci (Collaborator)

What this PR does:
PR #278 added new recording rules for the scaling dashboard. In our infrastructure, the recording rules for cpu_usage and memory_usage are slow to evaluate (about 1m each) because they run for every container, not just the Cortex ones.

In this PR I propose running them only for Cortex containers. If this is not the right approach to fix it, what do you suggest?
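For illustration, the restriction amounts to adding a container matcher to the inner metric selector (the templated %(container_names_matcher)s form appears in an expression quoted later in this thread); the expanded container list below is an assumption, not the exact change in this PR:

    # Sketch only: limit the aggregation to Cortex containers via a container matcher.
    # The concrete container list is an illustrative assumption.
    sum by (cluster, namespace, deployment) (
      label_replace(
        node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{container=~"distributor|ingester|querier|query-frontend|ruler|store-gateway|compactor"},
        "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
      )
    )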

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci requested a review from tomwilkie March 29, 2021 11:22
@pracucci pracucci requested a review from a team as a code owner March 29, 2021 11:22
@tomwilkie (Contributor)

I found those kinda useful for other jobs too. We should probably move them elsewhere, but for now can we keep them?

If this is not the right approach to fix it, what do you suggest?

Make the rule interval >1m (maybe 5min?)
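For reference, the evaluation interval can be raised per rule group in a Prometheus rule file; the group name below is illustrative:

    groups:
      - name: cortex_scaling_rules   # illustrative group name
        interval: 5m                 # evaluate this group every 5m instead of the global default
        rules:
          # ... the cpu_usage / memory_usage recording rules go here ...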

@beorn7 (Contributor) commented Mar 29, 2021

Same as for the scrape interval, a rule interval of more than 2m gets into danger territory (i.e. it will easily collide with the lookback of 5m).

@beorn7 (Contributor) commented Mar 29, 2021

My general idea here would be the following: Wherever you have a subquery (a range with a : inside), take the inner query, turn it into its own recording rule, and then eliminate the subquery by using the recording rule with a normal range selector.
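A minimal sketch of that idea, with an assumed rule name; the recorded series can then be read with a plain range selector instead of re-evaluating the inner expression through a subquery:

    # Sketch only: record the expensive inner aggregation under an assumed name.
    - record: cluster_namespace_deployment:container_cpu_usage:sum
      expr: |
        sum by (cluster, namespace, deployment) (
          label_replace(
            node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate,
            "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
          )
        )
    # Queries that previously wrapped this expression in a [24h:5m] subquery can
    # then select the recorded series directly, e.g. over [24h].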

@beorn7 (Contributor) commented Mar 29, 2021

Now the alert also fired for cortex_slo_rules. The culprits there are the queries that span a very long range (3d etc.). I think we can play some tricks there, too; happy to help.

But let's first tackle the cortex_scaling_rules, and I think we should try what I suggested above. That might be an easy win.

@tomwilkie (Contributor)

Same as for the scrape interval, a rule interval of more than 2m gets into danger territory (i.e. it will easily collide with the lookback of 5m).

I didn't realise the lookback delta was so short now; I thought it was 15 minutes.

@beorn7 (Contributor) commented Mar 29, 2021

In Prometheus, the default has always been 5m, even before staleness handling. But perhaps Cortex has a different default?

@tomwilkie (Contributor)

Nah, sounds like I've just got it wrong.
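For reference, the lookback window being discussed is controlled in Prometheus by the flag below (default 5m); whether Cortex changes that default is left open in this thread:

    --query.lookback-delta=5m   # Prometheus default: instant selectors look back at most 5m for a sample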

@pracucci (Collaborator, Author)

I think we should try what I suggested above. That might be an easy win.

If I understand your suggestion correctly, we would create a recording rule for the following expression (and the same for the memory rule):

    sum by (cluster, namespace, deployment) (
      label_replace(
        node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate,
        "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
      )
    )

If my understanding is correct, that would reduce the cardinality in our infrastructure from 10467 to 6813, which could help but doesn't look dramatic. Am I missing anything?

[Screenshot: 2021-03-29 at 16:53:35]

@beorn7 (Contributor) commented Mar 29, 2021

So yes, I suggest using that recording rule, but the main win is not cardinality; it is the number of evaluations. With the recording rule, this expression:

    quantile_over_time(0.99,
      sum by (cluster, namespace, deployment) (
        label_replace(
          node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{%(container_names_matcher)s},
          "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
        )
      )[24h:5m]
    )

becomes this:

    quantile_over_time(0.99,
      your_new_rule_with_an_appropriate_name[24h:5m]
    )

The subquery results in 288 evaluations of the whole expensive sum above. With the recording rule in place, you instead evaluate a much simpler query 288 times. You could also get rid of the subquery altogether:

    quantile_over_time(0.99,
      your_new_rule_with_an_appropriate_name[24h]
    )

That's even more data for quantile_over_time to access, but it is just one evaluation. I'm not sure which is faster; we have to try it out. Perhaps this all won't make a dent, but I think it's worth a try.
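For rough scale, assuming (purely as an illustration) a 1m evaluation interval for the new recording rule:

    # [24h:5m] subquery: 24 * 60 / 5 = 288 points per series at the 5m subquery step
    # [24h] raw range:   24 * 60     = 1440 samples per series at an assumed 1m interval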

@pracucci (Collaborator, Author)

Closing in favour of #284.

@pracucci pracucci closed this Mar 30, 2021