
Restrict recording rules to Cortex containers #283

Closed · wants to merge 1 commit

Conversation

pracucci (Collaborator)

What this PR does:
PR #278 added new recording rules for the scaling dashboard. In our infrastructure, the recording rules for cpu_usage and memory_usage are slow to evaluate (about 1m each) because they run for every container, not just the Cortex ones.

In this PR I propose running them only for Cortex containers. If this is not the right approach to fix it, what do you suggest?
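For illustration, the restriction amounts to adding a container matcher to the inner metric selector (the templated %(container_names_matcher)s form appears in an expression quoted later in this thread); the expanded container list below is an assumption, not the exact change in this PR:

    # Sketch only: limit the aggregation to Cortex containers via a container matcher.
    # The concrete container list is an illustrative assumption.
    sum by (cluster, namespace, deployment) (
      label_replace(
        node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{container=~"distributor|ingester|querier|query-frontend|ruler|store-gateway|compactor"},
        "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
      )
    )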

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci requested a review from tomwilkie March 29, 2021 11:22
@pracucci pracucci requested a review from a team as a code owner March 29, 2021 11:22
@tomwilkie (Contributor)

I found those kinda useful for other jobs too. We should probably move them elsewhere, but for now can we keep them?

If this is not the right approach to fix it, what do you suggest?

Make the rule interval >1m (maybe 5min?)
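For reference, the evaluation interval can be raised per rule group in a Prometheus rule file; the group name below is illustrative:

    groups:
      - name: cortex_scaling_rules   # illustrative group name
        interval: 5m                 # evaluate this group every 5m instead of the global default
        rules:
          # ... the cpu_usage / memory_usage recording rules go here ...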

@beorn7 (Contributor) commented Mar 29, 2021

Same as for the scrape interval, a rule interval of more than 2m gets into danger territory (i.e. it will easily collide with the lookback of 5m).

@beorn7 (Contributor) commented Mar 29, 2021

My general idea here would be the following: Wherever you have a subquery (a range with a : inside), take the inner query, turn it into its own recording rule, and then eliminate the subquery by using the recording rule with a normal range selector.
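A minimal sketch of that idea, with an assumed rule name; the recorded series can then be read with a plain range selector instead of re-evaluating the inner expression through a subquery:

    # Sketch only: record the expensive inner aggregation under an assumed name.
    - record: cluster_namespace_deployment:container_cpu_usage:sum
      expr: |
        sum by (cluster, namespace, deployment) (
          label_replace(
            node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate,
            "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
          )
        )
    # Queries that previously wrapped this expression in a [24h:5m] subquery can
    # then select the recorded series directly, e.g. over [24h].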

@beorn7 (Contributor) commented Mar 29, 2021

Now the alert also fired for cortex_slo_rules. The culprits there are the queries that span a very long range (3d etc.). I think we can play some tricks there, too; happy to help.

But let's first tackle the cortex_scaling_rules, and I think we should try what I suggested above. That might be an easy win.

@tomwilkie (Contributor)

Same as for the scrape interval, a rule interval of more than 2m gets into danger territory (i.e. it will easily collide with the lookback of 5m).

I didn't realise the lookback delta was so short now; I thought it was 15 minutes.

@beorn7 (Contributor) commented Mar 29, 2021

In Prometheus, the default has always been 5m, even before staleness handling. But perhaps Cortex has a different default?

@tomwilkie (Contributor)

Nah, sounds like I've just got it wrong.
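For reference, the lookback window being discussed is controlled in Prometheus by the flag below (default 5m); whether Cortex changes that default is left open in this thread:

    --query.lookback-delta=5m   # Prometheus default: instant selectors look back at most 5m for a sample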

@pracucci (Collaborator, Author)

I think we should try what I suggested above. That might be an easy win.

If I understand your suggestion correctly, we would create a recording rule for the following expression (and the same for the memory rule):

    sum by (cluster, namespace, deployment) (
      label_replace(
        node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate,
        "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
      )
    )

If my understanding is correct, that would reduce the cardinality in our infrastructure from 10467 to 6813, which could help but doesn't look dramatic. Am I missing anything?

[Screenshot: 2021-03-29 at 16:53:35]

@beorn7 (Contributor) commented Mar 29, 2021

So yes, I suggest using that recording rule, but the main win is not cardinality; it is the number of evaluations. With the recording rule, this expression:

    quantile_over_time(0.99,
      sum by (cluster, namespace, deployment) (
        label_replace(
          node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{%(container_names_matcher)s},
          "deployment", "$1", "pod", "(.*)-(?:([0-9]+)|([a-z0-9]+)-([a-z0-9]+))"
        )
      )[24h:5m]
    )

becomes this:

    quantile_over_time(0.99,
      your_new_rule_with_an_appropriate_name[24h:5m]
    )

The subquery results in 288 evaluations of the whole expensive sum above. With the recording rule in place, you instead evaluate a much simpler query 288 times. You could also get rid of the subquery altogether:

    quantile_over_time(0.99,
      your_new_rule_with_an_appropriate_name[24h]
    )

That's even more data for quantile_over_time to access, but it is just one evaluation. I'm not sure which is faster; we have to try it out. Perhaps this all won't make a dent, but I think it's worth a try.
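For rough scale, assuming (purely as an illustration) a 1m evaluation interval for the new recording rule:

    # [24h:5m] subquery: 24 * 60 / 5 = 288 points per series at the 5m subquery step
    # [24h] raw range:   24 * 60     = 1440 samples per series at an assumed 1m interval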

@pracucci (Collaborator, Author)

Closing in favour of #284.

@pracucci pracucci closed this Mar 30, 2021