Restrict recording rules to Cortex containers #283
Signed-off-by: Marco Pracucci <[email protected]>
I found those kinda useful for other jobs too. We should probably move them elsewhere, but for now can we keep them?
Make the rule interval >1m (maybe 5min?)
Same as for the scrape interval, a rule interval of more than 2m gets into danger territory (i.e. it will easily collide with the lookback of 5m).
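To make the interval-versus-lookback concern concrete, here is a minimal Python sketch. It assumes a simplified model: an instant query sees a series only if a sample exists within the lookback window (5m, as stated above), and each failed or skipped rule evaluation simply leaves no sample behind. The function name and the model are illustrative assumptions, not Prometheus or Cortex code.

```python
LOOKBACK_SECONDS = 5 * 60  # default lookback discussed in this thread


def leaves_gap(interval_s: int, missed_evaluations: int,
               lookback_s: int = LOOKBACK_SECONDS) -> bool:
    """Return True if the spacing between two successful evaluations
    exceeds the lookback window, so the series would briefly vanish
    from instant queries (simplified model, see assumptions above)."""
    # With k consecutive misses, successive samples are (k + 1) intervals apart.
    spacing_s = interval_s * (missed_evaluations + 1)
    return spacing_s > lookback_s


for interval_s in (60, 120, 180, 300):
    print(interval_s, leaves_gap(interval_s, missed_evaluations=1))
```

Under this model a 1m or 2m interval survives one missed evaluation, whereas a 3m or 5m interval does not, which is one way to read the "more than 2m" warning.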
My general idea here would be the following: Wherever you have a subquery (a range with a |
Now the alert also fired for But let's first tackle the |
I didn't realise the lookback delta was so short now, I thought it was 15mins.
In Prometheus, the default has always been 5m, even before staleness handling. But perhaps Cortex has a different default?
Nah, sounds like I've just got it wrong.
If I understand your suggestion correctly, we would create a recording rule for the following (and the same for the memory rule):
If my understanding is correct, in our infrastructure that would reduce the cardinality from 10467 to 6813, which could help but doesn't look dramatic. Am I missing anything?
So yes, I suggest using that recording rule, but the main win is not cardinality; it is the number of evaluations. With the recording rule, this expression:
becomes this:
The subquery results in 288 evaluations of the whole expensive sum above. With the below, you evaluate a much simpler query 288 times. You could also get rid of the subquery altogether:
That's even more data to access for |
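For reference, the 288 figure above is consistent with a subquery covering 24h at a 5m resolution. Those two values are an assumption chosen to reproduce the number; the actual range and resolution are not shown in this thread. A minimal Python sketch of the arithmetic, with a hypothetical helper name:

```python
from datetime import timedelta


def subquery_steps(range_: timedelta, resolution: timedelta) -> int:
    """Approximate number of inner evaluations for a subquery of the
    form [range:resolution]. Step alignment can shift the exact count
    by one, so treat the result as an estimate."""
    return range_ // resolution


# Assumed values: 24h range, 5m resolution (not confirmed by the thread).
steps = subquery_steps(timedelta(hours=24), timedelta(minutes=5))
print(steps)
```

Each of those steps re-evaluates the inner expression, which is why replacing an expensive aggregation with a pre-recorded series reduces the work per rule evaluation.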
Closing in favour of #284.
What this PR does:
PR #278 added new recording rules for the scaling dashboard. In our infrastructure, the recording rules for cpu_usage and memory_usage are slow to evaluate (about 1m each) because they run for every container, not just Cortex ones.
In this PR I'm proposing to run them only for Cortex containers. If this is not the right approach to fix it, what do you suggest?
Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]