Alertmanager: Skip starting the Alertmanager for Grafana tenants unless they have a promoted, non-default configuration #10491
_, initSkipped := amInitSkipped[userID]
if !exists || initSkipped {
If the skipped Alertmanager was already running, I'm deleting its data and adding it to the map of Alertmanagers to stop.
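To illustrate the condition in the hunk above, here is a minimal sketch of how the set of Alertmanagers to stop could be computed. The function `toStop` and its three map parameters are hypothetical names made up for this example, not the actual Mimir code. It only mirrors the `!exists || initSkipped` check and does not cover deleting the tenant's data.

```go
package main

import "sort"

// toStop is a hypothetical helper (not part of Mimir). It returns the IDs of
// running Alertmanagers that should be stopped: those whose tenant no longer
// has a configuration, and those whose initialization is now being skipped.
func toStop(running, configured, initSkipped map[string]struct{}) []string {
	var stop []string
	for userID := range running {
		_, exists := configured[userID]
		_, skipped := initSkipped[userID]
		// Same condition as in the diff: stop when the config is gone
		// or when the tenant was marked as skipped.
		if !exists || skipped {
			stop = append(stop, userID)
		}
	}
	// Sort so the result is deterministic despite random map iteration order.
	sort.Strings(stop)
	return stop
}
```

A tenant that is still configured but ends up in the skipped set is treated the same way as a tenant whose configuration was removed.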
// If the Grafana configuration is either default, not promoted, or empty, use the Mimir configuration.
if !cfgs.Grafana.Promoted || cfgs.Grafana.Default || cfgs.Grafana.RawConfig == "" {
	level.Debug(am.logger).Log("msg", "using mimir config", "user", cfgs.Mimir.User)
	isGrafanaTenant := am.cfg.GrafanaAlertmanagerTenantSuffix != "" && strings.HasSuffix(cfgs.Mimir.User, am.cfg.GrafanaAlertmanagerTenantSuffix)
	return cfg, !isGrafanaTenant, nil
}
I decided to remove the granular debug log lines to shorten the function.
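For readers following the hunk above, the start/skip decision can be sketched as a standalone function. This is a simplified illustration, not the real code: `grafanaState`, `usable`, and `shouldStart` are hypothetical names, and the branch for a usable Grafana configuration (always start) is an assumption, since that part of the function is not shown in the diff.

```go
package main

import "strings"

// grafanaState is a hypothetical, trimmed-down stand-in for the Grafana
// configuration fields checked in the diff (Promoted, Default, RawConfig).
type grafanaState struct {
	Promoted  bool
	Default   bool
	RawConfig string
}

// usable reports whether the Grafana configuration is promoted,
// non-default, and non-empty.
func (g grafanaState) usable() bool {
	return g.Promoted && !g.Default && g.RawConfig != ""
}

// shouldStart reports whether an Alertmanager should be started for the tenant.
// Assumption: a usable Grafana configuration always starts the Alertmanager.
// Otherwise the Mimir configuration is used, and the Alertmanager is skipped
// only if the tenant ID ends with the configured, non-empty suffix.
func shouldStart(tenantID, suffix string, g grafanaState) bool {
	if g.usable() {
		return true
	}
	isGrafanaTenant := suffix != "" && strings.HasSuffix(tenantID, suffix)
	return !isGrafanaTenant
}
```

With an empty suffix, `shouldStart` returns true for every tenant, which matches the `!= ""` guard in the diff: the skip only applies when a suffix is configured.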
cmd/mimir/help-all.txt.tmpl
Outdated
@@ -239,6 +239,8 @@ Usage of ./cmd/mimir/mimir:
    	Enables periodic cleanup of alertmanager stateful data (notification logs and silences) from object storage. When enabled, data is removed for any tenant that does not have a configuration. (default true)
  -alertmanager.grafana-alertmanager-compatibility-enabled
    	[experimental] Enable routes to support the migration and operation of the Grafana Alertmanager.
  -alertmanager.grafana-tenant-suffix string
Just a super nitpick, but the flag name is a bit generic IMO. Feels like it's missing an action there, e.g. skip-init-for-grafana-tenant-suffix. WDYT?
🚀
LGTM 🚀
🚀
Description
We currently run an Alertmanager for each tenant in the database, no matter whether their configurations are usable or not. This results in many unused Alertmanagers consuming resources.
This PR adds the option to skip initializing Alertmanagers for Grafana Alertmanager tenants without a usable Grafana Alertmanager configuration.
The new -alertmanager.grafana-alertmanager-conditionally-skip-tenant-suffix argument takes a string that is compared against the end of each tenant ID. If the tenant ID ends with that suffix, the Alertmanager is only initialized if there's a promoted, non-default, non-empty Grafana Alertmanager configuration for that tenant.
Testing locally
I tested this by spinning up two Alertmanagers:
mimir-1: built from main
mimir-1-new: built from this branch
I added 600 tenants to each one.
Both Alertmanagers owned 600 tenants, but only 200 of them were active in the Alertmanager built from this branch.
Tenants owned per AM.
Tenant config reloads per Alertmanager.
The memory footprint was reduced by roughly half in the Alertmanager filtering out the Grafana Alertmanager tenants without a usable configuration.
avg_over_time(go_memstats_heap_inuse_bytes{job=~"mimir-1.*"}[1m])
CPU time in the new Alertmanager decreased by roughly 45%, and goroutines by roughly 55%.
sum by (job) (rate(process_cpu_seconds_total{job=~"mimir-1.*"}[5m]))
go_goroutines{job=~"mimir-1.*"}